
Quantization: Running Large Models on Consumer GPUs

Quantization is the technique that makes it possible to run a 24-billion-parameter model on a single RTX 4090. Without it, Mistral-Small-24B would need ~48GB of VRAM — double what even the highest-end consumer GPU provides.

This page explains what quantization is, compares the major formats, and covers why Zorac defaults to AWQ with Marlin kernels.


Why Quantization Matters

A neural network's "parameters" are just numbers — weights that define what the model has learned. In their original training format, these weights are stored as 16-bit floating point (FP16/BF16) values. Each parameter takes 2 bytes of memory.

For a 24B parameter model:

24,000,000,000 parameters × 2 bytes = 48 GB

An RTX 4090 has 24GB of VRAM. That's only half of what we need — and we haven't even accounted for the KV cache (the memory used to track conversation context during inference).

Quantization reduces the precision of these weights — from 16 bits down to 8, 4, or even 2 bits per weight — shrinking memory requirements proportionally:

Precision          Bytes/Param   24B Model Size   Fits in 24GB?
FP16 (original)    2.0           ~48 GB           No
INT8               1.0           ~24 GB           Barely
INT4 (AWQ/GPTQ)    0.5           ~12 GB           Yes, with room for KV cache

At 4-bit precision, our 24B model needs only ~12GB of VRAM for the weights themselves (published AWQ checkpoints land closer to ~14GB once quantization scales and higher-precision tensors are counted), leaving the rest of the card for the KV cache (context window) and overhead. That's the sweet spot.
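
The arithmetic is worth scripting once. A minimal sketch, counting weights only (real checkpoints add quantization scales and other overhead; the function name is invented for the demo):

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weights-only footprint; ignores KV cache, scales, and overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(24e9, bits):.0f} GB")
# prints: FP16: 48 GB, INT8: 24 GB, INT4: 12 GB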


How Quantization Works (Simplified)

Think of quantization like reducing the color depth of an image. A 24-bit color image can represent 16.7 million colors. Reduce it to 8-bit and you get 256 colors — the image looks nearly identical for most content, but takes a fraction of the space.

Similarly, neural network weights don't need full FP16 precision. Most weights cluster in a narrow range, and the model's behavior is robust to small rounding errors.
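
To make the rounding concrete, here is a toy sketch of the naive baseline: symmetric round-to-nearest 4-bit quantization with one scale per output row (names are invented for the demo):

import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest: one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.random.default_rng(0).normal(scale=0.02, size=(4, 64))
q, scale = quantize_rtn(w)
print(f"mean |rounding error|: {np.abs(w - q * scale).mean():.6f}")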

The key insight of modern quantization methods is that not all weights are equally important. Some weights have an outsized impact on the model's output — quantizing those carelessly causes noticeable quality loss. Smart quantization methods identify and protect these critical weights.


Quantization Formats Compared

AWQ (Activation-aware Weight Quantization)

AWQ identifies the most important weights by looking at activation patterns — how the weights interact with typical inputs. Weights connected to frequently-activated neurons are kept at higher precision (or scaled to preserve their effect), while less critical weights are more aggressively quantized.
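
A rough illustration of that insight (a toy sketch of the core idea, not the published AWQ algorithm; function names and calibration setup are invented): scale up the weight columns that see large activations before group-wise rounding, then fold the scale back out. Rounding error on the salient channels shrinks, at the cost of slightly larger error elsewhere.

import numpy as np

def quantize_groupwise(w, bits=4, group=32):
    """Round-to-nearest quantization with one scale per weight group."""
    qmax = 2 ** (bits - 1) - 1
    d_out, d_in = w.shape
    wg = w.reshape(d_out, d_in // group, group)
    scale = np.abs(wg).max(axis=2, keepdims=True) / qmax
    q = np.clip(np.round(wg / scale), -qmax - 1, qmax)
    return (q * scale).reshape(d_out, d_in)        # dequantized weights

def awq_style(w, x_calib, alpha=0.5, **kw):
    """Scale salient input channels up before rounding, fold scale back out."""
    s = np.abs(x_calib).mean(axis=0) ** alpha      # per-channel importance
    s /= s.mean()
    return quantize_groupwise(w * s, **kw) / s

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(16, 128))
x = rng.normal(size=(512, 128))
x[:, :4] *= 10                                     # four "salient" channels
ref = x @ w.T
for name, wq in [("plain RTN", quantize_groupwise(w)),
                 ("AWQ-style", awq_style(w, x))]:
    print(name, np.abs(ref - x @ wq.T).mean())     # AWQ-style typically lower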

Strengths:

  • Excellent quality preservation — nearly indistinguishable from FP16 on benchmarks
  • Fast inference with GPU-optimized kernels (especially Marlin)
  • Well-supported by vLLM and other inference servers
  • Smaller file sizes than GPTQ for equivalent quality

Best for: GPU inference with vLLM, TGI, or similar servers.

GPTQ (Generative Pre-trained Transformer Quantization)

GPTQ uses a one-shot weight quantization method based on approximate second-order information. It processes weights layer by layer, adjusting remaining weights to compensate for quantization error.

Strengths:

  • Mature format with wide tool support
  • Good quality at 4-bit precision
  • Compatible with most inference frameworks

Weaknesses compared to AWQ:

  • Slightly slower inference on modern GPUs
  • Larger file sizes for equivalent quality
  • Less efficient kernel implementations

Best for: Broad compatibility when AWQ isn't available.

GGUF (GPT-Generated Unified Format)

GGUF is the native format for llama.cpp and is designed for CPU and mixed CPU/GPU inference. It supports a wide range of quantization levels (Q2 through Q8) and can split model layers between CPU RAM and GPU VRAM.
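
For example, with the llama-cpp-python bindings that split is a single parameter. A minimal sketch (the model path and layer count here are illustrative, not Zorac settings):

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="mistral-small-24b-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=20,   # offload 20 layers to VRAM, keep the rest in CPU RAM
    n_ctx=8192,        # context window
)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])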

Strengths:

  • Runs on CPU (no GPU required)
  • Flexible memory placement (CPU + GPU hybrid)
  • Wide range of quantization levels
  • Great for Mac M-series (via Metal) and systems without NVIDIA GPUs

Weaknesses for our use case:

  • Significantly slower on NVIDIA GPUs compared to AWQ/GPTQ with optimized kernels
  • vLLM's GGUF support is limited and not optimized for throughput
  • Real-world testing shows ~6 tok/s vs ~60 tok/s with AWQ + Marlin on an RTX 4090

Best for: CPU inference, Mac M-series, systems without dedicated NVIDIA GPUs.


Format Comparison at a Glance

Feature                 AWQ                   GPTQ          GGUF
Target hardware         NVIDIA GPU            NVIDIA GPU    CPU / Mixed
vLLM performance        60-65 tok/s           45-55 tok/s   ~6 tok/s
Quality at 4-bit        Excellent             Good          Good
Kernel optimization     Marlin (Ada/Ampere)   ExLlama v2    llama.cpp
File size (24B model)   ~14 GB                ~15 GB        ~14 GB (Q4)
Setup complexity        Low                   Low           Medium

Why Zorac Defaults to AWQ + Marlin

Zorac uses stelterlab/Mistral-Small-24B-Instruct-2501-AWQ as its default model. Here's why:

1. Performance: Marlin Kernels

The awq_marlin quantization backend in vLLM uses Marlin kernels — GPU kernels specifically optimized for 4-bit matrix multiplication on NVIDIA Ada Lovelace (RTX 40-series) and Ampere (RTX 30-series) architectures.

The performance difference is dramatic:

Standard AWQ kernel:   ~6 tok/s      (unusable for interactive chat)
AWQ + Marlin kernel:   ~60-65 tok/s  (smooth, responsive conversation)

This 10x speedup comes from Marlin's optimized memory access patterns and CUDA kernel design. It's the single most important configuration flag in the vLLM setup:

vllm serve "model-name" --quantization awq_marlin

Don't forget the _marlin suffix

Using --quantization awq (without _marlin) falls back to the generic AWQ kernel, which is dramatically slower. Always use awq_marlin on RTX 30/40-series GPUs.
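
The same flag carries over to vLLM's offline Python API. A sketch mirroring the settings discussed on this page:

from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq_marlin",      # the flag that enables Marlin kernels
    gpu_memory_utilization=0.85,    # leave headroom for CUDA overhead
    max_model_len=16384,            # the 16K context budgeted below
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)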

2. Quality: Activation-Aware Preservation

AWQ's activation-aware approach preserves model quality better than naive quantization. For Mistral-Small-24B, the AWQ version performs within 1-2% of the FP16 original on standard benchmarks — a difference that's imperceptible in conversational use.

3. Memory Budget

At 4-bit precision, the model occupies ~14GB of the RTX 4090's 24GB VRAM:

Model weights:     ~14 GB
KV cache:          ~8 GB  (with gpu-memory-utilization=0.85)
CUDA overhead:     ~2 GB
─────────────────────────
Total:             ~24 GB

This leaves enough KV cache memory for a 16,384-token context window — sufficient for extended conversations, with Zorac's automatic summarization keeping things within budget.
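
You can sanity-check that budget with the standard KV-cache formula (2 tensors × layers × KV heads × head dim × bytes × tokens). The architecture numbers below are illustrative assumptions, not confirmed Mistral-Small-24B values:

def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """FP16 keys + values across all layers, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens / 1e9

# Assumed GQA shape: 40 layers, 8 KV heads, head_dim 128.
print(f"{kv_cache_gb(16_384, 40, 8, 128):.1f} GB")   # ~2.7 GB at 16K tokens

Under these assumptions a single 16K-token sequence needs well under the ~8 GB reserved above; the surplus is what lets vLLM batch several sequences at once.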


When to Choose Each Format

Use this decision framework when selecting a quantization format for your own projects:

graph TD
    A[What hardware?] -->|NVIDIA GPU| B[Need max throughput?]
    A -->|CPU / Mac M-series| C[Use GGUF]
    B -->|Yes| D[Use AWQ + Marlin]
    B -->|Compatibility concerns| E[Use GPTQ]
    D --> F[vLLM with --quantization awq_marlin]
    E --> G[vLLM or TGI with GPTQ model]
    C --> H[llama.cpp or Ollama]

Choose AWQ when:

  • You have an NVIDIA RTX 30/40-series GPU
  • You're using vLLM as your inference server
  • Interactive performance matters (chat, coding assistants)

Choose GPTQ when:

  • An AWQ version of your model doesn't exist
  • You need compatibility with a specific inference framework
  • You're using the ExLlama v2 backend

Choose GGUF when:

  • You don't have an NVIDIA GPU
  • You want CPU or mixed CPU/GPU inference
  • You're running on Mac with Apple Silicon
  • You're using Ollama or llama.cpp
