Quantization: GGUF, AWQ & EXL2

Model compression formats, precision trade-offs, and running 70B models on consumer hardware.

You Don't Need a $40,000 GPU Cluster

Quantization Precision Levels

  • FP32: 4 bytes/param → 280GB for 70B ❌ Too large
  • FP16/BF16: 2 bytes/param → 140GB ⚠️ Standard
  • INT8: 1 byte/param → 70GB ✅ ~1% quality loss
  • INT4 (GPTQ): 0.5 bytes/param → 35GB 🚀 Fits one 40GB+ GPU

A Llama 3 70B model in full BF16 precision requires 140GB of VRAM. No single consumer GPU has that. Quantization compresses model weights from 16-bit floating point to 4-bit or 8-bit integers, reducing memory requirements by 4-8x while retaining 95%+ of model quality.
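The arithmetic is just parameter count times bytes per parameter. A quick sanity check (weights only, ignoring KV cache and activation memory):

```python
# Weights-only memory footprint: params x bytes/param
# (ignores KV cache and activations, which add more on top)
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
PARAMS = 70e9  # Llama 3 70B

footprint_gb = {fmt: PARAMS * b / 1e9 for fmt, b in BYTES_PER_PARAM.items()}
for fmt, gb in footprint_gb.items():
    print(f"{fmt:>10}: {gb:5.0f} GB")
# FP16/BF16 gives the 140GB cited above; INT4 lands at 35 GB
```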

Quantization Format Comparison

  • GGUF (llama.cpp): CPU-first inference. Runs on a MacBook M3 or a Linux server without a GPU. Slow (5-10 tokens/sec) but free. Best for developers who need local prototyping.
  • AWQ (Activation-aware Weight Quantization): GPU inference, 4-bit, minimal quality loss. Measures activation magnitudes first to protect salient weights during compression. Preferred for vLLM deployment.
  • EXL2 (ExLlamaV2): Mixed-precision quantization assigning different bit widths per-layer based on perplexity sensitivity. Achieves the best quality-at-size trade-off across all formats. Used when quality is critical.
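The activation-aware idea behind AWQ can be sketched in a few lines of NumPy. This is a toy illustration, not the real algorithm (actual AWQ searches for per-channel scales over calibration data and typically quantizes in 128-channel groups): scaling up a salient-but-small weight channel before round-to-nearest quantization, then folding the inverse scale back, gives that channel finer effective resolution.

```python
import numpy as np

rng = np.random.default_rng(0)

GROUP = 16  # channels per quantization group (AWQ typically uses 128)
QMAX = 7    # symmetric int4: integer levels in [-7, 7]

def quantize_group(w):
    """Round-to-nearest int4 with one scale per group of input channels."""
    out = np.empty_like(w)
    for g in range(0, w.shape[1], GROUP):
        block = w[:, g:g + GROUP]
        scale = np.abs(block).max() / QMAX
        out[:, g:g + GROUP] = np.round(block / scale).clip(-QMAX, QMAX) * scale
    return out

d_out, d_in = 32, 16
W = rng.uniform(-1, 1, (d_out, d_in))
W[:, 0] *= 0.3                  # salient channel: small weights...
x = rng.normal(0, 1, (512, d_in))
x[:, 0] *= 10                   # ...but large activations feed it

y_ref = x @ W.T

# Naive round-to-nearest: the salient channel's small weights get few levels
err_rtn = np.linalg.norm(y_ref - x @ quantize_group(W).T)

# Activation-aware: scale the salient channel up before quantizing, fold the
# inverse scale back after. Mathematically a no-op, but the salient weights
# now use more of the group's quantization range, so their error shrinks.
s = np.ones(d_in)
s[0] = 2.0
err_awq = np.linalg.norm(y_ref - x @ (quantize_group(W * s) / s).T)

print(f"RTN output error: {err_rtn:.2f}, activation-aware error: {err_awq:.2f}")
```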

The Quantization-Quality Trade-off

Q4_K_M (4-bit) reduces a 70B model from 140GB → ~40GB. Quality loss is typically <1% on standard benchmarks. Q2_K (2-bit) shrinks it to ~20GB but costs 5-8% quality. There is a hard floor: below Q4, you begin to see significant instruction-following degradation on complex reasoning tasks.
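The floor is visible even with naive per-tensor round-to-nearest quantization. This is cruder than the group-wise K-quant schemes GGUF actually uses, so the absolute numbers differ, but the trend is the same: reconstruction error grows sharply as bit width drops, and 2-bit per-tensor quantization collapses almost entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 1, (4096, 256))  # stand-in for one weight matrix

def rtn_rel_error(w, bits):
    """Relative error of symmetric per-tensor round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1          # 7 levels per side at 4-bit, 1 at 2-bit
    scale = np.abs(w).max() / qmax
    wq = np.round(w / scale).clip(-qmax, qmax) * scale
    return np.linalg.norm(w - wq) / np.linalg.norm(w)

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit: relative weight error {rtn_rel_error(W, bits):.3f}")
# Note: at 2-bit, per-tensor RTN rounds most weights to zero, which is
# why real 2-bit schemes like Q2_K rely on small groups with local scales.
```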

Code Example

Three quantization approaches for different scenarios: GGUF for local development without GPU, AWQ for production vLLM deployment, EXL2 for maximum quality on a single high-end GPU.

```python
# Option 1: GGUF on CPU (ollama or llama.cpp)
# Terminal: ollama run llama3:70b-instruct-q4_K_M
import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:70b-instruct-q4_K_M",
    "prompt": "Explain KV cache in 2 sentences.",
    "stream": False,
})
print(response.json()["response"])

# Option 2: AWQ on GPU with vLLM (preferred for production)
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",  # Pre-quantized AWQ model
    quantization="awq",
    gpu_memory_utilization=0.85,
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache in 2 sentences."], params)
print(outputs[0].outputs[0].text)

# Option 3: Run EXL2 via ExLlamaV2 for maximum quality
# pip install exllamav2
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/llama3-70b-5.0bpw-exl2"  # 5 bits per weight
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()  # then stream via generator.begin_stream(...)

# A 70B model at 5.0bpw is ~44GB of weights alone, so budget ~48GB of VRAM
# (e.g. 2x RTX 4090 or one 48GB card); expect roughly 25-35 tokens/sec
```

Use Cases

  • Local development with Llama 70B on a Mac Studio M2 Ultra using GGUF
  • Production serving with 4x quality-per-dollar improvement using AWQ on A100s
  • Offline inference on edge devices or air-gapped environments

Common Mistakes

  • Using Q2 or Q3 quantization for instruction-following tasks — the model loses coherent reasoning ability below Q4
  • Mixing quantization formats (AWQ model with GPTQ-expecting server) — always verify the quantization format matches your inference server
  • Not benchmarking quantized model quality on your specific task before deploying — synthetic benchmarks don't always match domain performance

Interview Insight

Relevance

High - Essential for cost-efficient model deployment.
