Quantization: GGUF, AWQ & EXL2
Model compression formats, precision trade-offs, and running 70B models on consumer hardware.
You Don't Need a $40,000 GPU Cluster
Quantization Precision Levels
A Llama 3 70B model in full BF16 precision requires 140GB of VRAM for the weights alone. No single consumer GPU has that. Quantization compresses model weights from 16-bit floating point to 8-bit or 4-bit integers, cutting memory requirements by 2-4x while typically retaining 95%+ of model quality.
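The arithmetic behind these figures is easy to sanity-check: weight memory is parameter count times bits per weight, divided by 8 to get bytes. A minimal sketch (the helper function is illustrative, not from any library):

```python
def vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (ignores KV cache and activations)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(vram_gb(70, 16))  # 140.0  -- BF16, as quoted above
print(vram_gb(70, 4))   # 35.0   -- pure 4-bit (real formats add scale overhead)
```

Real quantized files come out slightly larger than the pure-bit-width estimate because each block of weights also stores scale (and sometimes zero-point) metadata.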
Quantization Format Comparison
- GGUF (llama.cpp): CPU-first inference, with optional GPU layer offload. Runs on a MacBook M3 or a Linux server without a GPU. Slow (5-10 tokens/sec for a 70B model) but free. Best for developers who need local prototyping.
- AWQ (Activation-aware Weight Quantization): GPU inference, 4-bit, minimal quality loss. Measures activation magnitudes first to protect salient weights during compression. Preferred for vLLM deployment.
- EXL2 (ExLlamaV2): Mixed-precision quantization assigning different bit widths per-layer based on perplexity sensitivity. Achieves the best quality-at-size trade-off across all formats. Used when quality is critical.
The Quantization-Quality Trade-off
Q4_K_M (4-bit) shrinks a 70B model from 140GB to roughly 40GB, with quality loss typically under 1% on benchmarks. Q2_K (2-bit) gets down to about 20GB but loses 5-8% quality. There is a practical floor: below 4-bit, you begin to see significant instruction-following degradation on complex reasoning tasks.
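The scheme names map to effective bits per weight, so the sizes above can be reproduced with one line of arithmetic. The bpw figures below are approximations (K-quant blocks store scale metadata, so e.g. Q4_K_M lands near 4.85 bpw and Q2_K near 2.6 bpw; exact values vary by model):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # 1e9 params * (bits / 8) bytes each, expressed in GB
    return params_billion * bits_per_weight / 8

for name, bpw in [("BF16", 16), ("Q4_K_M", 4.85), ("Q2_K", 2.6)]:
    print(f"{name}: {quantized_size_gb(70, bpw):.0f} GB")
```

Running this for a 70B model gives roughly 140, 42, and 23 GB, matching the figures quoted above once overhead is included.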
Code Example
Three quantization approaches for different scenarios: GGUF for local development without GPU, AWQ for production vLLM deployment, EXL2 for maximum quality on a single high-end GPU.
# Option 1: GGUF on CPU (ollama or llama.cpp)
# Terminal: ollama run llama3:70b-instruct-q4_K_M
import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:70b-instruct-q4_K_M",
    "prompt": "Explain KV cache in 2 sentences.",
    "stream": False
})
print(response.json()["response"])

# Option 2: AWQ on GPU with vLLM (preferred for production)
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",  # Pre-quantized AWQ model
    quantization="awq",
    gpu_memory_utilization=0.85
)
outputs = llm.generate(
    ["Explain KV cache in 2 sentences."],
    SamplingParams(temperature=0.7, max_tokens=128)
)
print(outputs[0].outputs[0].text)

# Option 3: Run EXL2 via ExLlamaV2 for maximum quality
# pip install exllamav2
from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
)
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/llama3-70b-5.0bpw-exl2"  # 5 bits per weight
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # Split layers across available GPU memory

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
# EXL2 gives 25-35 tokens/sec on a single RTX 4090 for a 70B model
Use Cases
Common Mistakes
Interview Insight
Relevance
High - Essential for cost-efficient model deployment.