Quantization: GGUF, AWQ & EXL2

Model compression formats, precision trade-offs, and running 70B models on consumer hardware.

You Don't Need a $40,000 GPU Cluster

Quantization Precision Levels

  • FP32: 4 bytes/param → 280GB for 70B ❌ Too large
  • FP16/BF16: 2 bytes/param → 140GB ⚠️ Standard
  • INT8: 1 byte/param → 70GB ✅ ~1% quality loss
  • INT4 (GPTQ): 0.5 bytes/param → 35GB 🚀 Fits one 40GB+ GPU

A Llama 3 70B model in full BF16 precision requires 140GB of VRAM. No single consumer GPU has that. Quantization compresses model weights from 16-bit floating point to 4-bit or 8-bit integers, reducing memory requirements by 4-8x while retaining 95%+ of model quality.
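The arithmetic is just parameter count times bytes per parameter. A quick sanity check (weights only, ignoring KV cache and activation memory):

```python
# Weights-only memory footprint: params x bytes/param
# (ignores KV cache and activations, which add more on top)
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}
PARAMS = 70e9  # Llama 3 70B

footprint_gb = {fmt: PARAMS * b / 1e9 for fmt, b in BYTES_PER_PARAM.items()}
for fmt, gb in footprint_gb.items():
    print(f"{fmt:>10}: {gb:5.0f} GB")
# FP16/BF16 gives the 140GB cited above; INT4 lands at 35 GB
```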

Quantization Format Comparison

  • GGUF (llama.cpp): CPU-first inference. Runs on a MacBook M3 or a Linux server without a GPU. Slow (5-10 tokens/sec) but free. Best for developers who need local prototyping.
  • AWQ (Activation-aware Weight Quantization): GPU inference, 4-bit, minimal quality loss. Measures activation magnitudes first to protect salient weights during compression. Preferred for vLLM deployment.
  • EXL2 (ExLlamaV2): Mixed-precision quantization assigning different bit widths per-layer based on perplexity sensitivity. Achieves the best quality-at-size trade-off across all formats. Used when quality is critical.
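The activation-aware idea behind AWQ can be sketched in a few lines of NumPy. This is a toy illustration, not the real algorithm (actual AWQ searches for per-channel scales over calibration data and typically quantizes in 128-channel groups): scaling up a salient-but-small weight channel before round-to-nearest quantization, then folding the inverse scale back, gives that channel finer effective resolution.

```python
import numpy as np

rng = np.random.default_rng(0)

GROUP = 16  # channels per quantization group (AWQ typically uses 128)
QMAX = 7    # symmetric int4: integer levels in [-7, 7]

def quantize_group(w):
    """Round-to-nearest int4 with one scale per group of input channels."""
    out = np.empty_like(w)
    for g in range(0, w.shape[1], GROUP):
        block = w[:, g:g + GROUP]
        scale = np.abs(block).max() / QMAX
        out[:, g:g + GROUP] = np.round(block / scale).clip(-QMAX, QMAX) * scale
    return out

d_out, d_in = 32, 16
W = rng.uniform(-1, 1, (d_out, d_in))
W[:, 0] *= 0.3                  # salient channel: small weights...
x = rng.normal(0, 1, (512, d_in))
x[:, 0] *= 10                   # ...but large activations feed it

y_ref = x @ W.T

# Naive round-to-nearest: the salient channel's small weights get few levels
err_rtn = np.linalg.norm(y_ref - x @ quantize_group(W).T)

# Activation-aware: scale the salient channel up before quantizing, fold the
# inverse scale back after. Mathematically a no-op, but the salient weights
# now use more of the group's quantization range, so their error shrinks.
s = np.ones(d_in)
s[0] = 2.0
err_awq = np.linalg.norm(y_ref - x @ (quantize_group(W * s) / s).T)

print(f"RTN output error: {err_rtn:.2f}, activation-aware error: {err_awq:.2f}")
```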

The Quantization-Quality Trade-off

Q4_K_M (4-bit) reduces a 70B model from 140GB → ~40GB. Quality loss is typically <1% on standard benchmarks. Q2_K (2-bit) shrinks it to ~20GB but costs 5-8% quality. There is a hard floor: below Q4, you begin to see significant instruction-following degradation on complex reasoning tasks.
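The floor is visible even with naive per-tensor round-to-nearest quantization. This is cruder than the group-wise K-quant schemes GGUF actually uses, so the absolute numbers differ, but the trend is the same: reconstruction error grows sharply as bit width drops, and 2-bit per-tensor quantization collapses almost entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 1, (4096, 256))  # stand-in for one weight matrix

def rtn_rel_error(w, bits):
    """Relative error of symmetric per-tensor round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1          # 7 levels per side at 4-bit, 1 at 2-bit
    scale = np.abs(w).max() / qmax
    wq = np.round(w / scale).clip(-qmax, qmax) * scale
    return np.linalg.norm(w - wq) / np.linalg.norm(w)

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit: relative weight error {rtn_rel_error(W, bits):.3f}")
# Note: at 2-bit, per-tensor RTN rounds most weights to zero, which is
# why real 2-bit schemes like Q2_K rely on small groups with local scales.
```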

Code Example

Three quantization approaches for different scenarios: GGUF for local development without GPU, AWQ for production vLLM deployment, EXL2 for maximum quality on a single high-end GPU.

```python
# Option 1: GGUF on CPU (ollama or llama.cpp)
# Terminal: ollama run llama3:70b-instruct-q4_K_M
import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:70b-instruct-q4_K_M",
    "prompt": "Explain KV cache in 2 sentences.",
    "stream": False,
})
print(response.json()["response"])

# Option 2: AWQ on GPU with vLLM (preferred for production)
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",  # Pre-quantized AWQ model
    quantization="awq",
    gpu_memory_utilization=0.85,
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache in 2 sentences."], params)
print(outputs[0].outputs[0].text)

# Option 3: Run EXL2 via ExLlamaV2 for maximum quality
# pip install exllamav2
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/llama3-70b-5.0bpw-exl2"  # 5 bits per weight
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()  # then stream via generator.begin_stream(...)

# A 70B model at 5.0bpw is ~44GB of weights alone, so budget ~48GB of VRAM
# (e.g. 2x RTX 4090 or one 48GB card); expect roughly 25-35 tokens/sec
```

Use Cases

  • Local development with Llama 70B on a Mac Studio M2 Ultra using GGUF
  • Production serving with 4x quality-per-dollar improvement using AWQ on A100s
  • Offline inference on edge devices or air-gapped environments

Common Mistakes

  • Using Q2 or Q3 quantization for instruction-following tasks — the model loses coherent reasoning ability below Q4
  • Mixing quantization formats (AWQ model with GPTQ-expecting server) — always verify the quantization format matches your inference server
  • Not benchmarking quantized model quality on your specific task before deploying — synthetic benchmarks don't always match domain performance

Interview Insight

Relevance

High - Essential for cost-efficient model deployment.
