vLLM & PagedAttention

Self-hosting open-source LLMs, continuous batching, and CUDA memory management.

Why $0.01/1K Tokens Kills Your Margin

vLLM PagedAttention Architecture

[Architecture diagram] Traditional contiguous KV cache: memory wasted to fragmentation. vLLM paged KV cache: no waste, thanks to virtual-memory-style pages. Continuous batching lets new requests fill freed slots dynamically, yielding up to 24x throughput; vLLM can serve thousands of concurrent users on a single GPU.

At scale, OpenAI API costs become your biggest infrastructure expense. A team processing 500M tokens/day at $0.01/1K tokens spends roughly $150,000/month. At that volume, self-hosting an open-source model (such as Llama 3 70B) on a cluster of H100 GPUs via vLLM is almost always cheaper and gives you full control over the model, data, and throughput.
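The break-even math above can be sketched in a few lines. The token volume, the $0.01/1K API rate, and the H100 rental price below are illustrative assumptions, not quotes:

```python
# Back-of-envelope: hosted API vs. self-hosted GPUs (all prices assumed).
API_PRICE_PER_1K = 0.01          # $/1K tokens (hosted API rate)
TOKENS_PER_DAY = 500_000_000     # assumed volume: 500M tokens/day

api_monthly = TOKENS_PER_DAY / 1_000 * API_PRICE_PER_1K * 30

H100_HOURLY = 3.00               # assumed rental price, $/GPU-hour
NUM_GPUS = 8                     # e.g., one 8x H100 node for Llama 3 70B
self_hosted_monthly = H100_HOURLY * NUM_GPUS * 24 * 30

print(f"Hosted API:  ${api_monthly:,.0f}/month")       # ~ $150,000/month
print(f"Self-hosted: ${self_hosted_monthly:,.0f}/month")  # ~ $17,280/month
```

The exact crossover point depends on your utilization and GPU pricing, but at sustained high volume the self-hosted node is roughly an order of magnitude cheaper.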

PagedAttention: CUDA KV Cache Management

The fundamental problem in GPU inference is KV cache memory fragmentation. With standard PyTorch serving, when 8 concurrent requests each have different sequence lengths, you must pre-allocate the maximum possible KV cache size for every one of them, wasting 60-80% of the allocated KV cache memory.

PagedAttention (vLLM's core innovation) borrows the OS virtual memory paging concept. KV cache vectors are stored in fixed-size pages (blocks) which are dynamically allocated on demand. Multiple sequences can share non-contiguous physical KV pages, eliminating fragmentation and enabling 3-4x higher throughput on the same hardware.
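A toy sketch of the block-table idea (not vLLM's actual CUDA implementation; `BLOCK_SIZE`, the class names, and the pool size are all illustrative): each sequence maps logical token positions to fixed-size physical blocks that are allocated only when needed, so two sequences of different lengths never reserve each other's memory.

```python
BLOCK_SIZE = 4  # KV slots per physical block (vLLM's default is larger)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    """Maps logical KV positions to non-contiguous physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        if self.length % BLOCK_SIZE == 0:   # current block is full
            self.block_table.append(self.allocator.alloc())
        self.length += 1

    def physical_slot(self, pos):
        block = self.block_table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
for _ in range(6):
    a.append_token()   # seq A: 6 tokens -> 2 blocks (one partially filled)
for _ in range(3):
    b.append_token()   # seq B: 3 tokens -> 1 block
print(a.block_table, b.block_table)  # non-contiguous, on-demand blocks
```

The only waste is the unfilled tail of each sequence's last block, which is why fragmentation drops from 60-80% to a few percent.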

Continuous Batching vs. Static Batching

Static batching waits until a full batch of requests arrives before running an inference step. If Request A finishes in 10 tokens and Request B needs 500, the GPU sits idle waiting for B. Continuous batching (also called iteration-level scheduling) inserts new requests mid-batch the moment a slot frees up. This is why vLLM's throughput can be 24x higher than naive Hugging Face inference.
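The scheduling difference can be simulated with step counts as a proxy for GPU time. The output lengths below are assumed for illustration:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Iteration-level scheduling: a freed slot is refilled immediately."""
    pending = list(lengths)
    active, steps = [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))   # admit new request mid-batch
        steps += 1                          # one decode step for the batch
        active = [t - 1 for t in active if t > 1]
    return steps

# Mix of short (10-token) and long (500-token) generations, batch size 4.
lengths = [10, 500, 10, 500, 10, 500, 10, 500]
print(static_batch_steps(lengths, 4), continuous_batch_steps(lengths, 4))
```

With this mix, static batching needs roughly twice the decode steps, because short requests leave slots idle until the longest request in their batch completes. The real-world gap is larger still once prefill scheduling and memory savings are counted.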

Code Example

vLLM batch inference with AWQ quantization and tensor parallelism across 2 GPUs. The critical insight: vLLM's API is OpenAI-compatible, so you can switch from OpenAI to self-hosted Llama by changing just the base_url.

python
# Install: pip install vllm
from vllm import LLM, SamplingParams

# Load Llama 3 70B with 4-bit AWQ quantization to fit on 2x A100 80GB GPUs.
# Note: quantization="awq" expects an AWQ-quantized checkpoint; swap the
# base FP16 repo shown here for its AWQ variant before running.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,       # Shard model across 2 GPUs
    quantization="awq",           # 4-bit AWQ quantization
    dtype="float16",
    gpu_memory_utilization=0.90,  # Use 90% of VRAM for weights + KV cache
    max_model_len=8192            # Max context window size
)

# Batch inference — vLLM auto-batches and continuous-schedules these
prompts = [
    "Explain transformer architecture in one paragraph.",
    "Write a Python function to merge two sorted arrays.",
    "What are the trade-offs between SQL and NoSQL databases?"
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# All 3 prompts processed concurrently with PagedAttention
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated {len(output.outputs[0].token_ids)} tokens")
    print(f"Output: {output.outputs[0].text[:100]}...\n")

# Expose as OpenAI-compatible API server:
# $ python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Meta-Llama-3-70B-Instruct \
#     --quantization awq \
#     --tensor-parallel-size 2

Use Cases

Self-hosting Llama 3, Mistral, or Qwen for data privacy compliance (no data leaving your network)
Reducing inference costs by 10-20x vs. OpenAI/Anthropic at high volume
Building on-premise AI for regulated industries (Healthcare, Finance, Government)

Common Mistakes

Running Llama 70B in full bfloat16 without quantization — the weights alone require ~140GB of VRAM (2x H100 80GB), before any KV cache. Use AWQ or GPTQ quantization.
Using Hugging Face pipeline() for production inference — it uses static batching and will be 10x slower than vLLM
Ignoring tensor_parallel_size — if you have 4 GPUs, use all 4 for lower latency per request
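The weight-memory arithmetic behind the first mistake (a rough estimate that ignores KV cache and activations):

```python
def weight_vram_gb(params_billion, bits):
    """Approximate VRAM for model weights alone: params * bits / 8."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_vram_gb(70, 16))  # bfloat16: 140.0 GB -> needs 2x H100 80GB
print(weight_vram_gb(70, 4))   # 4-bit AWQ: 35.0 GB -> fits far smaller GPUs
```

In practice budget extra headroom on top of this for the KV cache, which is exactly what gpu_memory_utilization controls in vLLM.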

Relevance

High - Required for any team building cost-sensitive AI products.
