vLLM & PagedAttention
Self-hosting open-source LLMs, continuous batching, and CUDA memory management.
Why $0.01/1K Tokens Kills Your Margin
[Diagram: vLLM PagedAttention architecture]
At scale, OpenAI API costs become your biggest infrastructure expense. At GPT-4o output pricing ($10 per 1M tokens, i.e. $0.01/1K), a team processing 10M tokens/day spends roughly $3,000/month, and the bill grows linearly with volume: at 500M tokens/day it is $150,000/month. At that point, self-hosting an open-source model (like Llama 3 70B) on a cluster of H100 GPUs via vLLM is almost always cheaper and gives you full control over the model, data, and throughput.
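To sanity-check numbers like these, a back-of-envelope cost model (the flat per-1K-token price is an assumption; real APIs price input and output tokens differently):

```python
# Back-of-envelope API cost model; a flat per-1K-token price is an assumption.
def monthly_api_cost(tokens_per_day, usd_per_1k_tokens, days=30):
    """Monthly spend for a flat per-1K-token API price."""
    return tokens_per_day / 1_000 * usd_per_1k_tokens * days

for tpd in (10_000_000, 500_000_000):
    print(f"{tpd:,} tokens/day -> ${monthly_api_cost(tpd, 0.01):,.0f}/month")
# → 10,000,000 tokens/day -> $3,000/month
# → 500,000,000 tokens/day -> $150,000/month
```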
PagedAttention: CUDA KV Cache Management
The fundamental problem in GPU inference is KV cache memory fragmentation. With standard PyTorch serving, when 8 concurrent requests each have different sequence lengths, you must pre-allocate the maximum possible KV cache size for every one of them; in practice 60-80% of that reserved memory is wasted on padding and fragmentation.
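To see the scale of the problem, a rough sizing sketch. The layer and head counts below are assumptions in the ballpark of a Llama-3-70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16):

```python
# Rough per-sequence KV cache sizing; model dimensions are illustrative
# assumptions for a Llama-3-70B-class model, not exact vLLM accounting.
def kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192, dtype_bytes=2):
    # Per token, each layer stores a K and a V vector: 2 * kv_heads * head_dim values
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

print(f"{kv_cache_bytes() / 2**30:.1f} GiB reserved per 8K-token slot")
# → 2.5 GiB reserved per 8K-token slot
```

Under max-length pre-allocation, a request that generates only 100 tokens still holds that entire slot, which is why the waste compounds quickly across concurrent requests.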
PagedAttention (vLLM's core innovation) borrows the OS virtual memory paging concept. KV cache vectors are stored in fixed-size pages (blocks) which are dynamically allocated on demand. Multiple sequences can share non-contiguous physical KV pages, eliminating fragmentation and enabling 3-4x higher throughput on the same hardware.
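The paging idea can be sketched in a few lines. This is a toy allocator, not vLLM's internals; the 16-token block size matches vLLM's default:

```python
# Toy sketch of PagedAttention-style block allocation (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    """Hands out physical KV blocks from a free pool, on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

alloc = BlockAllocator(num_blocks=1024)
block_tables = {}  # per-sequence "block table": logical -> physical blocks

def append_token(seq_id, seq_len):
    # A new physical block is needed only when a block boundary is crossed,
    # so memory grows with actual sequence length, not the pre-declared max.
    table = block_tables.setdefault(seq_id, [])
    if seq_len % BLOCK_SIZE == 0:
        table.append(alloc.allocate())

for t in range(40):   # sequence A grows to 40 tokens -> ceil(40/16) = 3 blocks
    append_token("A", t)
for t in range(17):   # sequence B grows to 17 tokens -> ceil(17/16) = 2 blocks
    append_token("B", t)

print(len(block_tables["A"]), len(block_tables["B"]))  # → 3 2
```

Because each sequence only holds a block table of pointers, its KV pages need not be contiguous in GPU memory, which is what eliminates fragmentation.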
Continuous Batching vs. Static Batching
Static batching waits until a full batch of requests arrives before running an inference step. If Request A finishes in 10 tokens and Request B needs 500, the GPU sits idle waiting for B. Continuous batching (also called iteration-level scheduling) inserts new requests mid-batch the moment a slot frees up. This is why vLLM's throughput can be 24x higher than naive Hugging Face inference.
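The scheduling gap can be quantified with a toy model (illustrative numbers, one decode step per token, nothing here is vLLM code):

```python
# Toy comparison of static vs. continuous batching.
# One "step" = one decode iteration; each active request emits one token per step.

def static_batch_slot_steps(lengths):
    """Static batching: every slot is held until the longest request finishes."""
    return max(lengths) * len(lengths)

def useful_decode_steps(lengths):
    """Steps that actually produce tokens; continuous batching refills the rest."""
    return sum(lengths)

lengths = [10, 500, 30, 45]  # output lengths of 4 concurrent requests
wasted = static_batch_slot_steps(lengths) - useful_decode_steps(lengths)
print(f"static batching wastes {wasted} of {static_batch_slot_steps(lengths)} slot-steps")
# → static batching wastes 1415 of 2000 slot-steps
```

With iteration-level scheduling those 1,415 idle slot-steps are filled by newly arriving requests, which is where the throughput multiplier comes from.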
Code Example
vLLM batch inference with AWQ quantization and tensor parallelism across 2 GPUs. The critical insight: vLLM's API is OpenAI-compatible, so you can switch from OpenAI to self-hosted Llama by changing just the base_url.
# Install: pip install vllm
from vllm import LLM, SamplingParams

# Load Llama 3 70B with 4-bit quantization to fit on 2x A100 80GB GPUs.
# NOTE: quantization="awq" requires an AWQ-quantized checkpoint; the base
# FP16 repo below is illustrative, so substitute a community *-AWQ repo.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,       # Shard model across 2 GPUs
    quantization="awq",           # 4-bit AWQ quantization
    dtype="float16",
    gpu_memory_utilization=0.90,  # Use 90% of VRAM for weights + KV cache
    max_model_len=8192            # Max context window size
)

# Batch inference — vLLM auto-batches and continuous-schedules these
prompts = [
    "Explain transformer architecture in one paragraph.",
    "Write a Python function to merge two sorted arrays.",
    "What are the trade-offs between SQL and NoSQL databases?"
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# All 3 prompts processed concurrently with PagedAttention
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated {len(output.outputs[0].token_ids)} tokens")
    print(f"Output: {output.outputs[0].text[:100]}...\n")

# Expose as an OpenAI-compatible API server:
# $ python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Meta-Llama-3-70B-Instruct \
#     --quantization awq \
#     --tensor-parallel-size 2
Relevance
High - Required for any team building cost-sensitive AI products.