Prompt Caching & Speculative Decoding

Anthropic/Gemini prefix caching, 90% cost reduction, and speculative token generation.

Stop Paying to Process the Same System Prompt 10,000 Times

Prompt Caching Flow

[Flow diagram: System Prompt (2,000+ tokens, identical on every request) → KV Cache (pre-computed attention states, 90% cheaper to reuse) → combined with the User Query → LLM responds faster]

If your system prompt is 2,000 tokens and you serve 100,000 requests/day, you are re-processing 200 million identical tokens every day. At Claude 3.5 Sonnet's $3 per million input tokens, that is roughly $600/day (about $219,000/year) spent on the same static context. Prompt Prefix Caching lets Anthropic and Google compute and cache your static context on their infrastructure, so you pay full input price only when the cache is written, and cache reads thereafter at a 90% discount.
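The savings are easy to work out. A quick sketch of the arithmetic, using Claude 3.5 Sonnet's published rates ($3/MTok input, $3.75/MTok cache write, $0.30/MTok cache read) and ignoring cache expiry between requests:

```python
TOKENS = 2_000                     # static system prompt tokens
REQUESTS = 100_000                 # requests per day
PRICE_IN = 3.00 / 1_000_000        # $/token, base input rate
PRICE_WRITE = 3.75 / 1_000_000     # $/token, cache write (25% premium)
PRICE_READ = 0.30 / 1_000_000      # $/token, cache read (90% discount)

# Without caching: every request pays full input price for the prefix
no_cache = TOKENS * REQUESTS * PRICE_IN

# With caching: one cache write, then cache reads for the rest
with_cache = TOKENS * PRICE_WRITE + TOKENS * (REQUESTS - 1) * PRICE_READ

print(f"without caching: ${no_cache:,.2f}/day")   # $600.00/day
print(f"with caching:    ${with_cache:,.2f}/day") # $60.01/day
```

In practice the ephemeral cache expires after idle periods, so some extra write charges apply, but at this request volume the cache is effectively always warm.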

How Prefix Caching Works

The model provider hashes the prefix (your system prompt plus any static context). On subsequent requests with the same prefix, the KV cache for those tokens is served from GPU memory instead of being recomputed. The constraint is that the prefix must be identical byte-for-byte: even a single character change invalidates the cache. This is why dynamic content (user names, timestamps) must always be appended after the cached prefix, never injected inside it.
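The byte-for-byte rule can be illustrated with a toy hash check (providers use their own internal fingerprinting, not necessarily SHA-256, but the effect is the same):

```python
import hashlib

def prefix_key(prefix: str) -> str:
    """Toy stand-in for the provider's cache key: any byte change -> new key."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

STATIC_PREFIX = "You are a helpful enterprise assistant.\n<knowledge base text>"

# Correct: dynamic data appended AFTER the static prefix -> key is stable
key_a = prefix_key(STATIC_PREFIX)
key_b = prefix_key(STATIC_PREFIX)
assert key_a == key_b  # cache hit on every request

# Wrong: a timestamp injected INSIDE the prefix -> new key every request
bad_1 = prefix_key(f"Current time: 12:00:01\n{STATIC_PREFIX}")
bad_2 = prefix_key(f"Current time: 12:00:02\n{STATIC_PREFIX}")
assert bad_1 != bad_2  # cache miss every time
```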

Speculative Decoding

Standard decoding is sequential and slow: one token per forward pass. Speculative Decoding uses a small, fast "draft" model (e.g. Llama 3 8B) to propose 4-8 tokens cheaply, then uses the large "target" model (e.g. Llama 3 70B) to verify all of them in a single forward pass. Draft tokens that match what the large model would have generated are accepted in bulk; the first mismatch is replaced with the target model's own token. Because every emitted token is one the target model would have produced anyway, output quality is unchanged, and throughput typically improves 2-3x.
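The accept/verify loop can be sketched with toy "models" that just map a token sequence to a greedy next token (a simplification: real implementations verify all positions in one batched forward pass and can sample rather than match greedily):

```python
from typing import Callable, List

def speculative_step(
    draft_next: Callable[[List[int]], int],   # small draft model (greedy)
    target_next: Callable[[List[int]], int],  # large target model (greedy)
    context: List[int],
    k: int = 4,
) -> List[int]:
    """One round of greedy speculative decoding: the draft proposes k
    tokens, the target verifies them; accept the longest agreeing prefix,
    then emit one target token so at least one token is produced."""
    # 1. Draft model proposes k tokens autoregressively (cheap)
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model verifies each position (in reality: one forward pass)
    accepted, ctx = [], list(context)
    for t in proposed:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # first mismatch: keep target's token
            return accepted
        accepted.append(t)
        ctx.append(t)

    # All k accepted: bonus token from the target's final-position logits
    accepted.append(target_next(ctx))
    return accepted

# Toy check: a draft that agrees gets a whole batch accepted at once
target = lambda ctx: ctx[-1] + 1       # target greedily counts up
good_draft = lambda ctx: ctx[-1] + 1   # draft agrees everywhere
bad_draft = lambda ctx: 99             # draft always wrong

print(speculative_step(good_draft, target, [0], k=4))  # [1, 2, 3, 4, 5]
print(speculative_step(bad_draft, target, [0], k=4))   # [1]
```

Note the output is identical to what the target model alone would produce; speculation only changes how many tokens are confirmed per expensive forward pass.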

Code Example

Anthropic prompt caching with cache_control. The system prompt and knowledge base are cached after the first call. Every subsequent call within the cache TTL reads the cache at 10% of the normal input price, with significantly lower time-to-first-token latency.

python
import anthropic

client = anthropic.Anthropic()

# Load a large static document (e.g. a 100-page PDF as text)
with open("company_knowledge_base.txt", "r") as f:
    knowledge_base = f.read()  # ~50,000 tokens of static content

def query_with_caching(user_question: str) -> str:
    """Use cache_control to cache the expensive static context."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful enterprise assistant.",
            },
            {
                "type": "text",
                "text": knowledge_base,
                "cache_control": {"type": "ephemeral"},  # Cache this block!
            },
        ],
        messages=[{
            "role": "user",
            "content": user_question,
        }],
    )

    # Check cache performance via the response's usage object
    usage = response.usage
    print(f"Input tokens (base rate): {usage.input_tokens}")
    print(f"Cache write tokens (1st call): {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens (90% discount): {usage.cache_read_input_tokens}")

    # 1st call: ~50,000 cache_creation_input_tokens (125% of base input price)
    # Later calls within the 5-min TTL: ~50,000 cache_read_input_tokens (10% of base price)

    return response.content[0].text

# First call - writes the cache (25% write premium)
answer1 = query_with_caching("What is our refund policy?")

# Second call - reads the cache (90% cheaper, much lower time-to-first-token)
answer2 = query_with_caching("What are our shipping regions?")

Use Cases

Reducing cost by 80-90% for chatbots where the system prompt and knowledge base are static
Enabling large context windows (200K tokens) economically for document analysis apps
Video/code analysis where the media is the same across many analysis requests

Common Mistakes

Injecting the current timestamp or user ID inside the cached prefix block — this invalidates the cache on every single request
Expecting the cache to persist indefinitely — Anthropic ephemeral cache expires after 5 minutes of no activity
Not checking cache_read_input_tokens in the response to verify the cache is actually being hit
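That last check can be automated. A minimal sketch, assuming only the Anthropic usage fields shown above (any object exposing `cache_creation_input_tokens` and `cache_read_input_tokens` works, so it is testable with a dummy):

```python
from types import SimpleNamespace

def cache_status(usage) -> str:
    """Classify a response's cache behavior from its usage counters."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # prefix served from cache (10% of input price)
    if wrote > 0:
        return "write"  # first call, or cache expired (25% write premium)
    return "miss"       # nothing cached -- check that your prefix is stable

# Dummy usage objects standing in for anthropic response.usage:
first = SimpleNamespace(cache_creation_input_tokens=50_000, cache_read_input_tokens=0)
later = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=50_000)
assert cache_status(first) == "write"
assert cache_status(later) == "hit"
```

Logging this status per request makes a silently invalidated cache (e.g. a timestamp leaking into the prefix) show up immediately as a run of "write" results.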

Interview Insight

Relevance

High - Directly reduces costs and latency for production systems.
