Prompt Caching & Speculative Decoding
Anthropic/Gemini prefix caching, 90% cost reduction, and speculative token generation.
Stop Paying to Process the Same System Prompt 10,000 Times
Prompt Caching Flow
If your system prompt is 2,000 tokens and you serve 100,000 requests/day, that is 200 million prefix tokens per day — roughly $600/day (about $219,000/year) at Claude 3.5 Sonnet's $3 per million input tokens — spent re-processing the same static context. Prompt Prefix Caching lets Anthropic and Google compute your static context once and cache it on their infrastructure, so you pay full input price for the prefix a single time and roughly a 90% discount on every cache read after that.
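The savings are easy to compute. The sketch below assumes Sonnet-class pricing of $3 per million input tokens, cache writes billed at a 25% premium over base input, and cache reads at 10% of base — check your provider's current pricing page before relying on these multipliers:

```python
# Back-of-envelope savings from prefix caching.
# Assumed pricing: $3.00 per million input tokens,
# cache writes at 1.25x base, cache reads at 0.10x base.
PRICE_PER_MTOK = 3.00
PREFIX_TOKENS = 2_000
REQUESTS_PER_DAY = 100_000

def daily_cost_uncached() -> float:
    # Every request re-processes the full prefix at base price.
    return PREFIX_TOKENS * REQUESTS_PER_DAY / 1e6 * PRICE_PER_MTOK

def daily_cost_cached() -> float:
    # One cache write (at a 25% premium), then cache reads at 10% of base.
    write = PREFIX_TOKENS / 1e6 * PRICE_PER_MTOK * 1.25
    reads = PREFIX_TOKENS * (REQUESTS_PER_DAY - 1) / 1e6 * PRICE_PER_MTOK * 0.10
    return write + reads

print(f"Uncached: ${daily_cost_uncached():,.2f}/day")  # Uncached: $600.00/day
print(f"Cached:   ${daily_cost_cached():,.2f}/day")    # Cached:   $60.01/day
```

In practice the cached figure is an upper bound on savings per prefix: the ephemeral cache has a short TTL, so sustained traffic is needed to keep it warm, and each cold start incurs another write.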
How Prefix Caching Works
The model provider hashes the prefix (your system prompt + any static context). On subsequent requests with the same prefix, the KV cache for those tokens is served from their GPU memory instead of recomputed. The constraint is that the prefix must be identical byte-for-byte. Even a single character change invalidates the cache. This is why dynamic content (user names, timestamps) must always be appended after the cached prefix, never injected inside it.
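A minimal sketch of why placement matters — the provider keys the KV cache on the exact prefix bytes, so any dynamic value injected inside the prefix produces a new key (the `cache_key` helper and the prefix string here are illustrative, not the provider's actual implementation):

```python
import hashlib

def cache_key(prefix: str) -> str:
    # Stand-in for the provider's cache lookup: keyed on exact prefix bytes.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

STATIC_PREFIX = "You are a helpful enterprise assistant.\n<50k-token knowledge base>"

# Right: dynamic content appended AFTER the prefix -> same key, cache hit.
assert cache_key(STATIC_PREFIX) == cache_key(STATIC_PREFIX)

# Wrong: a timestamp injected INSIDE the prefix -> different key, cache miss.
bad_a = f"[2024-01-01T00:00:00] {STATIC_PREFIX}"
bad_b = f"[2024-01-01T00:00:01] {STATIC_PREFIX}"
assert cache_key(bad_a) != cache_key(bad_b)  # one second of drift invalidates everything
```

The same reasoning applies to user names, request IDs, and A/B-test flags: keep them in the user message, not the system prompt.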
Speculative Decoding
Standard decoding is sequential and slow: one token per forward pass of the large model. Speculative Decoding uses a small, fast "draft" model (e.g. Llama 3 8B) to cheaply propose 4-8 candidate tokens, then uses the large "oracle" model (e.g. Llama 3 70B) to verify all of them in a single forward pass. Every draft token that matches what the large model would have generated is accepted; at the first mismatch, the large model's own token is used instead. With proper rejection sampling the output distribution is provably identical to the large model's, so the typical 2-3x throughput improvement comes with no quality loss — the cost is the extra (small) draft-model compute.
Code Example
Anthropic prompt caching with cache_control. The system prompt and knowledge base are cached after the first call. Every subsequent call reads the cache at 10% of normal input cost, with significantly lower time-to-first-token (TTFT) latency.
```python
import anthropic

client = anthropic.Anthropic()

# Load a large static document (e.g. a 100-page PDF as text)
with open("company_knowledge_base.txt", "r") as f:
    knowledge_base = f.read()  # 50,000 tokens of static content

def query_with_caching(user_question: str) -> str:
    """Use cache_control to cache the expensive static context."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful enterprise assistant.",
            },
            {
                "type": "text",
                "text": knowledge_base,
                "cache_control": {"type": "ephemeral"},  # Cache this block!
            },
        ],
        messages=[{
            "role": "user",
            "content": user_question,
        }],
    )

    # Check cache performance via the usage object on the response
    usage = response.usage
    print(f"Input tokens (charged at full price): {usage.input_tokens}")
    print(f"Cache write tokens (1st call): {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens (90% discount): {usage.cache_read_input_tokens}")

    # 1st call: pays for 50,000 cache_creation_input_tokens (base price + write premium)
    # Subsequent calls within the cache TTL: 50,000 cache_read_input_tokens at 10% price

    return response.content[0].text

# First call - writes the cache
answer1 = query_with_caching("What is our refund policy?")

# Second call - reads the cache (90% cheaper, ~5x faster TTFT)
answer2 = query_with_caching("What are our shipping regions?")
```
Use Cases
Common Mistakes
Interview Insight
Relevance
High - Directly reduces costs and latency for production systems.