Prompt Caching & Speculative Decoding

Anthropic/Gemini prefix caching, 90% cost reduction, and speculative token generation.

Stop Paying to Process the Same System Prompt 10,000 Times

Prompt Caching Flow

[Flow diagram: System Prompt (2,000+ tokens, identical on every request) → KV Cache (pre-computed attention states, 90% cheaper to reuse) → combined with the User Query → LLM responds faster]

If your system prompt is 2,000 tokens and you serve 100,000 requests/day, you are re-processing 200 million identical tokens every day. At Claude 3.5 Sonnet's $3 per million input tokens, that is roughly $600/day (about $219,000/year) spent on the same static context. Prompt Prefix Caching lets Anthropic and Google compute and cache your static context on their infrastructure, so you pay full input price only when the cache is written, and cache reads thereafter at a 90% discount.
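The savings are easy to work out. A quick sketch of the arithmetic, using Claude 3.5 Sonnet's published rates ($3/MTok input, $3.75/MTok cache write, $0.30/MTok cache read) and ignoring cache expiry between requests:

```python
TOKENS = 2_000                     # static system prompt tokens
REQUESTS = 100_000                 # requests per day
PRICE_IN = 3.00 / 1_000_000        # $/token, base input rate
PRICE_WRITE = 3.75 / 1_000_000     # $/token, cache write (25% premium)
PRICE_READ = 0.30 / 1_000_000      # $/token, cache read (90% discount)

# Without caching: every request pays full input price for the prefix
no_cache = TOKENS * REQUESTS * PRICE_IN

# With caching: one cache write, then cache reads for the rest
with_cache = TOKENS * PRICE_WRITE + TOKENS * (REQUESTS - 1) * PRICE_READ

print(f"without caching: ${no_cache:,.2f}/day")   # $600.00/day
print(f"with caching:    ${with_cache:,.2f}/day") # $60.01/day
```

In practice the ephemeral cache expires after idle periods, so some extra write charges apply, but at this request volume the cache is effectively always warm.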

How Prefix Caching Works

The model provider hashes the prefix (your system prompt plus any static context). On subsequent requests with the same prefix, the KV cache for those tokens is served from GPU memory instead of being recomputed. The constraint is that the prefix must be identical byte-for-byte: even a single character change invalidates the cache. This is why dynamic content (user names, timestamps) must always be appended after the cached prefix, never injected inside it.
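The byte-for-byte rule can be illustrated with a toy hash check (providers use their own internal fingerprinting, not necessarily SHA-256, but the effect is the same):

```python
import hashlib

def prefix_key(prefix: str) -> str:
    """Toy stand-in for the provider's cache key: any byte change -> new key."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

STATIC_PREFIX = "You are a helpful enterprise assistant.\n<knowledge base text>"

# Correct: dynamic data appended AFTER the static prefix -> key is stable
key_a = prefix_key(STATIC_PREFIX)
key_b = prefix_key(STATIC_PREFIX)
assert key_a == key_b  # cache hit on every request

# Wrong: a timestamp injected INSIDE the prefix -> new key every request
bad_1 = prefix_key(f"Current time: 12:00:01\n{STATIC_PREFIX}")
bad_2 = prefix_key(f"Current time: 12:00:02\n{STATIC_PREFIX}")
assert bad_1 != bad_2  # cache miss every time
```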

Speculative Decoding

Standard decoding is sequential and slow: one token per forward pass. Speculative Decoding uses a small, fast "draft" model (e.g. Llama 3 8B) to propose 4-8 tokens cheaply, then uses the large "target" model (e.g. Llama 3 70B) to verify all of them in a single forward pass. Draft tokens that match what the large model would have generated are accepted in bulk; the first mismatch is replaced with the target model's own token. Because every emitted token is one the target model would have produced anyway, output quality is unchanged, and throughput typically improves 2-3x.
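The accept/verify loop can be sketched with toy "models" that just map a token sequence to a greedy next token (a simplification: real implementations verify all positions in one batched forward pass and can sample rather than match greedily):

```python
from typing import Callable, List

def speculative_step(
    draft_next: Callable[[List[int]], int],   # small draft model (greedy)
    target_next: Callable[[List[int]], int],  # large target model (greedy)
    context: List[int],
    k: int = 4,
) -> List[int]:
    """One round of greedy speculative decoding: the draft proposes k
    tokens, the target verifies them; accept the longest agreeing prefix,
    then emit one target token so at least one token is produced."""
    # 1. Draft model proposes k tokens autoregressively (cheap)
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model verifies each position (in reality: one forward pass)
    accepted, ctx = [], list(context)
    for t in proposed:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # first mismatch: keep target's token
            return accepted
        accepted.append(t)
        ctx.append(t)

    # All k accepted: bonus token from the target's final-position logits
    accepted.append(target_next(ctx))
    return accepted

# Toy check: a draft that agrees gets a whole batch accepted at once
target = lambda ctx: ctx[-1] + 1       # target greedily counts up
good_draft = lambda ctx: ctx[-1] + 1   # draft agrees everywhere
bad_draft = lambda ctx: 99             # draft always wrong

print(speculative_step(good_draft, target, [0], k=4))  # [1, 2, 3, 4, 5]
print(speculative_step(bad_draft, target, [0], k=4))   # [1]
```

Note the output is identical to what the target model alone would produce; speculation only changes how many tokens are confirmed per expensive forward pass.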

Code Example

Anthropic prompt caching with cache_control. The system prompt and knowledge base are cached after the first call. Every subsequent call within the cache TTL reads the cache at 10% of the normal input price, with significantly lower time-to-first-token latency.

python
import anthropic

client = anthropic.Anthropic()

# Load a large static document (e.g. a 100-page PDF as text)
with open("company_knowledge_base.txt", "r") as f:
    knowledge_base = f.read()  # ~50,000 tokens of static content

def query_with_caching(user_question: str) -> str:
    """Use cache_control to cache the expensive static context."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful enterprise assistant.",
            },
            {
                "type": "text",
                "text": knowledge_base,
                "cache_control": {"type": "ephemeral"},  # Cache this block!
            },
        ],
        messages=[{
            "role": "user",
            "content": user_question,
        }],
    )

    # Check cache performance via the response's usage object
    usage = response.usage
    print(f"Input tokens (base rate): {usage.input_tokens}")
    print(f"Cache write tokens (1st call): {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens (90% discount): {usage.cache_read_input_tokens}")

    # 1st call: ~50,000 cache_creation_input_tokens (125% of base input price)
    # Later calls within the 5-min TTL: ~50,000 cache_read_input_tokens (10% of base price)

    return response.content[0].text

# First call - writes the cache (25% write premium)
answer1 = query_with_caching("What is our refund policy?")

# Second call - reads the cache (90% cheaper, much lower time-to-first-token)
answer2 = query_with_caching("What are our shipping regions?")

Use Cases

Reducing cost by 80-90% for chatbots where the system prompt and knowledge base are static
Enabling large context windows (200K tokens) economically for document analysis apps
Video/code analysis where the media is the same across many analysis requests

Common Mistakes

Injecting the current timestamp or user ID inside the cached prefix block — this invalidates the cache on every single request
Expecting the cache to persist indefinitely — Anthropic ephemeral cache expires after 5 minutes of no activity
Not checking cache_read_input_tokens in the response to verify the cache is actually being hit
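That last check can be automated. A minimal sketch, assuming only the Anthropic usage fields shown above (any object exposing `cache_creation_input_tokens` and `cache_read_input_tokens` works, so it is testable with a dummy):

```python
from types import SimpleNamespace

def cache_status(usage) -> str:
    """Classify a response's cache behavior from its usage counters."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # prefix served from cache (10% of input price)
    if wrote > 0:
        return "write"  # first call, or cache expired (25% write premium)
    return "miss"       # nothing cached -- check that your prefix is stable

# Dummy usage objects standing in for anthropic response.usage:
first = SimpleNamespace(cache_creation_input_tokens=50_000, cache_read_input_tokens=0)
later = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=50_000)
assert cache_status(first) == "write"
assert cache_status(later) == "hit"
```

Logging this status per request makes a silently invalidated cache (e.g. a timestamp leaking into the prefix) show up immediately as a run of "write" results.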

Interview Insight

Relevance

High - Directly reduces costs and latency for production systems.
