Caching & Cost Optimization
Strategies for drastically reducing latency and API costs in production.
Why Cache LLM Calls?
LLM API calls are expensive and slow. If 1,000 users ask "What is your return policy?", you shouldn't generate the answer 1,000 times.
Caching Layers
1. Exact Match Caching
Typically built on Redis or Memcached: you hash the exact prompt and its hyperparameters as the key and store the text response as the value. Extremely fast, but fragile (a single extra space changes the hash and misses the cache).
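A minimal sketch of the exact-match approach, using an in-memory dict to stand in for Redis/Memcached; the `cache_key` and `get_or_generate` helpers are illustrative names, not a library API:

```python
import hashlib
import json

_cache = {}  # stand-in for Redis/Memcached

def cache_key(prompt, **params):
    """Hash the exact prompt plus hyperparameters into a stable key."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_or_generate(prompt, generate_fn, **params):
    key = cache_key(prompt, **params)
    if key in _cache:
        return _cache[key]          # cache hit: no API call
    response = generate_fn(prompt)  # cache miss: pay for the LLM call
    _cache[key] = response
    return response
```

Note that the key covers the hyperparameters too: the same prompt at a different temperature is a different cache entry, and any byte-level change to the prompt (even trailing whitespace) produces a different hash.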
2. Semantic Caching
Instead of matching exact strings, semantic caching uses embeddings. You embed the user's prompt and query a vector database (such as Redis, Pinecone, or Milvus) for similar previous queries. If a match exceeds a similarity threshold (e.g., 0.95 cosine similarity), you return the cached response.
Cost Optimization Strategies
- Model Routing: Send simple tasks (summarization, extraction) to cheaper/faster models (Claude 3 Haiku, Llama 3 8B). Send complex tasks (reasoning, coding) to GPT-4o or Claude 3.5 Sonnet.
- Prompt Minimization: Strip unnecessary whitespace, comments, and boilerplate from the prompt to reduce token count.
- Batch Processing: For non-real-time tasks (e.g., processing 10,000 resumes overnight), use the OpenAI/Anthropic Batch APIs for a 50% discount.
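The routing strategy above can be sketched with a simple heuristic. This is an illustrative assumption only: the keyword list, length cutoff, and model identifiers are placeholders, and production routers more often use a small classifier model than keyword matching:

```python
CHEAP_MODEL = "claude-3-haiku"   # fast/cheap tier for simple tasks
STRONG_MODEL = "gpt-4o"          # expensive tier for reasoning and coding

# Hypothetical signals that a task needs the stronger model
COMPLEX_HINTS = ("prove", "debug", "refactor", "step by step", "write code")

def route_model(task: str) -> str:
    """Route simple tasks to the cheap tier, complex ones to the strong tier."""
    text = task.lower()
    if any(hint in text for hint in COMPLEX_HINTS) or len(text) > 2000:
        return STRONG_MODEL
    return CHEAP_MODEL
```

Even a crude router like this can cut spend substantially if most traffic is summarization and extraction, since those requests never touch the expensive model.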
Code Example
A conceptual implementation of Semantic Caching.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# In a real app, use Redis/Pinecone instead of an in-memory list
cache_db = []
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_cached_response(user_query, threshold=0.95):
    query_emb = model.encode(user_query)

    for cached_query, response, cached_emb in cache_db:
        # Cosine similarity between the new query and the cached one
        similarity = np.dot(query_emb, cached_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
        )

        if similarity >= threshold:
            print(f"Cache Hit! Similarity: {similarity:.2f}")
            return response

    return None

def process_query(user_query):
    # 1. Check cache
    cached = get_cached_response(user_query)
    if cached:
        return cached

    # 2. Call LLM (simulated)
    print("Cache Miss. Calling expensive LLM...")
    response = "This is the newly generated LLM response."

    # 3. Store in cache
    query_emb = model.encode(user_query)
    cache_db.append((user_query, response, query_emb))

    return response
```
Relevance
High - Every company wants to reduce their OpenAI bill.