Caching & Cost Optimization

Strategies for drastically reducing latency and API costs in production.

Why Cache LLM Calls?

Caching Layers

[Diagram: an incoming request first hits the Semantic Cache (similar query? → return cached response, saving 100% of the cost); on a miss, the Prompt Cache reuses the system-prompt prefix before falling through to a full LLM API call.]

LLM API calls are expensive and slow. If 1,000 users ask "What is your return policy?", you shouldn't generate the answer 1,000 times.

1. Exact Match Caching

Typically implemented with Redis or Memcached: you hash the exact prompt plus hyperparameters as the key and store the text response as the value. Extremely fast, but fragile (a single extra space changes the hash and causes a miss).
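The hash-key idea above can be sketched in a few lines. This is a minimal, hedged illustration: a plain dict stands in for Redis/Memcached, and `cache_key`, `get_or_call`, and `llm_fn` are hypothetical names, not a library API.

```python
import hashlib
import json

# In production this would be Redis or Memcached; a dict stands in here.
exact_cache = {}

def cache_key(prompt, **params):
    # Hash the exact prompt plus hyperparameters, so a different
    # temperature (or a single extra space) yields a different key.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_or_call(prompt, llm_fn, **params):
    key = cache_key(prompt, **params)
    if key in exact_cache:
        return exact_cache[key]          # cache hit: no API call
    response = llm_fn(prompt, **params)  # cache miss: pay for the call
    exact_cache[key] = response
    return response
```

Note how the fragility mentioned above falls out directly: `"What is your return policy?"` and `"What is your return policy? "` hash to different keys, which is exactly what semantic caching fixes.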

2. Semantic Caching

Instead of matching exact strings, semantic caching uses embeddings. You embed the user's prompt and query a vector database (like Redis, Pinecone, or Milvus) for similar previous queries. If a match exceeds a similarity threshold (e.g., cosine similarity ≥ 0.95), you return the cached response.

Cost Optimization Strategies

  • Model Routing: Send simple tasks (summarization, extraction) to cheaper/faster models (Claude 3 Haiku, Llama 3 8B). Send complex tasks (reasoning, coding) to GPT-4o or Claude 3.5 Sonnet.
  • Prompt Minimization: Strip unnecessary whitespace, comments, and boilerplate from the prompt to reduce token count.
  • Batch Processing: For non-real-time tasks (e.g., processing 10,000 resumes overnight), use the OpenAI/Anthropic Batch APIs for a 50% discount.
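The model-routing bullet above can be sketched as a simple heuristic router. This is an assumption-laden illustration: real systems often use a small classifier model for intent detection, and the keyword list, word-count cutoff, and model names here are illustrative choices, not a prescribed configuration.

```python
# Illustrative model identifiers (cheap/fast vs. capable/expensive).
CHEAP_MODEL = "claude-3-haiku"
STRONG_MODEL = "claude-3-5-sonnet"

# Hypothetical signals that a query needs heavyweight reasoning.
COMPLEX_HINTS = ("prove", "debug", "refactor", "step by step", "write code")

def route_model(user_query: str) -> str:
    q = user_query.lower()
    # Route long or reasoning-heavy queries to the strong model;
    # everything else (summarization, extraction) goes to the cheap one.
    if any(hint in q for hint in COMPLEX_HINTS) or len(q.split()) > 200:
        return STRONG_MODEL
    return CHEAP_MODEL
```

In practice the router itself must stay cheap: a keyword check or a small embedding classifier costs microseconds, while the savings from diverting traffic off the frontier model dominate.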

Code Example

A conceptual implementation of Semantic Caching.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# In a real app, use Redis/Pinecone instead of an in-memory list
cache_db = []
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_cached_response(user_query, threshold=0.95):
    query_emb = model.encode(user_query)

    for cached_query, response, cached_emb in cache_db:
        # Cosine similarity between the new query and a cached one
        similarity = np.dot(query_emb, cached_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
        )

        if similarity >= threshold:
            print(f"Cache Hit! Similarity: {similarity:.2f}")
            return response

    return None

def process_query(user_query):
    # 1. Check cache
    cached = get_cached_response(user_query)
    if cached:
        return cached

    # 2. Call LLM (simulated)
    print("Cache Miss. Calling expensive LLM...")
    response = "This is the newly generated LLM response."

    # 3. Store in cache
    query_emb = model.encode(user_query)
    cache_db.append((user_query, response, query_emb))

    return response
```

Use Cases

  • E-commerce FAQ bots fielding redundant questions
  • Reducing latency from 5000ms to 50ms for repeat code generation queries
  • Dynamic model routing based on user input intent

Common Mistakes

  • Using semantic caching for tasks that require real-time dynamic data (e.g., stock prices)
  • Setting the semantic similarity threshold too low, returning irrelevant cached answers
  • Failing to implement cache invalidation when the underlying system knowledge changes

Interview Insight

Relevance

High - Every company wants to reduce their OpenAI bill.
