Caching & Cost Optimization
Strategies for drastically reducing latency and API costs in production.
Why Cache LLM Calls?
LLM API calls are expensive and slow. If 1,000 users ask "What is your return policy?", you shouldn't generate the answer 1,000 times.
Caching Layers
1. Exact Match Caching
Typically built on Redis or Memcached: you hash the exact prompt and its hyperparameters as the key and store the text response as the value. Extremely fast, but fragile (a single extra space changes the hash and misses the cache).
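A minimal sketch of the exact-match approach, using an in-memory dict to stand in for Redis/Memcached; the `cache_key` and `get_or_generate` helpers are illustrative names, not a library API:

```python
import hashlib
import json

_cache = {}  # stand-in for Redis/Memcached

def cache_key(prompt, **params):
    """Hash the exact prompt plus hyperparameters into a stable key."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_or_generate(prompt, generate_fn, **params):
    key = cache_key(prompt, **params)
    if key in _cache:
        return _cache[key]          # cache hit: no API call
    response = generate_fn(prompt)  # cache miss: pay for the LLM call
    _cache[key] = response
    return response
```

Note that the key covers the hyperparameters too: the same prompt at a different temperature is a different cache entry, and any byte-level change to the prompt (even trailing whitespace) produces a different hash.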
2. Semantic Caching
Instead of matching exact strings, semantic caching uses embeddings. You embed the user's prompt and query a vector database (such as Redis, Pinecone, or Milvus) for similar previous queries. If a match exceeds a similarity threshold (e.g., 0.95 cosine similarity), you return the cached response.
Cost Optimization Strategies
- Model Routing: Send simple tasks (summarization, extraction) to cheaper/faster models (Claude 3 Haiku, Llama 3 8B). Send complex tasks (reasoning, coding) to GPT-4o or Claude 3.5 Sonnet.
- Prompt Minimization: Strip unnecessary whitespace, comments, and boilerplate from the prompt to reduce token count.
- Batch Processing: For non-real-time tasks (e.g., processing 10,000 resumes overnight), use the OpenAI/Anthropic Batch APIs for a 50% discount.
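The routing strategy above can be sketched with a simple heuristic. This is an illustrative assumption only: the keyword list, length cutoff, and model identifiers are placeholders, and production routers more often use a small classifier model than keyword matching:

```python
CHEAP_MODEL = "claude-3-haiku"   # fast/cheap tier for simple tasks
STRONG_MODEL = "gpt-4o"          # expensive tier for reasoning and coding

# Hypothetical signals that a task needs the stronger model
COMPLEX_HINTS = ("prove", "debug", "refactor", "step by step", "write code")

def route_model(task: str) -> str:
    """Route simple tasks to the cheap tier, complex ones to the strong tier."""
    text = task.lower()
    if any(hint in text for hint in COMPLEX_HINTS) or len(text) > 2000:
        return STRONG_MODEL
    return CHEAP_MODEL
```

Even a crude router like this can cut spend substantially if most traffic is summarization and extraction, since those requests never touch the expensive model.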
Code Example
A conceptual implementation of Semantic Caching.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# In a real app, use Redis/Pinecone instead of an in-memory list
cache_db = []
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_cached_response(user_query, threshold=0.95):
    query_emb = model.encode(user_query)

    for cached_query, response, cached_emb in cache_db:
        # Cosine similarity between the new query and the cached one
        similarity = np.dot(query_emb, cached_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
        )

        if similarity >= threshold:
            print(f"Cache Hit! Similarity: {similarity:.2f}")
            return response

    return None

def process_query(user_query):
    # 1. Check cache
    cached = get_cached_response(user_query)
    if cached:
        return cached

    # 2. Call LLM (simulated)
    print("Cache Miss. Calling expensive LLM...")
    response = "This is the newly generated LLM response."

    # 3. Store in cache
    query_emb = model.encode(user_query)
    cache_db.append((user_query, response, query_emb))

    return response
```
Relevance
High - Every company wants to reduce their OpenAI bill.