Why Naive RAG Fails in Production
Bi-Encoder/Cross-Encoder pipelines, HyDE, and Contextual Retrieval.
Naive RAG Kills Production Systems
Common RAG Failure Modes
The standard tutorial pipeline (PDF → chunk → embed → cosine search → stuff context → GPT-4) breaks down at scale. Beyond roughly 50,000 documents, Top-K retrieval starts surfacing structurally irrelevant chunks, because cosine similarity over isolated chunk embeddings carries no notion of document hierarchy or context. The LLM then receives a fragmented, disconnected context window and confidently hallucinates.
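The failing stage is the retrieval step itself. A minimal sketch of naive Top-K cosine retrieval, using toy vectors as stand-ins for real embeddings, shows how little information it actually uses: one similarity score per chunk, with no reranking and no context.

```python
import numpy as np

def naive_top_k(query_vec, chunk_vecs, k=3):
    """The 'naive RAG' retrieval step: rank chunks by cosine similarity alone.

    No reranking, no context enrichment - exactly the stage that degrades
    as the corpus grows. Vectors here are toy stand-ins for real embeddings.
    """
    # Normalize so dot products are true cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[-k:][::-1]  # indices of the top-k chunks

# Toy corpus: 5 chunks in a 4-dimensional embedding space
chunks = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.7, 0.3, 0.0, 0.1],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(naive_top_k(query, chunks))  # -> [0 2 4]
```

The score says nothing about whether a chunk is self-contained or even about the right entity; that is what the techniques below repair.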
The Bi-Encoder / Cross-Encoder Reranking Pipeline
Production RAG is a two-stage retrieval process. Stage 1 uses a fast Bi-Encoder (like text-embedding-3-small) to recall the Top-100 candidates using approximate nearest-neighbor search. Stage 2 uses a slower, more accurate Cross-Encoder (like Cohere Rerank or a fine-tuned BERT) that scores the exact relationship between the full query and each individual chunk. You surface the actual Top-5 from there.
Trade-off: the reranking pass adds ~300ms of latency, but buys roughly an 80% reduction in hallucinations caused by irrelevant context. At enterprise scale this is almost always the right trade.
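The two-stage flow can be sketched with stand-in scorers (in production, stage 1 is an ANN index over bi-encoder embeddings and stage 2 is a real cross-encoder such as Cohere Rerank; the `score_fn` here is a hypothetical placeholder):

```python
import numpy as np

def bi_encoder_recall(query_vec, chunk_vecs, k=100):
    """Stage 1: fast, approximate - rank by cosine similarity, keep top-k."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    k = min(k, len(chunk_vecs))  # small corpora have fewer than k chunks
    return list(np.argsort(scores)[-k:][::-1])

def cross_encoder_rerank(query, candidates, score_fn, top_n=5):
    """Stage 2: slow, exact - score each (query, chunk) pair jointly.

    score_fn stands in for a real cross-encoder's pairwise scorer,
    which sees the full query AND the full chunk together.
    """
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Toy stage-2 scorer: word overlap between query and chunk
chunks = ["refunds within 30 days", "shipping takes 5 days", "refund policy details"]
top = cross_encoder_rerank(
    "refund policy", chunks,
    score_fn=lambda q, c: len(set(q.split()) & set(c.split())),
    top_n=2,
)
print(top[0])  # -> refund policy details
```

The key structural point: stage 1 only ever compares two precomputed vectors, while stage 2 scores the query and each candidate jointly, which is why it is both slower and more accurate.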
HyDE (Hypothetical Document Embeddings)
The core insight: a user query ("What's the refund policy?") and the document containing the answer ("Our 30-day return policy states...") often have very different vector representations. HyDE fixes this by asking the LLM to generate a hypothetical answer to the query first, then embedding that hallucinated answer to search the database. Even a hallucinated answer is typically much closer in vector space to the real document than the original query is.
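The mechanics reduce to one substitution: embed `generate_fn(query)` instead of `query`. A minimal sketch, using a toy bag-of-words embedder and a stubbed generator in place of a real embedding model and LLM call:

```python
import numpy as np

def bow_embed(text, vocab):
    """Toy bag-of-words embedder standing in for a real embedding model."""
    words = text.lower().split()
    v = np.array([float(words.count(w)) for w in vocab])
    n = np.linalg.norm(v)
    return v / n if n else v

def hyde_search(query, docs, generate_fn, k=1):
    """HyDE: embed a generated hypothetical answer instead of the raw query."""
    hypothetical = generate_fn(query)       # normally an LLM call
    vocab = sorted({w for d in docs for w in d.lower().split()})
    q_vec = bow_embed(hypothetical, vocab)  # embed the answer, NOT the query
    doc_vecs = np.array([bow_embed(d, vocab) for d in docs])
    scores = doc_vecs @ q_vec
    return [docs[i] for i in np.argsort(scores)[-k:][::-1]]

docs = [
    "our 30-day return policy states items may be returned for a full refund",
    "shipping is free on orders over 50 dollars",
]
# Stub generator: a plausible hallucinated answer, sharing the document's phrasing
fake_llm = lambda q: "you can return items under our 30-day return policy"
print(hyde_search("what's the refund policy?", docs, fake_llm)[0][:20])
```

The hallucinated answer shares the answer document's vocabulary ("return", "30-day", "policy") even when its facts are wrong, so it lands near the right document in embedding space.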
Contextual Retrieval (Anthropic)
Isolated chunks lose context. The chunk "The revenue increased by 15%" is meaningless without knowing which company or which period it refers to. Contextual Retrieval uses Claude to generate a concise context summary ("This chunk describes Apple's Q3 2024 earnings call...") and prepends it to every chunk before embedding. Anthropic reports this step reduces retrieval failure rates by up to 49% (when combined with BM25 hybrid search).
Code Example
Production RAG pipeline combining Contextual Retrieval + HyDE + Cohere Reranking. Each pattern solves a specific failure mode: context loss, query-document vector mismatch, and top-k irrelevance.
from anthropic import Anthropic
import cohere
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
co = cohere.Client("YOUR_COHERE_KEY")
encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: Contextual Retrieval - enrich chunks before indexing
def add_context_to_chunk(full_document: str, chunk: str) -> str:
    """Use Claude to prepend situational context to each chunk."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Use cheap model for batch processing
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"""<document>{full_document}</document>
Here is the chunk: <chunk>{chunk}</chunk>

Give a 1-2 sentence context for this chunk within the document. Be concise."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk}"  # Prepend context to chunk

# Step 2: HyDE - embed a hypothetical answer, not the raw query
def hyde_retrieve(query: str, chunks: list[str]) -> list[str]:
    """Generate a hypothetical answer, embed it, retrieve similar chunks."""

    # Generate a hallucinated answer
    hyp_response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": f"Write a 2-sentence answer to: {query}"}]
    )
    hypothetical_answer = hyp_response.content[0].text

    # Embed hypothetical answer (NOT the original query); normalize so the
    # dot product below is a true cosine similarity
    hyp_embedding = encoder.encode(hypothetical_answer, normalize_embeddings=True)
    chunk_embeddings = encoder.encode(chunks, normalize_embeddings=True)

    # Cosine similarity against the hypothetical answer vector
    scores = np.dot(chunk_embeddings, hyp_embedding)
    k = min(100, len(chunks))  # guard against corpora smaller than 100 chunks
    top_100_indices = np.argsort(scores)[-k:][::-1]
    top_100_chunks = [chunks[i] for i in top_100_indices]

    # Step 3: Rerank with Cross-Encoder (Cohere)
    rerank_response = co.rerank(
        query=query,
        documents=top_100_chunks,
        top_n=5,  # Surface Top-5 from 100
        model="rerank-english-v3.0"
    )

    return [top_100_chunks[r.index] for r in rerank_response.results]

# Usage
final_chunks = hyde_retrieve("What is the company's refund policy?", all_chunks)
print(f"Retrieved {len(final_chunks)} high-quality chunks for context")
Use Cases
Common Mistakes
Interview Insight
Relevance
High - Core Senior AI interview topic at Stripe, Notion, Glean.