Why Naive RAG Fails in Production

Bi-Encoder/Cross-Encoder pipelines, HyDE, and Contextual Retrieval.

Naive RAG Kills Production Systems

Common RAG Failure Modes

❌ Bad Chunking: splits mid-sentence
❌ Wrong Retrieval: semantically similar != relevant
❌ Lost in Context: too many chunks
✅ Semantic Chunks: paragraph-aware splitting
✅ Hybrid + Rerank: BM25 + Cross-Encoder
✅ Compress Context: summarize before the LLM
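The "paragraph-aware splitting" fix above can be sketched in a few lines. This is a minimal illustration, not a specific library's API; the `paragraph_chunks` function and its `max_chars` limit are assumptions for the example:

```python
def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, packing whole paragraphs into chunks
    so no chunk ever ends mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph about refunds.\n\nSecond paragraph about shipping.\n\nThird paragraph about returns."
print(paragraph_chunks(doc, max_chars=60))
```

Production chunkers add overlap and sentence-level fallback for oversized paragraphs, but the invariant is the same: chunk boundaries align with the document's own structure.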

The standard tutorial pipeline (PDF → chunk → embed → cosine search → stuff context → GPT-4) breaks down badly at scale. Beyond roughly 50,000 documents, Top-K retrieval surfaces structurally irrelevant chunks because cosine distance doesn't understand contextual hierarchy. The LLM receives a fragmented, disconnected context window and confidently hallucinates.

The Bi-Encoder / Cross-Encoder Reranking Pipeline

Production RAG is a two-stage retrieval process. Stage 1 uses a fast Bi-Encoder (like text-embedding-3-small) to recall the Top-100 candidates using approximate nearest-neighbor search. Stage 2 uses a slower, more accurate Cross-Encoder (like Cohere Rerank or a fine-tuned BERT) that scores the exact relationship between the full query and each individual chunk. You surface the actual Top-5 from there.

Trade-off: You add roughly 300ms of latency for the reranking pass. In exchange, you sharply cut hallucinations caused by irrelevant context (reductions on the order of 80% in practice). At enterprise scale, this is almost always the right trade.
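The two-stage shape is easy to see in isolation. Here is a self-contained toy sketch with random vectors standing in for bi-encoder embeddings, and a placeholder scoring function standing in for the cross-encoder forward pass (both are assumptions for illustration, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 10,000 docs x 64 dims, unit-normalized (stand-ins for bi-encoder embeddings)
doc_vecs = rng.normal(size=(10_000, 64))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# Query constructed to be close to document 42
query_vec = doc_vecs[42] + 0.1 * rng.normal(size=64)
query_vec /= np.linalg.norm(query_vec)

# Stage 1: fast bi-encoder recall - one dot product per document
recall_scores = doc_vecs @ query_vec
top_100 = np.argsort(recall_scores)[-100:][::-1]

# Stage 2: slow, precise rerank - in production this is one cross-encoder
# forward pass per (query, doc) pair; here a placeholder exact score
def cross_encoder_score(q: np.ndarray, d: np.ndarray) -> float:
    return float(q @ d)  # stand-in for model(query_text, doc_text)

rerank_scores = [cross_encoder_score(query_vec, doc_vecs[i]) for i in top_100]
top_5 = [top_100[i] for i in np.argsort(rerank_scores)[-5:][::-1]]
print(top_5)
```

The key structural point: the expensive scorer only ever sees 100 candidates, never 10,000, so its cost is bounded regardless of corpus size.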

HyDE (Hypothetical Document Embeddings)

The core insight: a user query ("What's the refund policy?") and the document containing the answer ("Our 30-day return policy states...") often have very different vector representations. HyDE fixes this by asking the LLM to generate a hypothetical answer to the query first, then embedding that hallucinated answer to search the database. A hallucinated answer is typically much closer in vector space to the real document than the original query is.

Contextual Retrieval (Anthropic)

Isolated chunks lose context. The chunk "The revenue increased by 15%" is meaningless without knowing which company and which year. Contextual Retrieval uses Claude to generate a concise context summary ("This chunk describes Apple's Q3 2024 earnings call...") and prepends it to every chunk before embedding. Anthropic reports that this single step reduces retrieval failure rates by 49%.

User Query → Bi-Encoder (Top-100, fast recall) → Cross-Encoder Rerank (Top-5, precise scoring) → LLM (generate with clean context) → Answer

Code Example

Production RAG pipeline combining Contextual Retrieval + HyDE + Cohere Reranking. Each pattern solves a specific failure mode: context loss, query-document vector mismatch, and top-k irrelevance.

python
from anthropic import Anthropic
import cohere
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
co = cohere.Client("YOUR_COHERE_KEY")
encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: Contextual Retrieval - enrich chunks before indexing
def add_context_to_chunk(full_document: str, chunk: str) -> str:
    """Use Claude to prepend situational context to each chunk."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Use cheap model for batch processing
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"""<document>{full_document}</document>
Here is the chunk: <chunk>{chunk}</chunk>

Give a 1-2 sentence context for this chunk within the document. Be concise."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk}"  # Prepend context to chunk

# Step 2: HyDE - embed a hypothetical answer, not the raw query
def hyde_retrieve(query: str, chunks: list[str]) -> list[str]:
    """Generate a hypothetical answer, embed it, retrieve similar chunks."""

    # Generate a hallucinated answer
    hyp_response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": f"Write a 2-sentence answer to: {query}"}]
    )
    hypothetical_answer = hyp_response.content[0].text

    # Embed the hypothetical answer (NOT the original query);
    # normalize so the dot product below is a true cosine similarity
    hyp_embedding = encoder.encode(hypothetical_answer, normalize_embeddings=True)
    chunk_embeddings = encoder.encode(chunks, normalize_embeddings=True)

    # Cosine similarity against the hypothetical answer vector
    scores = chunk_embeddings @ hyp_embedding
    top_100_indices = np.argsort(scores)[-100:][::-1]
    top_100_chunks = [chunks[i] for i in top_100_indices]

    # Step 3: Rerank with Cross-Encoder (Cohere)
    rerank_response = co.rerank(
        query=query,
        documents=top_100_chunks,
        top_n=5,  # Surface Top-5 from 100
        model="rerank-english-v3.0"
    )

    return [top_100_chunks[r.index] for r in rerank_response.results]

# Usage (all_chunks is your pre-indexed, context-enriched chunk list)
final_chunks = hyde_retrieve("What is the company's refund policy?", all_chunks)
print(f"Retrieved {len(final_chunks)} high-quality chunks for context")

Use Cases

Enterprise document search with 1M+ documents (Legal, Finance, Compliance)
Code search systems where query intent differs heavily from function signatures
Customer support agents requiring precise policy retrieval without hallucination

Common Mistakes

Using cosine similarity alone for Top-K at scale — you will get irrelevant results at high document counts
Running Cross-Encoder reranking on 1,000+ candidates — cost scales linearly with candidate count, and each candidate requires a full transformer forward pass; always pre-filter to Top-100 with a Bi-Encoder first
Not invalidating embeddings when source documents are updated
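The last mistake has a cheap guard: key each stored embedding by a hash of the source content, and re-embed only when the hash changes. This is a hypothetical in-memory sketch; the `embed` stub and the `index` dict are placeholders for your real embedding model and vector store:

```python
import hashlib

# Hypothetical in-memory index: doc_id -> (content_hash, embedding)
index: dict[str, tuple[str, list[float]]] = {}

def embed(text: str) -> list[float]:
    """Placeholder for a real embedding call (e.g. a bi-encoder)."""
    return [float(len(text))]  # stand-in vector

def upsert(doc_id: str, content: str) -> bool:
    """Re-embed only when the document's content actually changed.
    Returns True if a (re)embedding happened."""
    digest = hashlib.sha256(content.encode()).hexdigest()
    cached = index.get(doc_id)
    if cached and cached[0] == digest:
        return False  # unchanged: cached embedding is still valid
    index[doc_id] = (digest, embed(content))
    return True

print(upsert("policy.md", "Refunds within 30 days."))  # True: first index
print(upsert("policy.md", "Refunds within 30 days."))  # False: unchanged
print(upsert("policy.md", "Refunds within 14 days."))  # True: content changed
```

With Contextual Retrieval in the pipeline, hash the raw source document rather than the enriched chunk, since the Claude-generated context changes on every regeneration.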

Interview Insight

Relevance

High - Core Senior AI interview topic at Stripe, Notion, Glean.
