RAG Evaluation & Observability
RAGAS metrics, LLM-as-judge, and tracing with LangSmith.
If You Can't Measure It, You Can't Ship It
Production RAG Eval Pipeline
Most teams ship RAG demos without ever knowing whether they actually work. Production RAG demands a rigorous evaluation framework. You need to know your faithful answer rate, your context recall, and your hallucination rate before a customer finds the failures for you.
RAGAS: The Industry Standard RAG Metric Suite
RAGAS provides 4 key metrics calculated by an LLM judge:
- Faithfulness: Does the generated answer only use facts from the retrieved context? (Hallucination score)
- Answer Relevancy: Does the answer actually address the user's question?
- Context Precision: What fraction of the retrieved chunks were actually useful?
- Context Recall: Did retrieval surface all the chunks needed to answer the question?
Low Context Precision means your retrieval is pulling junk. Low Context Recall means your retrieval or chunking strategy is losing information before generation. These failures have completely different fixes, and metrics like these are how you tell them apart.
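To make the distinction concrete, here is a minimal sketch of what the two retrieval metrics measure. This is a toy with hand-assigned binary relevance labels, not the RAGAS implementation, which derives these judgments with an LLM:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that were actually useful."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], needed: set[str]) -> float:
    """Fraction of the chunks required to answer that retrieval surfaced."""
    if not needed:
        return 1.0
    return sum(1 for c in needed if c in retrieved) / len(needed)

# Hypothetical chunk IDs for illustration
retrieved = ["refund-policy", "shipping-rates", "blog-post"]
needed = {"refund-policy", "refund-exceptions"}

print(context_precision(retrieved, needed))  # 1/3 -> retrieval is pulling junk
print(context_recall(retrieved, needed))     # 1/2 -> a needed chunk never surfaced
```

Note how the same retrieval run can fail both ways at once: two of the three retrieved chunks are junk (precision), and one required chunk is missing entirely (recall).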
LLM-as-Judge Pattern
Beyond automated metrics, enterprise teams run LLM-as-Judge evaluations. You define an evaluation rubric (correctness, completeness, tone) and have a strong model (e.g., GPT-4o) score each output from 1-5 with a reasoning trace. This gives you a scalable labeling pipeline that costs on the order of $0.01 per evaluation vs. roughly $50 for a human annotator.
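A minimal sketch of the pattern. The prompt shape and JSON contract are assumptions, and `call_judge` is a hypothetical stand-in for a real judge-model API call (here it returns a canned response so the sketch is self-contained):

```python
import json

RUBRIC = """You are an evaluator. Score the answer on each criterion from 1-5.
Criteria: correctness, completeness, tone.
Return JSON: {"scores": {...}, "reasoning": "..."}"""

def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"

def call_judge(prompt: str) -> str:
    # Hypothetical stand-in for an API call to a strong judge model;
    # returns the judge's raw JSON response.
    return json.dumps({
        "scores": {"correctness": 5, "completeness": 4, "tone": 5},
        "reasoning": "Factually correct; omits the express-shipping option."
    })

def judge(question: str, answer: str) -> dict:
    raw = call_judge(build_judge_prompt(question, answer))
    return json.loads(raw)  # structured scores plus a reasoning trace

verdict = judge("How long does shipping take?", "5-7 business days.")
print(verdict["scores"], verdict["reasoning"])
```

Asking for JSON with both scores and reasoning matters in practice: the reasoning trace lets you audit the judge itself when a score looks suspicious.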
Tracing with LangSmith
In production, every agent run generates a trace: the full message history, every tool call with its latency, token counts, and the final output. LangSmith captures and visualizes these traces, letting you drill into exactly which retrieval step failed, how long each LLM call took, and where your token budget was wasted. Without this, debugging production failures is archaeological guesswork.
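To see what such a trace contains, here is an illustrative data model in plain Python. This is not LangSmith's actual schema, just a sketch of the per-step fields a trace records and the kind of question (where is the bottleneck?) it lets you answer:

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str          # step name, e.g. "retrieve", "rerank", "llm_call"
    latency_ms: float  # wall-clock time for this step
    tokens: int        # tokens consumed by this step (0 for non-LLM steps)

# One agent run = an ordered list of spans, analogous to one trace
trace = [
    Span("retrieve", 120.0, 0),
    Span("rerank", 45.0, 0),
    Span("llm_call", 1800.0, 2300),
]

slowest = max(trace, key=lambda s: s.latency_ms)
total_tokens = sum(s.tokens for s in trace)
print(f"bottleneck={slowest.name} total_tokens={total_tokens}")
```

With real LangSmith traces you get the same drill-down (plus full message histories and tool-call inputs/outputs) in a UI, without building the data model yourself.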
Code Example
RAGAS evaluation pipeline with LangSmith tracing. Low context_precision tells you to fix retrieval. Low faithfulness tells you the LLM is hallucinating outside the context. These metrics diagnose different parts of the pipeline.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
import os

# Enable LangSmith tracing for all LLM calls
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "rag-production-eval"

# Build evaluation dataset (your golden test set)
eval_data = {
    "question": [
        "What is the company's refund policy?",
        "How long does shipping take?"
    ],
    "answer": [
        "Refunds are processed within 30 business days.",  # Your RAG output
        "Standard shipping takes 5-7 business days."
    ],
    "contexts": [
        ["Our 30-day return policy allows full refunds..."],  # Retrieved chunks
        ["We offer standard shipping (5-7 days) and express (2-3 days)..."]
    ],
    "ground_truth": [
        "The company processes refunds within 30 business days.",  # Gold answer
        "Standard orders ship in 5-7 business days."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run full RAGAS evaluation suite
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)
# Output:
# {'faithfulness': 0.95, 'answer_relevancy': 0.88,
#  'context_precision': 0.72, 'context_recall': 0.91}

# Low context_precision (0.72) means 28% of retrieved chunks were useless
# Fix: add Cross-Encoder reranking to filter junk before the LLM call
Relevance
High - You cannot ship RAG to production without a measurement framework.