RAG Evaluation & Observability

RAGAS metrics, LLM-as-judge, and tracing with LangSmith.

If You Can't Measure It, You Can't Ship It

Production RAG Eval Pipeline

Test Dataset → RAG Pipeline (retrieve + generate) → LLM Judge (score quality) → CI/CD (pass/fail) → 🚀

Most teams ship RAG demos and never know if it actually works. Production RAG demands a rigorous evaluation framework. You need to know your faithful answer rate, your context recall, and your hallucination rate before a customer finds the failures for you.

RAGAS: The Industry Standard RAG Metric Suite

RAGAS provides 4 key metrics calculated by an LLM judge:

  • Faithfulness: Does the generated answer only use facts from the retrieved context? (Hallucination score)
  • Answer Relevancy: Does the answer actually address the user's question?
  • Context Precision: What fraction of the retrieved chunks were actually useful?
  • Context Recall: Did retrieval surface all the chunks needed to answer the question?

Low Context Precision means your retrieval is pulling junk. Low Context Recall means your chunking strategy is losing information. These call for completely different fixes, and component-level metrics like these are how you tell which problem you actually have.
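To build intuition for the precision/recall split, here is a simplified sketch of the two quantities. (This is not the exact RAGAS computation, which uses an LLM judge to label chunk relevance statement by statement; it assumes you already know which chunks were relevant.)

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that were actually useful."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of the needed chunks that retrieval surfaced."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
needed = ["chunk_a", "chunk_b", "chunk_e"]

print(context_precision(retrieved, needed))  # 0.5 -> half the retrieved chunks were junk
print(context_recall(retrieved, needed))     # ~0.67 -> one needed chunk was never retrieved
```

Note how the same retrieval run can fail on one axis and pass on the other, which is exactly why the two metrics point at different fixes.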

LLM-as-Judge Pattern

Beyond automated metrics, enterprise teams run LLM-as-Judge evaluations. You define an evaluation rubric (correctness, completeness, tone) and have a strong model (GPT-4o) score each output from 1-5 with a reasoning trace. This generates a scalable labeling pipeline that costs $0.01 per evaluation vs. $50 for a human annotator.
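The mechanics of the pattern can be sketched as prompt construction plus score parsing. (The rubric text, criterion names, and the judge's reply below are illustrative; in practice the reply would come from your GPT-4o client, which is omitted here.)

```python
import re

RUBRIC = """Score the answer from 1-5 on each criterion:
- correctness: factually accurate given the context
- completeness: addresses every part of the question
- tone: professional and on-brand
Reply as: criterion: score -- one-line reasoning"""

def build_judge_prompt(question, answer, context):
    """Assemble the full prompt sent to the judge model."""
    return (f"{RUBRIC}\n\nQuestion: {question}\n"
            f"Context: {context}\nAnswer: {answer}")

def parse_scores(judge_reply):
    """Extract 'criterion: score' pairs from the judge's reply."""
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r"(\w+):\s*([1-5])", judge_reply)}

# Example reply a judge model might return:
reply = ("correctness: 5 -- matches the retrieved policy\n"
         "completeness: 4 -- omits the express option\n"
         "tone: 5 -- professional")
print(parse_scores(reply))  # {'correctness': 5, 'completeness': 4, 'tone': 5}
```

Keeping the reasoning trace alongside each score is what makes the labels auditable: when a score looks wrong, you read the judge's one-line justification instead of re-running the evaluation.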

Tracing with LangSmith

In production, every agent run generates a trace: the full message history, every tool call with its latency, token counts, and the final output. LangSmith captures and visualizes these traces, letting you drill into exactly which retrieval step failed, how long each LLM call took, and where your token budget was wasted. Without this, debugging production failures is archaeological guesswork.
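Conceptually, a trace is a tree of timed spans. The sketch below is an illustrative data structure, not the LangSmith SDK, showing the kind of record a tracing backend captures per run: a name, latency, token count, and child spans.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                      # e.g. "retrieve", "llm_call", "tool:search"
    start: float = 0.0
    latency_ms: float = 0.0
    tokens: int = 0
    children: list = field(default_factory=list)

def timed_span(name, fn, *args, tokens=0):
    """Run fn and record how long it took, tracing-style."""
    span = Span(name=name, start=time.time(), tokens=tokens)
    result = fn(*args)
    span.latency_ms = (time.time() - span.start) * 1000
    return result, span

# Simulate one agent step: retrieval followed by an LLM call
docs, retrieve_span = timed_span("retrieve", lambda q: ["chunk_a"], "refund policy")
answer, llm_span = timed_span("llm_call", lambda d: "30-day refunds", docs, tokens=412)

root = Span(name="agent_run", children=[retrieve_span, llm_span])
print([(s.name, s.tokens) for s in root.children])  # [('retrieve', 0), ('llm_call', 412)]
```

With every step recorded this way, "which retrieval step failed" and "where the token budget went" become queries over the span tree rather than guesswork.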

Code Example

RAGAS evaluation pipeline with LangSmith tracing. Low context_precision tells you to fix retrieval. Low faithfulness tells you the LLM is hallucinating outside the context. These metrics diagnose different parts of the pipeline.

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
import os

# Enable LangSmith tracing for all LLM calls
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "rag-production-eval"

# Build evaluation dataset (your golden test set)
eval_data = {
    "question": [
        "What is the company's refund policy?",
        "How long does shipping take?"
    ],
    "answer": [
        "Refunds are processed within 30 business days.",  # Your RAG output
        "Standard shipping takes 5-7 business days."
    ],
    "contexts": [
        ["Our 30-day return policy allows full refunds..."],  # Retrieved chunks
        ["We offer standard shipping (5-7 days) and express (2-3 days)..."]
    ],
    "ground_truth": [
        "The company processes refunds within 30 business days.",  # Gold answer
        "Standard orders ship in 5-7 business days."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run the full RAGAS evaluation suite
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)
# Example output:
# {'faithfulness': 0.95, 'answer_relevancy': 0.88,
#  'context_precision': 0.72, 'context_recall': 0.91}

# Low context_precision (0.72) means 28% of retrieved chunks were useless
# Fix: add Cross-Encoder reranking to filter junk before the LLM call
```

Use Cases

  • Pre-deployment validation before releasing a new RAG system to production
  • A/B testing retrieval strategies (cosine similarity vs. reranking vs. HyDE)
  • Continuous monitoring dashboards for deployed RAG systems
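The pre-deployment use case can be wired into CI as a simple threshold gate over the RAGAS scores. A minimal sketch, assuming the score dictionary shown earlier (the threshold values here are illustrative; tune them to your product's risk tolerance):

```python
THRESHOLDS = {
    "faithfulness": 0.90,       # near-zero tolerance for hallucination
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.85,
}

def ci_gate(scores):
    """Return the list of failing metrics; an empty list means ship."""
    return [metric for metric, floor in THRESHOLDS.items()
            if scores.get(metric, 0.0) < floor]

scores = {"faithfulness": 0.95, "answer_relevancy": 0.88,
          "context_precision": 0.72, "context_recall": 0.91}

failures = ci_gate(scores)
if failures:
    raise SystemExit(f"Eval gate failed: {failures}")
print("Eval gate passed -- safe to deploy")
```

Exiting non-zero on failure is what lets this double as the pass/fail step in the pipeline diagram at the top of the page: the deploy job simply refuses to run if the gate script fails.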

Common Mistakes

  • Evaluating RAG quality only with human thumbs-up/thumbs-down — too slow and expensive to be useful at scale
  • Not building a golden evaluation dataset before starting development
  • Optimizing for answer quality without measuring context precision — you may be wasting 60% of your context window on junk

Relevance

High - You cannot ship RAG to production without a measurement framework.
