Evaluating RAG Performance
Metrics like Faithfulness and Answer Relevancy.
How Do You Know Your RAG System Works?
RAG Evaluation Framework
Building a RAG pipeline is easy. Building one that works reliably is hard. Evaluation is the difference between a demo and a production system.
Key Metrics (RAGAS Framework)
- Faithfulness: Does the answer only use information from the retrieved context? (Prevents hallucination)
- Answer Relevancy: Does the answer actually address the user's question?
- Context Precision: Are the retrieved chunks relevant to the question?
- Context Recall: Did we retrieve all the chunks needed to answer?
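RAGAS scores these metrics with an LLM judge, but the two retrieval-side metrics have a simple intuition: precision and recall over relevance labels. A minimal sketch with hypothetical chunk IDs (not the RAGAS implementation):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were actually retrieved."""
    if not relevant:
        return 1.0
    return sum(chunk in relevant for chunk in retrieved) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c"]
relevant = {"chunk_a", "chunk_c", "chunk_d"}
print(context_precision(retrieved, relevant))  # 2 of 3 retrieved are relevant -> 0.666...
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved -> 0.666...
```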
Evaluation Approach
- Create a test set of question-answer-context triplets
- Run your RAG pipeline on the questions
- Score each response using the metrics above
- Use an LLM-as-judge for automated scoring at scale
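The last step can be sketched as a scoring loop. The prompt wording and the `judge_fn` hook below are illustrative, not part of RAGAS; in practice `judge_fn` would wrap a chat-model API call:

```python
JUDGE_PROMPT = (
    "Rate from 0.0 to 1.0 how faithful the ANSWER is to the CONTEXT.\n"
    "Reply with the number only.\n\n"
    "CONTEXT: {context}\nQUESTION: {question}\nANSWER: {answer}\n"
)

def mean_faithfulness(test_set: list[dict], judge_fn) -> float:
    """Score every question/answer/context triplet with an LLM judge
    and return the mean score across the test set."""
    scores = [float(judge_fn(JUDGE_PROMPT.format(**case))) for case in test_set]
    return sum(scores) / len(scores)

# Stub judge for illustration; swap in a real model call at scale.
cases = [
    {"context": "RAG combines retrieval with generation.",
     "question": "What is RAG?",
     "answer": "RAG augments generation with retrieved context."},
]
print(mean_faithfulness(cases, judge_fn=lambda prompt: 0.9))  # 0.9
```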
Code Example
Using the RAGAS framework to evaluate RAG quality across faithfulness, relevancy, and precision metrics.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Your evaluation dataset
eval_data = {
    "question": ["What is RAG?", "How does chunking work?"],
    "answer": ["RAG is Retrieval-Augmented Generation...", "Chunking splits..."],
    "contexts": [["RAG combines retrieval..."], ["Text is split into..."]],
    "ground_truth": ["RAG is a technique...", "Chunking divides..."]
}

dataset = Dataset.from_dict(eval_data)

# Evaluate
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(results)
# example output: {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85}
```
Use Cases
- Regression testing RAG pipelines after changes
- Comparing different chunking strategies objectively
- Identifying retrieval vs. generation failures
- Setting quality baselines for production monitoring
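As a sketch of the regression-testing use case, a hypothetical CI quality gate that compares a fresh evaluation run against baseline scores (the metric names match RAGAS; the thresholds are made up):

```python
# Illustrative baselines; tune these to your own pipeline's history.
BASELINES = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_precision": 0.80}

def check_regression(results: dict, baselines: dict = BASELINES) -> list[str]:
    """Return the names of any metrics that fell below their baseline."""
    return [name for name, floor in baselines.items()
            if results.get(name, 0.0) < floor]

results = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_precision": 0.85}
failed = check_regression(results)
assert not failed, f"Metrics regressed: {failed}"  # passes: all metrics at or above baseline
```

Running this after every chunking or prompt change turns "it feels right" into a pass/fail signal.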
Common Mistakes
- Not evaluating at all: "it feels right" is not a metric
- Only testing with easy questions that any approach would handle
- Evaluating retrieval and generation together instead of separately
- Using too small a test set (aim for at least 50 diverse examples)
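Separating retrieval failures from generation failures can start as simple triage on the two metric families. The thresholds below are illustrative, not from RAGAS:

```python
def diagnose(context_recall: float, faithfulness: float,
             recall_floor: float = 0.8, faith_floor: float = 0.9) -> str:
    """Crude triage: low recall means retrieval missed the evidence;
    good recall but low faithfulness means generation strayed from it."""
    if context_recall < recall_floor:
        return "retrieval failure"
    if faithfulness < faith_floor:
        return "generation failure"
    return "ok"

print(diagnose(context_recall=0.5, faithfulness=0.95))  # retrieval failure
print(diagnose(context_recall=0.9, faithfulness=0.60))  # generation failure
print(diagnose(context_recall=0.9, faithfulness=0.95))  # ok
```

Diagnosing each failing test case this way tells you whether to fix the retriever (chunking, embeddings, top-k) or the generator (prompt, model).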
Interview Insight
Relevance
Medium - Important for production systems