Introduction
Building a RAG (Retrieval-Augmented Generation) system is only half the battle. To ensure it works well, you need to evaluate it rigorously: without evaluation, hallucinations and irrelevant results slip through unnoticed, and users lose trust.
This guide covers the essentials of RAG evaluation: metrics, tools, benchmarking, and optimization strategies.
Why RAG Evaluation Matters
The Challenge
RAG systems have multiple failure points:
| Component | Failure Mode |
|---|---|
| Retriever | Retrieves irrelevant documents |
| Chunker | Creates poor chunks |
| Embedder | Produces poor embeddings |
| Generator | Hallucinates answers |
| Re-ranker | Misses relevant results |
Evaluation Impact
Without evaluation:
- Unknown system quality
- No improvement direction
- User complaints increase
- Trust erodes
Evaluation Metrics
Retrieval Metrics
Precision@K
```python
def precision_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = [doc for doc in retrieved_k if doc in relevant]
    return len(relevant_retrieved) / k
```
Recall@K
```python
def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = [doc for doc in retrieved_k if doc in relevant]
    return len(relevant_retrieved) / len(relevant) if relevant else 0.0
```
Mean Reciprocal Rank (MRR)
```python
def mean_reciprocal_rank(queries: list) -> float:
    """Average 1/rank of the first relevant document across queries."""
    if not queries:
        return 0.0
    reciprocal_ranks = []
    for query in queries:
        for i, doc in enumerate(query.retrieved, 1):
            if doc.is_relevant:
                reciprocal_ranks.append(1 / i)
                break
        else:
            # No relevant document was retrieved for this query
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
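A quick sanity check of these functions on toy data (the document IDs and relevance flags are made up):

```python
from types import SimpleNamespace

retrieved = ["d3", "d7", "d1", "d9", "d2"]  # ranked results for one query
relevant = ["d1", "d2", "d5"]               # ground-truth relevant docs

print(precision_at_k(retrieved, relevant, k=5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2/3 ≈ 0.67

# MRR expects per-query objects with a ranked `retrieved` list
query = SimpleNamespace(retrieved=[
    SimpleNamespace(is_relevant=False),
    SimpleNamespace(is_relevant=True),  # first hit at rank 2 -> RR = 0.5
])
print(mean_reciprocal_rank([query]))  # 0.5
```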
Generation Metrics
Faithfulness
```python
def faithfulness(response: str, context: list) -> float:
    """Check whether the response is supported by the retrieved context.

    `llm` is assumed to be a client whose `invoke` returns plain text.
    """
    prompt = f"""
    Given the response and context, determine if the response
    is supported by the context.
    Response: {response}
    Context: {" ".join(context)}
    Is the response supported by the context? Yes or No.
    """
    result = llm.invoke(prompt)
    return 1.0 if "yes" in result.lower() else 0.0
```
Answer Relevance
```python
def answer_relevance(response: str, question: str) -> float:
    """LLM-judged relevance of the answer to the question, clamped to [0, 1]."""
    prompt = f"""
    Question: {question}
    Answer: {response}
    Rate the relevance of the answer to the question from 0 to 1.
    Just output the number.
    """
    result = llm.invoke(prompt)
    try:
        return min(max(float(result.strip()), 0.0), 1.0)
    except ValueError:
        # The model returned something other than a bare number
        return 0.0
```
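These per-sample scores are most useful averaged over a test set. A minimal harness tying them together (`rag_pipeline` is a hypothetical pipeline that returns a string answer):

```python
def evaluate_generation(test_cases: list) -> dict:
    """Average the LLM-judged generation metrics over a test set."""
    faith_scores, relevance_scores = [], []
    for case in test_cases:
        response = rag_pipeline.invoke(case["question"])
        faith_scores.append(faithfulness(response, case["contexts"]))
        relevance_scores.append(answer_relevance(response, case["question"]))
    return {
        "faithfulness": sum(faith_scores) / len(faith_scores),
        "answer_relevance": sum(relevance_scores) / len(relevance_scores),
    }
```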
Evaluation Tools
RAGAs
One of the most widely used open-source frameworks for RAG evaluation:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare test data (ragas scores with an LLM judge under the hood,
# so provider credentials must be configured)
test_data = Dataset.from_dict({
    "question": ["What is RAG?", "How does retrieval work?"],
    "answer": ["RAG is...", "Retrieval works by..."],
    "contexts": [["RAG is..."], ["Retrieval works..."]],
    "ground_truth": ["RAG is...", "Retrieval works by..."],
})

# Evaluate
results = evaluate(
    test_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print(results)
```
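The returned result also converts to a per-sample DataFrame, which makes it easy to inspect the worst-scoring questions (method names may differ across ragas versions):

```python
df = results.to_pandas()
print(df.sort_values("faithfulness").head())  # lowest-faithfulness samples first
```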
TruLens
For production monitoring. A sketch following the `trulens_eval` 0.x quickstart (a LlamaIndex `query_engine` is assumed; the API differs across versions):

```python
from trulens_eval import Feedback, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI

provider = OpenAI()

# Groundedness: is the response supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Answer relevance: is the response helpful for the user query?
f_answer_relevance = Feedback(provider.relevance).on_input().on_output()

# Record and score queries as they run
tru_recorder = TruLlama(
    query_engine,
    app_id="rag_v1",
    feedbacks=[f_groundedness, f_answer_relevance],
)
with tru_recorder as recording:
    query_engine.query("What is RAG?")
```
LangSmith
A sketch with the `langsmith` SDK (the dataset contents and `correctness_evaluator` are illustrative):

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Define the evaluation dataset
dataset = client.create_dataset("RAG Evaluation")
client.create_examples(
    inputs=[{"question": "What is RAG?"}],
    outputs=[{"answer": "RAG is..."}],
    dataset_id=dataset.id,
)

# Run evaluation: the target maps dataset inputs to pipeline outputs
results = evaluate(
    lambda inputs: rag_pipeline.invoke(inputs["question"]),
    data="RAG Evaluation",
    evaluators=[correctness_evaluator],  # user-defined or off-the-shelf
)
```
Building Test Datasets
Creating Ground Truth
```python
def create_test_dataset(documents: list, questions_per_doc: int = 5):
    """Create a test dataset from documents.

    `generate_questions` and `generate_answer` are LLM-backed helpers;
    a sketch of the first follows below.
    """
    test_cases = []
    for doc in documents:
        # Generate questions about the document
        questions = generate_questions(doc, n=questions_per_doc)
        for question in questions:
            # Get the expected answer from the source document
            answer = generate_answer(doc, question)
            test_cases.append({
                "question": question,
                "answer": answer,
                "contexts": [doc],
                "ground_truth": answer,
            })
    return test_cases
```
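A minimal sketch of the `generate_questions` helper, assuming the same plain-text `llm` client used in the metric functions above:

```python
def generate_questions(doc: str, n: int = 5) -> list:
    """Ask an LLM for n questions answerable from the document alone."""
    prompt = f"""
    Write {n} questions that can be answered using only the text below.
    Output one question per line, with no numbering.

    Text: {doc}
    """
    result = llm.invoke(prompt)
    lines = [line.strip() for line in result.splitlines() if line.strip()]
    return lines[:n]
```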
Question Types
Cover several question categories in the test set:
1. **Factoid**: Specific facts from documents
- "What year was X founded?"
2. **Definition**: Explanations
- "What is RAG?"
3. **Process**: How things work
- "How does indexing work?"
4. **Comparison**: Differences
- "What's the difference between X and Y?"
5. **Edge cases**: Boundary conditions
- "What happens if no documents match?"
Benchmarking Strategies
A/B Testing
```python
from scipy import stats

def ab_test(rag_v1, rag_v2, test_dataset):
    """Compare two RAG systems on the same test set."""
    results_v1 = evaluate_system(rag_v1, test_dataset)  # hypothetical harness
    results_v2 = evaluate_system(rag_v2, test_dataset)

    # Paired t-test: both systems are scored on the same queries
    t_stat, p_value = stats.ttest_rel(
        results_v1.scores,
        results_v2.scores,
    )
    return {
        "v1_score": results_v1.avg_score,
        "v2_score": results_v2.avg_score,
        "improvement": results_v2.avg_score - results_v1.avg_score,
        "significant": p_value < 0.05,
    }
```
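The t-test assumes roughly normal score differences; on small or skewed score sets, a paired bootstrap over the same queries is a distribution-free cross-check:

```python
import numpy as np

def bootstrap_diff_ci(scores_v1, scores_v2, n_boot=10_000, alpha=0.05):
    """Bootstrap CI for the mean score difference (v2 - v1), paired by query."""
    rng = np.random.default_rng(0)
    v1, v2 = np.asarray(scores_v1), np.asarray(scores_v2)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(v1), len(v1))  # resample query indices
        diffs[b] = v2[idx].mean() - v1[idx].mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi  # the improvement is significant if the interval excludes 0
```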
Regression Testing
Track metrics over time to catch drift:

```python
from datetime import datetime

import pandas as pd

def track_metrics(system, test_dataset, runs=10):
    """Track metrics across multiple runs."""
    results = []
    for i in range(runs):
        # Perturb the data to simulate production variation (hypothetical helper)
        noisy_dataset = add_noise(test_dataset)
        score = evaluate_system(system, noisy_dataset)
        results.append({
            "run": i,
            "faithfulness": score.faithfulness,
            "relevancy": score.relevancy,
            "timestamp": datetime.now(),
        })
    df = pd.DataFrame(results)

    # Summary statistics for regression checks
    avg_score = df["faithfulness"].mean()
    std = df["faithfulness"].std()
    return {"mean": avg_score, "std": std, "data": df}
```
Common Issues and Fixes
Issue 1: Retrieved docs not relevant
```python
# Fix: improve the chunking strategy

# Before: fixed-size chunks
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=500)

# After: semantic chunking (splits where embedding similarity drops)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)
```
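Whether the new splitter actually helps should be decided by the retrieval metrics defined earlier, not by eyeballing chunks. A sketch, where `build_retriever`, the chunk `id` metadata, and the per-query `relevant_ids` field are all hypothetical:

```python
def compare_chunking(splitters: dict, docs: list, test_queries: list, k: int = 5):
    """Rebuild the index per splitter and compare average precision@k."""
    scores = {}
    for name, splitter in splitters.items():
        chunks = splitter.split_documents(docs)
        retriever = build_retriever(chunks)  # hypothetical: embed + index chunks
        precisions = [
            precision_at_k(
                [d.metadata["id"] for d in retriever.invoke(q["question"])],
                q["relevant_ids"],
                k,
            )
            for q in test_queries
        ]
        scores[name] = sum(precisions) / len(precisions)
    return scores
```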
Issue 2: Poor answer quality
```python
# Fix: improve the prompt

# Before: basic prompt
prompt = """Answer based on context: {context}
Question: {question}"""

# After: structured prompt with an example
prompt = """You are a helpful assistant. Use the context
to answer the question accurately.

Example:
Context: The company was founded in 2020.
Question: When was it founded?
Answer: The company was founded in 2020.

Now answer:
Context: {context}
Question: {question}
Answer:"""
```
Issue 3: High latency
```python
# Fix: add caching and async retrieval
import asyncio
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_cached_embedding(text: str):
    # Avoid re-embedding repeated queries (embedding_model is assumed)
    return embedding_model.encode(text)

async def async_retrieve(query):
    # Fan out sub-queries in parallel; get_relevant_docs must be async
    tasks = [
        get_relevant_docs(chunk)
        for chunk in query.chunks
    ]
    results = await asyncio.gather(*tasks)
    return combine_results(results)
```
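A self-contained demo of the parallel fan-out (the sleep stands in for an async vector-store call):

```python
import asyncio
import time

async def fetch_docs(sub_query: str) -> list:
    await asyncio.sleep(0.1)  # stand-in for an async vector-store call
    return [f"doc for {sub_query!r}"]

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch_docs(q) for q in ["a", "b", "c"]))
    print(results, f"elapsed: {time.perf_counter() - start:.2f}s")  # ~0.1s, not 0.3s

asyncio.run(main())
```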
Production Monitoring
Dashboard Metrics
```python
# Key metrics to track in production
METRICS = {
    "retrieval_latency": "P50, P95, P99",
    "generation_latency": "P50, P95, P99",
    "retrieval_recall": "Average @ top_k",
    "faithfulness": "Per-query score",
    "answer_relevancy": "Per-query score",
    "user_satisfaction": "Thumbs up/down ratio",
    "error_rate": "Failed requests / total",
}
```
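The latency percentiles fall out of the raw request timings directly, e.g. with NumPy (`recent_latencies_ms` is assumed to come from your request logs):

```python
import numpy as np

p50, p95, p99 = np.percentile(recent_latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```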
Alerting
```python
# Alert on degradation relative to a baseline window
def check_metrics(current_metrics, baseline_metrics):
    alerts = []
    # Faithfulness drop
    if current_metrics.faithfulness < baseline_metrics.faithfulness * 0.9:
        alerts.append("Faithfulness dropped by >10%")
    # Latency spike
    if current_metrics.p95_latency > baseline_metrics.p95_latency * 1.5:
        alerts.append("P95 latency increased by >50%")
    # High error rate
    if current_metrics.error_rate > 0.05:
        alerts.append("Error rate above 5%")
    return alerts
```
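Run on a schedule, comparing a recent window against a longer baseline (the window loader and pager hook are hypothetical):

```python
alerts = check_metrics(
    current_metrics=load_metrics_window("1h"),   # hypothetical metrics loader
    baseline_metrics=load_metrics_window("7d"),
)
for alert in alerts:
    notify_oncall(alert)  # hypothetical pager hook
```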
Conclusion
RAG evaluation is essential for production systems. Use the right metrics, build comprehensive test sets, and monitor in production.
Key takeaways:
- **Measure both retrieval and generation**: end-to-end metrics matter
- **Build diverse test sets**: cover different question types
- **Monitor in production**: catch regressions early
- **Iterate based on data**: let metrics guide improvements