Introduction
Building a production-ready Retrieval-Augmented Generation (RAG) system requires rigorous evaluation. Unlike traditional ML models, RAG systems involve multiple components (retrieval, generation, and their interaction) that all need testing.
This guide covers three leading tools for RAG evaluation: RAGAs (RAG Assessment), TruLens, and Helicone. Each provides different capabilities for measuring and improving your RAG pipeline.
Understanding RAG Evaluation
RAG systems have multiple failure modes:
- Retrieval failures: Wrong or incomplete context retrieved
- Generation failures: Model ignores context or generates hallucinations
- Pipeline failures: Poor integration between components
Key Metrics
# Core RAG metrics
metrics = {
    "context_precision": "How relevant is retrieved context?",
    "context_recall": "Does retrieved context contain answer?",
    "faithfulness": "Does answer match retrieved context?",
    "answer_relevancy": "How relevant is answer to question?",
    "answer_similarity": "How similar is answer to ground truth?",
    "answer_correctness": "Is answer factually correct?"
}
RAGAs: Retrieval-Augmented Generation Assessment
RAGAs provides a framework for evaluating RAG pipelines using an LLM-as-a-judge approach.
RAGAs Installation and Setup
# Install ragas
pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy
)
from ragas.metrics.critique import harmfulness
# Prepare test dataset
from datasets import Dataset
eval_data = Dataset.from_dict({
    "question": [
        "What is machine learning?",
        "How does a neural network work?",
        "What are transformers in NLP?"
    ],
    "answer": [
        "Machine learning is a subset of AI...",
        "Neural networks are computing systems...",
        "Transformers are a type of neural network..."
    ],
    "contexts": [
        ["ML is a method where computers learn from data..."],
        ["Neural networks consist of interconnected nodes..."],
        ["Transformers use attention mechanisms..."]
    ],
    "ground_truth": [
        "Machine learning is a subset of artificial intelligence...",
        "A neural network is inspired by biological neurons...",
        "Transformers are deep learning models introduced by Google..."
    ]
})
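In practice, the "answer" and "contexts" columns are produced by running your own pipeline over the questions rather than typed by hand. A minimal sketch of collecting them into the column layout RAGAs expects (the `rag_pipeline` stub below is a hypothetical stand-in for your real retrieval-and-generation call):

```python
def rag_pipeline(question):
    # Hypothetical stand-in for your real pipeline: returns the
    # generated answer and the list of retrieved context passages
    return f"stub answer to: {question}", ["stub context"]

def build_eval_columns(questions, ground_truths):
    # Collect pipeline outputs into the columns ragas expects;
    # wrap the result with datasets.Dataset.from_dict(...) before evaluate()
    answers, contexts = [], []
    for q in questions:
        ans, ctx = rag_pipeline(q)
        answers.append(ans)
        contexts.append(ctx)
    return {
        "question": list(questions),
        "answer": answers,
        "contexts": contexts,
        "ground_truth": list(ground_truths),
    }
```

Keeping ground truths separate from pipeline outputs makes it easy to re-run the same question set after every retriever or prompt change.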
Running RAGAs Evaluation
# Run evaluation with RAGAs
results = evaluate(
    eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_relevancy
    ]
)
# Convert to pandas for analysis
df = results.to_pandas()
print(df)
# Output:
# question faithfulness answer_relevancy context_precision context_recall
# 0 What is... 0.85 0.92 0.78 0.80
# 1 How does... 0.90 0.88 0.82 0.75
# 2 What are... 0.88 0.95 0.85 0.90
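Averages hide failures; it is usually more useful to flag individual rows that fall below a threshold and inspect them. A sketch of that triage step (the 0.7 cutoff is an arbitrary choice, and the scores below are made up for illustration):

```python
import pandas as pd

def flag_failures(df, threshold=0.7):
    # Mark rows where any metric falls below the threshold and
    # record which metrics failed, for targeted debugging
    metric_cols = [c for c in df.columns if c != "question"]
    below = df[metric_cols].lt(threshold)
    out = df.copy()
    out["failed_metrics"] = below.apply(
        lambda row: [c for c in metric_cols if row[c]], axis=1
    )
    return out[out["failed_metrics"].str.len() > 0]

# Made-up scores for illustration
df = pd.DataFrame({
    "question": ["q1", "q2"],
    "faithfulness": [0.85, 0.55],
    "context_recall": [0.80, 0.90],
})
print(flag_failures(df))  # only q2 is flagged
```

A row failing on faithfulness but not context_recall points at generation; the reverse points at retrieval.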
Custom RAGAs Metrics
from ragas.metrics import MetricWithLLM
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
# Wrap LLM for RAGAs
llm = ChatOpenAI(model="gpt-4")
llm_wrapped = LangchainLLMWrapper(llm)
# Custom metric for domain specificity
class DomainSpecificity(MetricWithLLM):
    name = "domain_specificity"
    evaluation_mode = "byLLM"

    async def _ascore(self, row, callbacks):
        prompt = f"""Rate how domain-specific this answer is (0-1):
Question: {row['question']}
Answer: {row['answer']}
Consider whether the answer uses domain-specific terminology
and provides technically accurate information.
"""
        result = await self.llm.agenerate([prompt])
        return float(result.generations[0][0].text.strip())
# Use custom metric
specificity = DomainSpecificity(llm=llm_wrapped)
results = evaluate(eval_data, metrics=[faithfulness, specificity])
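One caveat with the custom metric above: calling `float()` on raw LLM output is brittle, since models often wrap the score in prose. A defensive parsing helper (the function name and behavior are our own, not part of RAGAs):

```python
import re

def parse_score(text, default=0.0):
    # Extract the first number from the model's reply and clamp it
    # to [0, 1]; fall back to a default when no number is present
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))

print(parse_score("I'd rate this 0.8 out of 1"))   # 0.8
print(parse_score("Score: 2"))                      # clamped to 1.0
print(parse_score("cannot rate", default=0.5))      # 0.5
```

Using this in place of the bare `float(...)` keeps one malformed judge response from crashing a whole evaluation run.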
TruLens: AI Application Observability
TruLens provides observability for AI applications with a focus on feedback loops and continuous improvement.
TruLens Setup
# Install trulens (v1.x) with its LangChain and OpenAI integrations
pip install trulens trulens-apps-langchain trulens-providers-openai
from trulens.core import Feedback, TruSession
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as TruOpenAI
# Initialize the TruLens session
tru = TruSession()
# Set up provider
openai_provider = TruOpenAI()
# Define feedback functions
feedbacks = [
    Feedback(
        openai_provider.relevance_with_cot_reasons,
        name="Answer Relevance"
    ).on_input_output(),
    Feedback(
        openai_provider.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    ).on_context(collect_list=True).on_output(),
    Feedback(
        openai_provider.context_relevance_with_cot_reasons,
        name="Context Relevance"
    ).on_input().on_context(collect_list=False)
]
TruLens with RAG Chain
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Create RAG chain (assumes `docs` is your list of loaded Documents)
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
# Wrap with TruLens
tru_chain = TruChain(
    qa_chain,
    app_name="my-rag-app",
    feedbacks=feedbacks
)
# Run evaluation
questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What are transformers?"
]
records = []
for question in questions:
    # Invoke the underlying chain inside the recording context so
    # TruLens captures the trace and runs the feedback functions
    with tru_chain as recording:
        result = qa_chain.invoke(question)
    records.append({
        "question": question,
        "answer": result["result"],
        "sources": [doc.page_content for doc in result["source_documents"]]
    })
# Get feedback
tru.get_leaderboard()
Helicone: LLM Observability Platform
Helicone focuses on tracking LLM requests, costs, latency, and debugging prompts.
Helicone Basic Setup
# No extra SDK needed: Helicone acts as a drop-in proxy for the OpenAI API
import os
from openai import OpenAI

# Point the OpenAI client at Helicone's gateway and authenticate
# with your Helicone API key
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"
    }
)
# All requests routed through the proxy are automatically tracked
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# View in the Helicone dashboard; the request id is returned in the
# `helicone-id` response header
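Helicone also reads metadata from request headers, such as `Helicone-User-Id` and custom `Helicone-Property-*` headers, which let you segment cost and latency by user or feature. A small helper for building them (the helper function is ours; the header names follow Helicone's conventions):

```python
def helicone_headers(helicone_api_key, user_id=None, properties=None):
    # Build the default_headers dict for an OpenAI client that is
    # routed through Helicone's proxy
    headers = {"Helicone-Auth": f"Bearer {helicone_api_key}"}
    if user_id:
        headers["Helicone-User-Id"] = user_id
    for key, value in (properties or {}).items():
        headers[f"Helicone-Property-{key}"] = value
    return headers

headers = helicone_headers(
    "hk-example", user_id="u-42", properties={"App": "rag-demo"}
)
# client = OpenAI(base_url="https://oai.helicone.ai/v1", default_headers=headers)
```

Properties show up as filterable dimensions in the dashboard, so tagging each request with the retrieval strategy in use makes A/B cost comparisons straightforward.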
Helicone with RAG
import os
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Route the LangChain LLM through the Helicone proxy
llm = ChatOpenAI(
    model="gpt-4",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"}
)
# Create RAG chain with Helicone tracking
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)
# Query with automatic tracking
result = qa.invoke("What is machine learning?")
# Helicone captures:
# - Token usage per request
# - Latency breakdown
# - Cost analysis
# - Prompt caching stats
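Helicone derives cost from token counts and per-model prices, and the same arithmetic is easy to reproduce for quick offline estimates. In this sketch the prices are illustrative placeholders, not current OpenAI pricing:

```python
# Illustrative per-1K-token prices (placeholders, not live pricing)
PRICES = {"gpt-4": {"prompt": 0.03, "completion": 0.06}}

def estimate_cost(model, prompt_tokens, completion_tokens):
    # Cost = tokens / 1000 * per-1K price, summed over prompt and completion
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + \
           (completion_tokens / 1000) * p["completion"]

# e.g. 1200 prompt tokens + 300 completion tokens at the placeholder rates
print(estimate_cost("gpt-4", 1200, 300))  # roughly $0.054
```

For RAG specifically, prompt tokens dominate because retrieved context is stuffed into every request, which is why trimming context length is often the cheapest optimization.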
Comparison: RAGAs vs TruLens vs Helicone
| Aspect | RAGAs | TruLens | Helicone |
|---|---|---|---|
| Focus | RAG metrics | AI observability | LLM tracking |
| Metrics | Retrieval + Generation | Hallucination, relevance | Cost, latency |
| Integration | Standalone | LangChain, LlamaIndex | Any LLM |
| LLM-as-Judge | Built-in | Built-in | Via custom |
| Cost Tracking | No | Limited | Yes |
| Real-time | Batch | Real-time | Real-time |
When to Use Each Tool
Use RAGAs When:
- You need comprehensive RAG metrics
- You want automated evaluation
- You’re comparing different retrieval strategies
Use TruLens When:
- Building production AI apps
- Need real-time feedback
- Using LangChain or LlamaIndex
Use Helicone When:
- Need cost tracking and optimization
- Debugging LLM prompts
- Multi-model management
Bad Practices to Avoid
Bad Practice 1: No Evaluation Before Production
# Bad: Deploying RAG without evaluation
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
qa.invoke("production question") # No metrics!
Bad Practice 2: Ignoring Retrieval Metrics
# Bad: Only measuring answer quality
# Problem: Don't know if retrieval is the issue
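To localize failures, score retrieval on its own before looking at answer quality. A minimal hit-rate sketch over toy document ids (in practice you would use RAGAs' context_precision and context_recall, but the idea is the same):

```python
def hit_rate(retrieved_lists, relevant_ids):
    # Fraction of queries for which the known-relevant document
    # appears anywhere in the retrieved list
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_lists, relevant_ids)
        if relevant in retrieved
    )
    return hits / len(retrieved_lists)

# Toy example: doc ids retrieved per query vs. the known relevant doc
retrieved = [["d1", "d7"], ["d3", "d4"], ["d9", "d2"]]
relevant = ["d1", "d5", "d2"]
print(hit_rate(retrieved, relevant))  # 2 of 3 queries hit
```

If hit rate is low, no amount of prompt tuning on the generator will help; fix the retriever first.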
Good Practices Summary
Building Evaluation Datasets
# Good: Diverse evaluation dataset
from datasets import Dataset
eval_data = Dataset.from_dict({
    # Cover a mix of question types: factual, explanatory,
    # comparative, procedural, and opinion-based
    "question": [...],
    "answer": [...],
    "contexts": [...],
    "ground_truth": [...]
})
External Resources
- RAGAs Documentation
- TruLens Documentation
- Helicone Documentation
- RAG Evaluation Metrics - DeepLearning.AI
- RAGAs Paper