
RAG Evaluation: RAGAs, TruLens, and Helicone - Complete Guide

Created: December 22, 2025 · Larry Qu · 5 min read

Introduction

Building a production-ready Retrieval-Augmented Generation (RAG) system requires rigorous evaluation. Unlike traditional ML models, RAG systems involve multiple components—retrieval, generation, and their interaction—that all need testing.

This guide covers three leading tools for RAG evaluation: RAGAs (RAG Assessment), TruLens, and Helicone. Each provides different capabilities for measuring and improving your RAG pipeline.

Understanding RAG Evaluation

RAG systems have multiple failure modes:

  • Retrieval failures: Wrong or incomplete context retrieved
  • Generation failures: Model ignores context or generates hallucinations
  • Pipeline failures: Poor integration between components

Key Metrics

# Core RAG metrics
metrics = {
    "context_precision": "How relevant is retrieved context?",
    "context_recall": "Does retrieved context contain answer?",
    "faithfulness": "Does answer match retrieved context?",
    "answer_relevancy": "How relevant is answer to question?",
    "answer_similarity": "How similar is answer to ground truth?",
    "answer_correctness": "Is answer factually correct?"
}

RAGAs: Retrieval-Augmented Generation Assessment

RAGAs provides a framework for evaluating RAG pipelines using an LLM-as-a-judge approach.

RAGAs Installation and Setup

# Install ragas
pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy
)
from ragas.metrics.critique import harmfulness

# Prepare test dataset
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": [
        "What is machine learning?",
        "How does a neural network work?",
        "What are transformers in NLP?"
    ],
    "answer": [
        "Machine learning is a subset of AI...",
        "Neural networks are computing systems...",
        "Transformers are a type of neural network..."
    ],
    "contexts": [
        ["ML is a method where computers learn from data..."],
        ["Neural networks consist of interconnected nodes..."],
        ["Transformers use attention mechanisms..."]
    ],
    "ground_truth": [
        "Machine learning is a subset of artificial intelligence...",
        "A neural network is inspired by biological neurons...",
        "Transformers are deep learning models introduced by Google..."
    ]
})

Running RAGAs Evaluation

# Run evaluation with RAGAs
results = evaluate(
    eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_relevancy
    ]
)

# Convert to pandas for analysis
df = results.to_pandas()
print(df)

# Example output (illustrative):
#    question  faithfulness  answer_relevancy  context_precision  context_recall
# 0  What is...          0.85              0.92               0.78            0.80
# 1  How does...         0.90              0.88               0.82            0.75
# 2  What are...         0.88              0.95               0.85            0.90
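Once scores are in a DataFrame, low values can be traced back to the responsible pipeline component. A minimal triage sketch (the 0.8 threshold is an illustrative choice, not a RAGAs recommendation):

```python
def diagnose(row, threshold=0.8):
    """Map low metric scores to the pipeline component to inspect first."""
    issues = []
    if row["context_recall"] < threshold:
        issues.append("retrieval: context is missing answer material")
    if row["context_precision"] < threshold:
        issues.append("retrieval: too much irrelevant context")
    if row["faithfulness"] < threshold:
        issues.append("generation: answer strays from retrieved context")
    if row["answer_relevancy"] < threshold:
        issues.append("generation: answer drifts off-question")
    return issues or ["ok"]

# Apply per row, e.g. df.apply(diagnose, axis=1)
row = {"context_recall": 0.75, "context_precision": 0.82,
       "faithfulness": 0.90, "answer_relevancy": 0.95}
print(diagnose(row))
```

Separating retrieval issues from generation issues this way tells you whether to tune the retriever (chunking, embeddings, top-k) or the prompt/model.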

Custom RAGAs Metrics

from ragas.metrics import MetricWithLLM
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Wrap LLM for RAGAs
llm = ChatOpenAI(model="gpt-4")
llm_wrapped = LangchainLLMWrapper(llm)

# Built-in critique metric for answer harmfulness
from ragas.metrics.critique import harmfulness

# Custom metric for domain specificity
class DomainSpecificity(MetricWithLLM):
    name = "domain_specificity"
    evaluation_mode = "byLLM"

    async def _ascore(self, row, callbacks):
        prompt = f"""Rate how domain-specific this answer is (0-1):
        
Question: {row['question']}
Answer: {row['answer']}

Consider whether the answer uses domain-specific terminology 
and provides technically accurate information.
"""
        result = await self.llm.agenerate([prompt])
        return float(result.generations[0][0].text.strip())

# Use custom metric
specificity = DomainSpecificity(llm=llm_wrapped)
results = evaluate(eval_data, metrics=[faithfulness, specificity])

TruLens: AI Observability and Evaluation

TruLens provides observability for AI applications with a focus on feedback loops and continuous improvement.

TruLens Setup

# Install trulens-eval
pip install trulens-eval

import numpy as np

from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

# Initialize TruLens
tru = Tru()

# Set up provider
openai_provider = TruOpenAI()

# Define feedback functions; the context selector refers to the
# qa_chain built in the next section
context = TruChain.select_context(qa_chain)
grounded = Groundedness(groundedness_provider=openai_provider)
feedbacks = [
    Feedback(
        openai_provider.relevance_with_cot_reasons,
        name="Answer Relevance"
    ).on_input_output(),
    Feedback(
        grounded.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    ).on(context.collect()).on_output()
     .aggregate(grounded.grounded_statements_aggregator),
    Feedback(
        openai_provider.qs_relevance_with_cot_reasons,
        name="Context Relevance"
    ).on_input().on(context).aggregate(np.mean)
]

TruLens with RAG Chain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Create RAG chain (`docs` is your list of Document objects)
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# Wrap with TruLens
tru_chain = TruChain(
    qa_chain,
    app_id="my-rag-app",
    feedbacks=feedbacks
)

# Run evaluation
questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What are transformers?"
]

records = []
for question in questions:
    with tru_chain as recording:
        result = tru_chain.invoke(question)
        records.append({
            "question": question,
            "answer": result["result"],
            "sources": [doc.page_content for doc in result["source_documents"]]
        })

# Get feedback
tru.get_leaderboard(app_ids=["my-rag-app"])
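In CI, leaderboard-style scores can gate deployments: block a release when any mean feedback score drops below a floor. A minimal sketch (the metric names and 0.8 thresholds below are illustrative, not a TruLens API):

```python
def passes_quality_gate(scores, thresholds):
    """Compare mean feedback scores against per-metric minimums.

    Returns (ok, failures) where failures lists each metric below
    its threshold.
    """
    failures = [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures), failures

scores = {"Answer Relevance": 0.91, "Groundedness": 0.74,
          "Context Relevance": 0.85}
ok, failures = passes_quality_gate(
    scores,
    {"Answer Relevance": 0.8, "Groundedness": 0.8, "Context Relevance": 0.8},
)
```

A failed gate then points directly at which feedback function regressed.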

Helicone: LLM Observability Platform

Helicone focuses on tracking LLM requests, costs, latency, and debugging prompts.

Helicone Basic Setup

# No separate SDK is required: route OpenAI traffic
# through the Helicone proxy

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

# All requests are automatically tracked
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Each request now appears in the Helicone dashboard with
# tokens, latency, and cost attached

Helicone with RAG

from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Point the LangChain LLM at the Helicone proxy so every
# RAG call is tracked
llm = ChatOpenAI(
    model="gpt-4",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)

# Query with automatic tracking
result = qa.invoke("What is machine learning?")

# Helicone captures:
# - Token usage per request
# - Latency breakdown
# - Cost analysis
# - Prompt caching stats
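Helicone's cost column is just token counts multiplied by per-model prices, so the same arithmetic can be reproduced locally for budgeting. A sketch (the prices below are illustrative placeholders; check current provider rates):

```python
# Illustrative per-1K-token prices in USD; real rates change,
# so treat these values as placeholders
PRICES = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
}

def request_cost(model, prompt_tokens, completion_tokens):
    """Estimate one request's cost in USD from its token usage."""
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] \
         + (completion_tokens / 1000) * p["completion"]

cost = request_cost("gpt-4", prompt_tokens=1200, completion_tokens=300)
```

Summing this over a day's traffic gives a sanity check against the dashboard's reported spend.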

Comparison: RAGAs vs TruLens vs Helicone

Aspect           RAGAs                    TruLens                   Helicone
Focus            RAG metrics              AI observability          LLM tracking
Metrics          Retrieval + generation   Hallucination, relevance  Cost, latency
Integration      Standalone               LangChain, LlamaIndex     Any LLM
LLM-as-Judge     Built-in                 Built-in                  Via custom
Cost tracking    No                       Limited                   Yes
Evaluation mode  Batch                    Real-time                 Real-time

When to Use Each Tool

Use RAGAs When:

  • You need comprehensive RAG metrics
  • You want automated evaluation
  • You’re comparing different retrieval strategies

Use TruLens When:

  • Building production AI apps
  • Need real-time feedback
  • Using LangChain or LlamaIndex

Use Helicone When:

  • Need cost tracking and optimization
  • Debugging LLM prompts
  • Multi-model management

Bad Practices to Avoid

Bad Practice 1: No Evaluation Before Production

# Bad: Deploying RAG without evaluation
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
qa.invoke("production question")  # No metrics!

Bad Practice 2: Ignoring Retrieval Metrics

# Bad: Only measuring answer quality
# Problem: Don't know if retrieval is the issue
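A cheap first check on retrieval in isolation is hit rate: the fraction of queries whose known-relevant document shows up in the retrieved set. A self-contained sketch (the document IDs are made up for illustration):

```python
def hit_rate(retrieved_ids, relevant_id_per_query):
    """Fraction of queries whose relevant doc id appears in its
    retrieved list."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_ids, relevant_id_per_query)
        if relevant in retrieved
    )
    return hits / len(relevant_id_per_query)

# Two of three queries retrieved their relevant document
retrieved = [["d1", "d7"], ["d2", "d9"], ["d4", "d5"]]
relevant = ["d1", "d3", "d5"]
rate = hit_rate(retrieved, relevant)
```

If hit rate is low, no amount of prompt tuning will fix answer quality; fix the retriever first.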

Good Practices Summary

Building Evaluation Datasets

# Good: Diverse evaluation dataset
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": [
        # One placeholder per question type; in practice, use real
        # questions spanning factual, explanatory, comparative,
        # procedural, and opinion styles
        "factual", "explanatory", "comparative",
        "procedural", "opinion"
    ] * 20,
    "answer": [...],
    "contexts": [...],
    "ground_truth": [...]
})
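With a typed dataset, scores should also be broken down per question type, so a regression in one category isn't averaged away by the others. A pure-Python sketch (the `type` and `faithfulness` field names are assumptions about how you store results):

```python
from collections import defaultdict

def scores_by_type(rows):
    """Average a metric per question type from per-row results."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        bucket = totals[row["type"]]
        bucket[0] += row["faithfulness"]
        bucket[1] += 1
    return {qtype: total / count for qtype, (total, count) in totals.items()}

rows = [
    {"type": "factual", "faithfulness": 0.9},
    {"type": "factual", "faithfulness": 0.7},
    {"type": "procedural", "faithfulness": 0.6},
]
per_type = scores_by_type(rows)
```

Comparing these per-type averages across releases surfaces category-specific regressions early.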
