⚡ Calmops

RAG Evaluation: RAGAs, TruLens, and Helicone - Complete Guide

Introduction

Building a production-ready Retrieval-Augmented Generation (RAG) system requires rigorous evaluation. Unlike traditional ML models, RAG systems involve multiple components (retrieval, generation, and their interaction) that all need testing.

This guide covers three leading tools for RAG evaluation: RAGAs (RAG Assessment), TruLens, and Helicone. Each provides different capabilities for measuring and improving your RAG pipeline.

Understanding RAG Evaluation

RAG systems have multiple failure modes:

  • Retrieval failures: Wrong or incomplete context retrieved
  • Generation failures: Model ignores context or generates hallucinations
  • Pipeline failures: Poor integration between components
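These failure modes can be triaged from metrics: low context recall implicates retrieval, while low faithfulness despite good retrieval implicates generation. A minimal triage sketch, with illustrative threshold values:

```python
def diagnose_failure(context_recall: float, faithfulness: float,
                     recall_threshold: float = 0.5,
                     faith_threshold: float = 0.5) -> str:
    """Rough triage of a single RAG failure (thresholds are illustrative)."""
    if context_recall < recall_threshold:
        return "retrieval failure"   # the answer never reached the context
    if faithfulness < faith_threshold:
        return "generation failure"  # context was fine, the model ignored it
    return "ok"

print(diagnose_failure(0.2, 0.9))  # retrieval failure
print(diagnose_failure(0.9, 0.3))  # generation failure
```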

Key Metrics

# Core RAG metrics
metrics = {
    "context_precision": "How relevant is retrieved context?",
    "context_recall": "Does retrieved context contain answer?",
    "faithfulness": "Does answer match retrieved context?",
    "answer_relevancy": "How relevant is answer to question?",
    "answer_similarity": "How similar is answer to ground truth?",
    "answer_correctness": "Is answer factually correct?"
}

RAGAs: Retrieval-Augmented Generation Assessment

RAGAs provides a framework for evaluating RAG pipelines using an LLM-as-a-judge approach: a judge model scores each question/answer/context triple against metric-specific rubrics.
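The LLM-as-a-judge pattern boils down to prompting a judge model for a numeric score. A minimal sketch of the idea, where `call_llm` is a hypothetical stand-in for the judge model (RAGAs wires in a real LLM internally):

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real judge model would return a completion like "0.9"
    return "0.9"

def judge_faithfulness(question: str, answer: str, context: str) -> float:
    prompt = (
        "On a scale of 0 to 1, how well is the answer supported by the context?\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with only the number."
    )
    return float(call_llm(prompt).strip())

score = judge_faithfulness(
    "What is machine learning?",
    "ML is a subset of AI.",
    "ML is a method where computers learn from data.",
)
```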

RAGAs Installation and Setup

# Install ragas
pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy
)
from ragas.metrics.critique import harmfulness

# Prepare test dataset
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": [
        "What is machine learning?",
        "How does a neural network work?",
        "What are transformers in NLP?"
    ],
    "answer": [
        "Machine learning is a subset of AI...",
        "Neural networks are computing systems...",
        "Transformers are a type of neural network..."
    ],
    "contexts": [
        ["ML is a method where computers learn from data..."],
        ["Neural networks consist of interconnected nodes..."],
        ["Transformers use attention mechanisms..."]
    ],
    "ground_truth": [
        "Machine learning is a subset of artificial intelligence...",
        "A neural network is inspired by biological neurons...",
        "Transformers are deep learning models introduced by Google..."
    ]
})

Running RAGAs Evaluation

# Run evaluation with RAGAs
results = evaluate(
    eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_relevancy
    ]
)

# Convert to pandas for analysis
df = results.to_pandas()
print(df)

# Output:
#    question  faithfulness  answer_relevancy  context_precision  context_recall
# 0  What is...          0.85              0.92               0.78            0.80
# 1  How does...         0.90              0.88               0.82            0.75
# 2  What are...         0.88              0.95               0.85            0.90
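Per-question scores are most useful once aggregated: averaging each metric across the dataset shows which pipeline stage to fix first. Using the sample scores above (plain Python for illustration; the same works directly on the pandas frame):

```python
# Per-question metric scores from the sample evaluation output
rows = [
    {"faithfulness": 0.85, "answer_relevancy": 0.92,
     "context_precision": 0.78, "context_recall": 0.80},
    {"faithfulness": 0.90, "answer_relevancy": 0.88,
     "context_precision": 0.82, "context_recall": 0.75},
    {"faithfulness": 0.88, "answer_relevancy": 0.95,
     "context_precision": 0.85, "context_recall": 0.90},
]

# Mean per metric, then flag anything under an (illustrative) 0.85 bar
means = {m: sum(r[m] for r in rows) / len(rows) for m in rows[0]}
weak = sorted(m for m, v in means.items() if v < 0.85)
print(weak)  # ['context_precision', 'context_recall'] -> retrieval needs work
```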

Custom RAGAs Metrics

from ragas.metrics import MetricWithLLM
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Wrap LLM for RAGAs
llm = ChatOpenAI(model="gpt-4")
llm_wrapped = LangchainLLMWrapper(llm)

# Custom metric for domain specificity
class DomainSpecificity(MetricWithLLM):
    name = "domain_specificity"
    evaluation_mode = "byLLM"

    async def _ascore(self, row, callbacks):
        prompt = f"""Rate how domain-specific this answer is (0-1):
        
Question: {row['question']}
Answer: {row['answer']}

Consider whether the answer uses domain-specific terminology 
and provides technically accurate information.
"""
        result = await self.llm.agenerate([prompt])
        return float(result.generations[0][0].text.strip())

# Use custom metric
specificity = DomainSpecificity(llm=llm_wrapped)
results = evaluate(eval_data, metrics=[faithfulness, specificity])

TruLens: AI Application Observability

TruLens provides observability for AI applications with a focus on feedback loops and continuous improvement.

TruLens Setup

# Install trulens-eval
pip install trulens-eval

from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

# Initialize TruLens
tru = Tru()

# Set up provider
openai_provider = TruOpenAI()

# Define feedback functions
groundedness = Groundedness(groundedness_provider=openai_provider)
feedbacks = [
    Feedback(
        openai_provider.relevance_with_cot_reasons,
        name="Answer Relevance"
    ).on_input_output(),
    Feedback(
        groundedness.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    ).on_context().on_output(),
    Feedback(
        openai_provider.qs_relevance_with_cot_reasons,
        name="Context Relevance"
    ).on_input().on_context()
]

TruLens with RAG Chain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Create RAG chain
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# Wrap with TruLens
tru_chain = TruChain(
    qa_chain,
    app_id="my-rag-app",
    feedbacks=feedbacks
)

# Run evaluation
questions = [
    "What is machine learning?",
    "Explain neural networks",
    "What are transformers?"
]

records = []
for question in questions:
    with tru_chain as recording:
        result = tru_chain.invoke(question)
        records.append({
            "question": question,
            "answer": result["result"],
            "sources": [doc.page_content for doc in result["source_documents"]]
        })

# Get feedback
tru.get_leaderboard(app_ids=["my-rag-app"])

Helicone: LLM Observability Platform

Helicone focuses on tracking LLM requests, costs, latency, and debugging prompts.

Helicone Basic Setup

# Helicone works as a proxy in front of the OpenAI API: point the
# client at Helicone's base URL and pass your Helicone key in a header
pip install openai

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

# All requests through this client are automatically tracked
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# The request, its tokens, latency, and cost now appear
# in the Helicone dashboard

Helicone with RAG

from langchain_openai import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings

# Route LangChain's OpenAI calls through the Helicone proxy
llm = ChatOpenAI(
    model="gpt-4",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your-helicone-key"}
)

# Create RAG chain with the Helicone-tracked LLM
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)

# Query with automatic tracking
result = qa.invoke("What is machine learning?")

# Helicone captures:
# - Token usage per request
# - Latency breakdown
# - Cost analysis
# - Prompt caching stats

Comparison: RAGAs vs TruLens vs Helicone

| Aspect        | RAGAs                  | TruLens                  | Helicone     |
|---------------|------------------------|--------------------------|--------------|
| Focus         | RAG metrics            | AI observability         | LLM tracking |
| Metrics       | Retrieval + Generation | Hallucination, relevance | Cost, latency |
| Integration   | Standalone             | LangChain, LlamaIndex    | Any LLM      |
| LLM-as-Judge  | Built-in               | Built-in                 | Via custom   |
| Cost Tracking | No                     | Limited                  | Yes          |
| Real-time     | Batch                  | Real-time                | Real-time    |

When to Use Each Tool

Use RAGAs When:

  • You need comprehensive RAG metrics
  • You want automated evaluation
  • You’re comparing different retrieval strategies

Use TruLens When:

  • Building production AI apps
  • Need real-time feedback
  • Using LangChain or LlamaIndex

Use Helicone When:

  • Need cost tracking and optimization
  • Debugging LLM prompts
  • Multi-model management

Bad Practices to Avoid

Bad Practice 1: No Evaluation Before Production

# Bad: Deploying RAG without evaluation
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
qa.invoke("production question")  # No metrics!

Bad Practice 2: Ignoring Retrieval Metrics

# Bad: Only measuring answer quality
# Problem: Don't know if retrieval is the issue
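Retrieval quality can be measured in isolation, without any LLM call. A minimal sketch using hit rate at k over hypothetical document IDs and gold labels:

```python
def hit_rate_at_k(retrieved_ids: list, relevant_id: str, k: int = 3) -> float:
    """1.0 if the known-relevant doc appears in the top-k results."""
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0

# Hypothetical gold labels: each query has one known-relevant doc
evals = [
    {"retrieved": ["doc_ml", "doc_db", "doc_nn"], "relevant": "doc_ml"},
    {"retrieved": ["doc_db", "doc_os", "doc_net"], "relevant": "doc_nn"},
]
score = sum(hit_rate_at_k(e["retrieved"], e["relevant"]) for e in evals) / len(evals)
print(score)  # 0.5 -> half the queries never see their relevant doc
```

A score like this pinpoints whether a bad answer traces back to retrieval before any generation metric is consulted.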

Good Practices Summary

Building Evaluation Datasets

# Good: diverse evaluation dataset covering different question
# types (factual, explanatory, comparative, procedural, opinion)
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": [...],      # ~100 questions spanning all five types
    "answer": [...],
    "contexts": [...],
    "ground_truth": [...]
})
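Before running any evaluator, it is worth sanity-checking the dataset shape; silent column mismatches are a common source of confusing scores. A small validation helper (the column names follow the RAGAs convention used above):

```python
def validate_eval_data(data: dict) -> None:
    """Raise ValueError if the eval dataset is malformed."""
    required = ["question", "answer", "contexts", "ground_truth"]
    missing = [col for col in required if col not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    lengths = {col: len(data[col]) for col in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column length mismatch: {lengths}")
    if not all(isinstance(ctx, list) for ctx in data["contexts"]):
        raise ValueError("each 'contexts' entry must be a list of strings")

validate_eval_data({
    "question": ["What is machine learning?"],
    "answer": ["ML is a subset of AI..."],
    "contexts": [["ML is a method where computers learn from data..."]],
    "ground_truth": ["Machine learning is a subset of AI..."],
})  # passes silently
```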
