RAG Evaluation Complete Guide: Measuring and Improving Your RAG System

Introduction

Building a RAG (Retrieval-Augmented Generation) system is only half the battle. To know that it works well, you need to evaluate it rigorously. Without systematic evaluation, hallucinations and irrelevant results go undetected until users run into them.

This guide covers the main parts of RAG evaluation: metrics, tools, benchmarking, and optimization strategies.

Why RAG Evaluation Matters

The Challenge

RAG systems have multiple failure points:

Component	Failure Mode
Retriever	Retrieves irrelevant documents
Chunker	Creates poor chunks
Embedder	Produces poor embeddings
Generator	Hallucinates answers
Re-ranker	Misses relevant results

Evaluation Impact

Without evaluation you won’t know your system’s quality, have no direction for improvement, and will face growing user complaints as trust erodes. Systematic evaluation turns those unknowns into actionable metrics.

Evaluation Metrics

Retrieval Metrics

Precision@K

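def precision_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Calculate precision at k: the fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    retrieved_k = retrieved[:k]
    relevant_retrieved = [doc for doc in retrieved_k if doc in relevant]
    # Divides by k even when fewer than k documents were retrieved
    return len(relevant_retrieved) / k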

Recall@K

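def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Calculate recall at k: the fraction of all relevant documents found in the top k."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = [doc for doc in retrieved_k if doc in relevant]
    # With no relevant documents, recall is undefined; return 0.0 by convention
    return len(relevant_retrieved) / len(relevant) if relevant else 0.0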
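To see how the two metrics move as K grows, here is a small worked example. It is a minimal sketch: the helper `precision_recall_at_k` and the document IDs are made up for illustration, and the helper simply combines the two definitions above in one pass.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Return (precision@k, recall@k) for a single query."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Count how many of the top-k retrieved IDs are in the relevant set
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy ranking; the IDs are hypothetical
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Precision falls from 1.00 at K=1 to 0.40 at K=5, while recall rises from 0.33 and plateaus at 0.67 because `d5` is never retrieved. Reporting both at several values of K makes this trade-off visible.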

Mean Reciprocal Rank (MRR)

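def mean_reciprocal_rank(queries: list) -> float:
    """Calculate MRR across queries.

    Each query must expose `retrieved`, a ranked list of documents
    that each carry an `is_relevant` flag.
    """
    if not queries:
        return 0.0  # avoid dividing by zero on an empty query set

    reciprocal_ranks = []

    for query in queries:
        for i, doc in enumerate(query.retrieved, 1):
            if doc.is_relevant:
                reciprocal_ranks.append(1 / i)
                break
        else:
            # No relevant document was retrieved for this query
            reciprocal_ranks.append(0)

    return sum(reciprocal_ranks) / len(reciprocal_ranks)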
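The MRR function expects query objects with a `retrieved` list whose items carry an `is_relevant` flag. The sketch below shows one possible shape for that data and works through three queries by hand. The `Doc` and `Query` dataclasses are assumptions for illustration, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    is_relevant: bool


@dataclass
class Query:
    text: str
    retrieved: list[Doc] = field(default_factory=list)


def reciprocal_rank(query: Query) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(query.retrieved, start=1):
        if doc.is_relevant:
            return 1.0 / rank
    return 0.0


queries = [
    Query("q1", [Doc("a", True), Doc("b", False)]),   # first hit at rank 1 -> 1.0
    Query("q2", [Doc("c", False), Doc("d", True)]),   # first hit at rank 2 -> 0.5
    Query("q3", [Doc("e", False), Doc("f", False)]),  # no hit -> 0.0
]

mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(f"MRR = {mrr:.2f}")  # (1.0 + 0.5 + 0.0) / 3 = 0.50
```

MRR only looks at the first relevant result per query, so it suits use cases where a single good document is enough.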

Generation Metrics

Faithfulness

Faithfulness measures whether every claim in the generated answer is supported by the retrieved context. It is calculated by breaking the response into individual claims, checking each against the context, then dividing supported claims by total claims. A score of 1.0 means no hallucination; 0.0 means nothing is grounded.
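The calculation can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: claim extraction is assumed to have already happened, and `naive_verifier` is a toy word-overlap check standing in for the LLM judge that a real evaluator would use.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims that the verifier judges to be supported by the context."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_verifier(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: every word of the claim must occur in the context."""
    context_words = set(context.lower().replace(".", " ").split())
    claim_words = claim.lower().replace(".", " ").split()
    return all(word in context_words for word in claim_words)


context = "The company was founded in 2020. It has offices in Berlin."
claims = [
    "The company was founded in 2020",  # supported by the context
    "It has offices in Berlin",         # supported by the context
    "It has 500 employees",             # not in the context
]

score = faithfulness_score(claims, context, naive_verifier)
print(f"Faithfulness: {score:.2f}")
```

Two of the three claims are grounded, so the score is 2/3. Passing the verifier in as a function makes it easy to swap the toy check for an LLM call later.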

Answer Relevance

Answer relevance measures how directly the response addresses the original question, independent of factual accuracy. A high-faithfulness but low-relevance answer is factually grounded but off-topic.
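Because the two generation metrics are independent, reading them together is more informative than reading either alone. The sketch below maps a pair of scores to a coarse label. The function name and the 0.8 cut-off are assumptions chosen for illustration.

```python
def diagnose_answer(faithfulness: float, relevance: float, threshold: float = 0.8) -> str:
    """Map the two generation scores to a coarse label (threshold is an assumption)."""
    grounded = faithfulness >= threshold
    on_topic = relevance >= threshold
    if grounded and on_topic:
        return "grounded and on-topic"
    if grounded:
        return "grounded but off-topic"
    if on_topic:
        return "on-topic but possibly hallucinated"
    return "neither grounded nor on-topic"


print(diagnose_answer(0.95, 0.40))
print(diagnose_answer(0.50, 0.90))
```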

Evaluation Tools

RAGAs

RAGAs is a widely used open-source library for offline RAG evaluation. As of 2026 it uses a collections-based API where each metric is a class instantiated with an LLM judge:

import asyncio
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import (
    Faithfulness,
    ResponseRelevancy,
    ContextPrecision,
    ContextRecall,
)
from ragas.dataset_schema import SingleTurnSample
from ragas import evaluate, EvaluationDataset

client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Build metric scorers
faithfulness_scorer = Faithfulness(llm=llm)
relevancy_scorer = ResponseRelevancy(llm=llm)

# Score a single sample directly
sample = SingleTurnSample(
    user_input="What is RAG?",
    response="RAG stands for Retrieval-Augmented Generation...",
    retrieved_contexts=["RAG is a technique that combines retrieval with generation..."],
    reference="RAG is a technique that retrieves documents to ground LLM responses.",
)

score = asyncio.run(faithfulness_scorer.ascore(sample))
print(f"Faithfulness: {score.value}")

# Or evaluate a full dataset
dataset = EvaluationDataset(samples=[sample])
results = evaluate(
    dataset,
    metrics=[faithfulness_scorer, relevancy_scorer],
)
print(results)

The four core RAG metrics are:

Metric	What it measures	Requires ground truth?
Faithfulness	Claims in response supported by context	No
Response Relevancy	Response addresses the question	No
Context Precision	Retrieved docs ranked by relevance	Yes
Context Recall	Relevant info present in context	Yes

TruLens

TruLens 2.7+ (April 2026) introduced a unified Metric class that replaces the older Feedback and TruLlama APIs. Use it for tracing and evaluating LangChain or LlamaIndex apps at runtime:

import numpy as np
from trulens.core import Metric, Selector, TruSession
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

provider = OpenAI()

# Define metrics with explicit selectors (new API)
answer_relevance = Metric(
    implementation=provider.relevance_with_cot_reasons,
    name="Answer Relevance",
    selectors={
        "prompt": Selector.select_record_input(),
        "response": Selector.select_record_output(),
    },
)

context_relevance = Metric(
    implementation=provider.context_relevance_with_cot_reasons,
    name="Context Relevance",
    selectors={
        "question": Selector.select_record_input(),
        "context": Selector.select_context(collect_list=False),
    },
    agg=np.mean,
)

groundedness = Metric(
    implementation=provider.groundedness_measure_with_cot_reasons,
    name="Groundedness",
    selectors={
        "source": Selector.select_context(),
        "statement": Selector.select_record_output(),
    },
    agg=np.mean,
)

# Instrument your app (rag_chain is your existing LangChain chain)
session = TruSession()
tru_app = TruChain(
    rag_chain,
    app_name="my-rag",
    metrics=[answer_relevance, context_relevance, groundedness],
)

# Run and record
with tru_app:
    response = rag_chain.invoke("What is RAG?")

session.get_leaderboard()

TruLens 2.7 also supports MLflow integration — you can pass TruLens metrics directly to mlflow.genai.evaluate without adapter code.

LangSmith

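from langsmith import Client, evaluate

client = Client()

# Define evaluation dataset: create it, then add examples
dataset = client.create_dataset(dataset_name="RAG Evaluation")
client.create_examples(
    inputs=[{"question": "What is RAG?"}],
    outputs=[{"answer": "RAG combines retrieval...", "contexts": ["..."]}],
    dataset_id=dataset.id,
)

# evaluate() takes evaluator callables rather than metric-name strings.
# This placeholder only checks that an answer was produced; replace it
# with an LLM-judge evaluator for faithfulness or answer relevancy.
def has_answer(run, example) -> dict:
    output = (run.outputs or {}).get("output", "")
    return {"key": "has_answer", "score": int(bool(output))}

# Run evaluation (rag_pipeline is your own RAG chain)
results = evaluate(
    lambda x: {"output": rag_pipeline.invoke(x["question"])},
    data="RAG Evaluation",
    evaluators=[has_answer],
)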

Building Test Datasets

Creating Ground Truth

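def create_test_dataset(documents: list, questions_per_doc: int = 5) -> list:
    """Create a synthetic test dataset from documents.

    generate_questions and generate_answer are LLM-backed helpers you supply.
    Review a sample of the generated pairs by hand before trusting them.
    """
    test_cases = []

    for doc in documents:
        questions = generate_questions(doc, n=questions_per_doc)

        for question in questions:
            # The reference answer is generated from the source document alone
            answer = generate_answer(doc, question)
            test_cases.append({
                "question": question,
                # Placeholder: overwrite with the pipeline's response at evaluation time
                "answer": answer,
                "contexts": [doc],
                "ground_truth": answer,
            })

    return test_cases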

Question Types to Cover

A strong test set covers multiple question categories:

Type	Description	Example
Factoid	Specific facts from documents	“What year was X founded?”
Definition	Explanations of concepts	“What is RAG?”
Process	How something works	“How does vector indexing work?”
Comparison	Differences between things	“What’s the difference between X and Y?”
Edge case	Boundary or out-of-scope inputs	“What happens if no documents match?”

As a rule of thumb, aim for at least 50 questions per category so that per-category averages are not dominated by a handful of outliers.
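A quick way to check a test set against that target is to count questions per category. The sketch below assumes each test case carries a `type` field with one of the five category names; that field and the `coverage_gaps` helper are hypothetical additions for illustration.

```python
from collections import Counter

CATEGORIES = ("factoid", "definition", "process", "comparison", "edge_case")


def coverage_gaps(test_cases: list[dict], minimum: int = 50) -> dict[str, int]:
    """Return how many more questions each under-covered category needs."""
    counts = Counter(case["type"] for case in test_cases)
    return {
        category: minimum - counts[category]
        for category in CATEGORIES
        if counts[category] < minimum
    }


# Toy test set: 50 factoid questions and 12 definition questions
test_cases = (
    [{"question": f"factoid {i}", "type": "factoid"} for i in range(50)]
    + [{"question": f"definition {i}", "type": "definition"} for i in range(12)]
)

print(coverage_gaps(test_cases))
```

Here the factoid category is complete, the definition category needs 38 more questions, and the other three categories need 50 each.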

Benchmarking Strategies

A/B Testing

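from scipy import stats

def ab_test(rag_v1, rag_v2, test_dataset: list) -> dict:
    """Compare two RAG systems on the same test set."""
    # evaluate_system is your own scoring harness; it should return
    # per-question scores in dataset order plus their average
    results_v1 = evaluate_system(rag_v1, test_dataset)
    results_v2 = evaluate_system(rag_v2, test_dataset)

    # Both systems answer the same questions, so the samples are paired:
    # use a paired t-test rather than an independent-samples test
    t_stat, p_value = stats.ttest_rel(
        results_v1.scores,
        results_v2.scores,
    )

    return {
        "v1_score": results_v1.avg_score,
        "v2_score": results_v2.avg_score,
        "improvement": results_v2.avg_score - results_v1.avg_score,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }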
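When per-question scores are far from normally distributed, or the test set is small, a paired bootstrap is a common alternative to a t-test. The sketch below is a different technique from the function above: it uses only the standard library, resamples questions with replacement, and reports a percentile confidence interval for the mean improvement. The scores are made up for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(
    scores_v1: list[float],
    scores_v2: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean per-question difference (v2 - v1)."""
    if len(scores_v1) != len(scores_v2):
        raise ValueError("both systems must be scored on the same questions")
    diffs = [b - a for a, b in zip(scores_v1, scores_v2)]
    rng = random.Random(seed)  # seeded so the result is reproducible
    # Resample questions with replacement and record the mean difference each time
    resampled_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = resampled_means[int(n_resamples * alpha / 2)]
    upper = resampled_means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Made-up per-question scores for two versions on the same eight questions
v1 = [0.70, 0.65, 0.80, 0.55, 0.90, 0.60, 0.75, 0.68]
v2 = [0.78, 0.70, 0.85, 0.66, 0.92, 0.71, 0.80, 0.77]

low, high = paired_bootstrap_ci(v1, v2)
print(f"95% CI for mean improvement: [{low:.3f}, {high:.3f}]")
print("interval excludes zero" if low > 0 else "interval includes zero")
```

If the interval excludes zero, the improvement is unlikely to be an artifact of which questions happened to be in the test set.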

Regression Testing

import pandas as pd
from datetime import datetime

def track_metrics(system, test_dataset: list, runs: int = 10) -> dict:
    """Track metrics across multiple evaluation runs"""
    results = []

    for i in range(runs):
        score = evaluate_system(system, test_dataset)
        results.append({
            "run": i,
            "faithfulness": score.faithfulness,
            "relevancy": score.relevancy,
            "timestamp": datetime.now(),
        })

    df = pd.DataFrame(results)
    return {
        "mean": df["faithfulness"].mean(),
        "std": df["faithfulness"].std(),
        "data": df,
    }
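The mean and standard deviation from repeated runs give a noise band that a new run can be checked against. A minimal sketch follows, assuming a simple rule: flag a regression when the new score falls more than two standard deviations below the baseline mean. The function name, the two-sigma rule, and the scores are assumptions for illustration.

```python
from statistics import mean, stdev


def is_regression(baseline_scores: list[float], new_score: float, n_sigma: float = 2.0) -> bool:
    """True when new_score is more than n_sigma standard deviations below the baseline mean."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to estimate spread")
    floor = mean(baseline_scores) - n_sigma * stdev(baseline_scores)
    return new_score < floor


# Made-up faithfulness scores from ten baseline runs
baseline = [0.86, 0.88, 0.87, 0.85, 0.89, 0.87, 0.86, 0.88, 0.87, 0.86]

print(is_regression(baseline, 0.86))  # within normal run-to-run noise
print(is_regression(baseline, 0.78))  # well below the baseline band
```

A check like this can run in CI so that a change which lowers faithfulness fails the build instead of reaching production.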

Common Issues and Fixes

Issue 1: Retrieved docs not relevant

A common cause is fixed-size chunking that splits semantic units across chunk boundaries. Switch to semantic chunking:

# Before: fixed-size chunks
from langchain_text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=500)

# After: semantic chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)
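To build intuition for percentile-based breakpoints, here is a toy version of the idea that uses only the standard library. It is not LangChain's implementation: word counts stand in for real embeddings, the percentile is computed with a crude nearest-rank rule, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], percentile: float = 75.0) -> list[str]:
    """Start a new chunk where the distance to the next sentence reaches the percentile threshold."""
    if len(sentences) < 2:
        return list(sentences)
    vectors = [bow_vector(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    ranked = sorted(distances)
    # Nearest-rank percentile of the consecutive-sentence distances
    threshold = ranked[min(len(ranked) - 1, int(len(ranked) * percentile / 100))]
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance >= threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Vector indexes store embeddings for fast search.",
    "Approximate search trades accuracy for speed in vector indexes.",
    "The cafeteria opens at noon.",
    "Lunch in the cafeteria costs five dollars.",
]

for chunk in semantic_chunks(sentences):
    print(chunk)
```

The largest jump in distance falls between the second and third sentences, so the text splits into one chunk about vector indexes and one about the cafeteria, which a fixed 500-character window would not guarantee.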

Issue 2: Poor answer quality

A structured prompt with a worked example often reduces vague or hallucinated answers, because it shows the model both the expected format and the instruction to stay within the context:

# Before: minimal prompt
prompt = "Answer based on context: {context}\nQuestion: {question}"

# After: structured prompt with example
prompt = """You are a helpful assistant. Use only the provided context to answer.

Example:
Context: The company was founded in 2020.
Question: When was it founded?
Answer: The company was founded in 2020.

Now answer:
Context: {context}
Question: {question}
Answer:"""
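Filling a template like this from retrieved chunks might look as follows. This is a minimal sketch: `build_prompt` and the numbered-chunk format are assumptions for illustration, and the template is shortened to keep the example compact.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Use only the provided context to answer.

Context: {context}
Question: {question}
Answer:"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Number the retrieved chunks, join them, and fill the template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    # str.format only parses the template, so braces inside chunks are safe
    return PROMPT_TEMPLATE.format(context=context, question=question)


chunks = ["The company was founded in 2020.", "It is based in Berlin."]
prompt = build_prompt(chunks, "When was it founded?")
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each statement, which in turn simplifies faithfulness checks.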

Issue 3: High latency

Cache embeddings and parallelize retrieval for independent sub-queries:

import asyncio
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_cached_embedding(text: str):
    return embedding_model.encode(text)

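async def async_retrieve(query):
    # get_relevant_docs must be a coroutine function; gather runs the
    # sub-query lookups concurrently and preserves their order
    tasks = [get_relevant_docs(chunk) for chunk in query.chunks]
    results = await asyncio.gather(*tasks)
    return combine_results(results)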
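The effect of running sub-queries concurrently can be seen in a self-contained simulation. In the sketch below, `fake_retrieve` is a stand-in for a vector-store call and simply sleeps to mimic network latency; the names and the 0.1 s delay are assumptions.

```python
import asyncio
import time


async def fake_retrieve(sub_query: str, delay: float = 0.1) -> list[str]:
    """Stand-in for a vector-store call; sleeps to simulate network latency."""
    await asyncio.sleep(delay)
    return [f"doc for {sub_query}"]


async def retrieve_all(sub_queries: list[str]) -> list[str]:
    """Run all sub-query retrievals concurrently and flatten the results in order."""
    results = await asyncio.gather(*(fake_retrieve(q) for q in sub_queries))
    return [doc for docs in results for doc in docs]


start = time.perf_counter()
docs = asyncio.run(retrieve_all(["pricing", "refund policy", "shipping"]))
elapsed = time.perf_counter() - start

print(docs)
print(f"elapsed: {elapsed:.2f}s")  # roughly one delay rather than three, since the calls overlap
```

The gain only applies when the sub-queries are independent and the retrieval client is genuinely asynchronous; a blocking client would need a thread pool instead.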

Production Monitoring

Key Metrics Dashboard

Metric	Target	Alert Threshold
Retrieval latency P95	< 200ms	> 500ms
Generation latency P95	< 2s	> 5s
Faithfulness	> 0.85	< 0.75
Answer relevancy	> 0.80	< 0.70
Error rate	< 1%	> 5%
User satisfaction (thumbs up)	> 70%	< 50%
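The alert column of this table can be expressed as data and checked in one loop. The sketch below also shows one way to compute a P95 from raw latency samples using the standard library. The metric key names, the `breached` helper, and the sample latencies are assumptions for illustration.

```python
from statistics import quantiles


def p95(samples: list[float]) -> float:
    """95th percentile of the samples (requires at least two values)."""
    return quantiles(samples, n=100, method="inclusive")[94]


# Alert thresholds from the table; "above" alerts when the value exceeds the limit
ALERT_RULES = {
    "retrieval_latency_p95_ms": ("above", 500),
    "generation_latency_p95_s": ("above", 5),
    "faithfulness": ("below", 0.75),
    "answer_relevancy": ("below", 0.70),
    "error_rate": ("above", 0.05),
    "thumbs_up_rate": ("below", 0.50),
}


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that cross their alert threshold."""
    alerts = []
    for name, (direction, limit) in ALERT_RULES.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this window
        if (direction == "above" and value > limit) or (direction == "below" and value < limit):
            alerts.append(name)
    return alerts


# Made-up retrieval latencies in milliseconds, including one slow outlier
latencies_ms = [120, 135, 150, 160, 180, 190, 210, 240, 300, 900]
current = {
    "retrieval_latency_p95_ms": p95(latencies_ms),
    "faithfulness": 0.82,
    "error_rate": 0.01,
}
print(breached(current))
```

A single slow request pushes the P95 above the 500 ms alert threshold here, which is why tail percentiles are tracked instead of averages.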

Alerting

def check_metrics(current: dict, baseline: dict) -> list[str]:
    """Return list of alert messages for metric regressions."""
    alerts = []

    if current["faithfulness"] < baseline["faithfulness"] * 0.9:
        alerts.append("Faithfulness dropped by >10%")

    if current["p95_latency"] > baseline["p95_latency"] * 1.5:
        alerts.append("P95 latency increased by >50%")

    if current["error_rate"] > 0.05:
        alerts.append("Error rate above 5%")

    return alerts

Evaluation Tools Comparison

Aspect	RAGAs	TruLens	LangSmith
Primary use	Offline batch evaluation	Runtime tracing + eval	Experiment tracking + eval
Setup complexity	Low	Medium	Medium
Ground truth required	Optional (2 of 4 metrics)	No	Optional
Production monitoring	No	Yes	Yes
LLM framework support	Framework-agnostic	LangChain, LlamaIndex, LangGraph	LangChain, custom
Cost tracking	No	Yes (TruLens 2.7+)	Yes
Open source	Yes	Yes	No (hosted)

Choose RAGAs for fast offline evaluation during development, TruLens for runtime monitoring and iteration, and LangSmith if you are already in the LangChain ecosystem and want unified experiment tracking.

Conclusion

RAG evaluation is essential for production systems. Use the right metrics, build comprehensive test sets, and monitor in production.

Key takeaways:

Measure both retrieval and generation — end-to-end metrics matter more than either alone
Build diverse test sets — cover factoid, definition, process, comparison, and edge-case questions
Monitor in production — catch regressions before users report them
Iterate based on data — a low context-recall score points to the retriever; a low faithfulness score points to the generator
