RAG Evaluation Complete Guide: Measuring and Improving Your RAG System

Published: August 2, 2025 | Updated: June 22, 2026 | Larry Qu | 7 min read

Introduction

Building a RAG (Retrieval-Augmented Generation) system is only half the battle. To ensure it works well, you need to evaluate it rigorously. Poor evaluation leads to hallucinations, irrelevant results, and frustrated users.

This guide covers the main parts of RAG evaluation: metrics, tools, test datasets, benchmarking, and optimization strategies.


Why RAG Evaluation Matters

The Challenge

RAG systems have multiple failure points:

| Component | Failure Mode |
| --- | --- |
| Retriever | Retrieves irrelevant documents |
| Chunker | Creates poor chunks |
| Embedder | Produces poor embeddings |
| Generator | Hallucinates answers |
| Re-ranker | Misses relevant results |

Evaluation Impact

Without evaluation, you cannot gauge your system's quality, you have no direction for improvement, and you will face growing user complaints as trust erodes. Systematic evaluation turns those unknowns into actionable metrics.


Evaluation Metrics

Retrieval Metrics

Precision@K

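def precision_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of the top k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    retrieved_k = retrieved[:k]
    relevant_retrieved = [doc for doc in retrieved_k if doc in relevant]
    # Divides by k even when fewer than k documents were retrieved
    return len(relevant_retrieved) / k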

Recall@K

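def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results.

    Assumes `retrieved` contains no duplicates; repeated hits would inflate the score.
    """
    retrieved_k = retrieved[:k]
    relevant_retrieved = [doc for doc in retrieved_k if doc in relevant]
    return len(relevant_retrieved) / len(relevant) if relevant else 0.0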
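
As a quick sanity check of both definitions, the sketch below recomputes them on a toy ranking. The document IDs and the compact helper `hits_at_k` are made up for illustration and are not part of the functions above.

```python
# Toy data (hypothetical IDs): a ranked result list and the known-relevant documents
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d5", "d9"]

def hits_at_k(retrieved: list, relevant: list, k: int) -> int:
    """Count relevant documents among the top k results."""
    return sum(1 for doc in retrieved[:k] if doc in relevant)

for k in (3, 5):
    hits = hits_at_k(retrieved, relevant, k)
    precision = hits / k             # share of the top k that is relevant
    recall = hits / len(relevant)    # share of all relevant docs that was found
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
# k=3: precision=0.33, recall=0.33
# k=5: precision=0.40, recall=0.67
```

Recall can only stay flat or rise as k grows, while precision can move in either direction, so it helps to report both at the k your generator actually consumes.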

Mean Reciprocal Rank (MRR)

def mean_reciprocal_rank(queries: list) -> float:
    """Calculate MRR across queries"""
    reciprocal_ranks = []

    for query in queries:
        for i, doc in enumerate(query.retrieved, 1):
            if doc.is_relevant:
                reciprocal_ranks.append(1 / i)
                break
        else:
            reciprocal_ranks.append(0)

    # Guard against an empty query list to avoid division by zero
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
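
The MRR function above expects query objects with a `retrieved` list whose items expose `is_relevant`, but it does not define them. One possible shape for those inputs is sketched below; the `Doc` and `Query` dataclasses and the `reciprocal_rank` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: str
    is_relevant: bool  # ground-truth label for the query that retrieved it

@dataclass
class Query:
    text: str
    retrieved: list = field(default_factory=list)  # ranked list of Doc objects

def reciprocal_rank(query: Query) -> float:
    """Return 1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, doc in enumerate(query.retrieved, 1):
        if doc.is_relevant:
            return 1 / rank
    return 0.0

queries = [
    Query("q1", [Doc("a", False), Doc("b", True)]),   # first hit at rank 2 -> 0.5
    Query("q2", [Doc("c", True), Doc("d", False)]),   # first hit at rank 1 -> 1.0
    Query("q3", [Doc("e", False), Doc("f", False)]),  # no hit -> 0.0
]
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(mrr)  # 0.5
```

Because only the first relevant hit counts, MRR suits question answering where one good passage is enough; it says nothing about how many relevant documents were found overall.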

Generation Metrics

Faithfulness

Faithfulness measures whether every claim in the generated answer is supported by the retrieved context. It is calculated by breaking the response into individual claims, checking each against the context, then dividing supported claims by total claims. A score of 1.0 means no hallucination; 0.0 means nothing is grounded.
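
The calculation described above can be sketched as follows. This is a minimal illustration, not how any particular library implements it: in practice an LLM judge extracts the claims and verifies each one, whereas here the claims are supplied by hand and `naive_substring_judge` is a deliberately crude stand-in. Both function names are assumptions made for this example.

```python
from typing import Callable, Iterable

def faithfulness_score(
    claims: Iterable[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Supported claims divided by total claims."""
    claims = list(claims)
    if not claims:
        # Assumption: an answer with no claims is scored 0.0; some setups skip it instead
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def naive_substring_judge(claim: str, context: str) -> bool:
    """Placeholder judge: a claim counts as supported only if it appears verbatim."""
    return claim.lower() in context.lower()

context = "The company was founded in 2020. It builds search tools."
claims = [
    "The company was founded in 2020",  # supported
    "It builds search tools",           # supported
    "The company has 500 employees",    # not in the context
]
score = faithfulness_score(claims, context, naive_substring_judge)
print(round(score, 2))  # 0.67
```

Keeping the judge as a parameter makes it easy to swap the placeholder for an LLM call without touching the scoring arithmetic.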

Answer Relevance

Answer relevance measures how directly the response addresses the original question, independent of factual accuracy. A high-faithfulness but low-relevance answer is factually grounded but off-topic.


Evaluation Tools

RAGAs

RAGAs is a widely used open-source library for offline RAG evaluation. As of 2026 it uses a collections-based API where each metric is a class instantiated with an LLM judge:

import asyncio
from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import (
    Faithfulness,
    ResponseRelevancy,
    ContextPrecision,
    ContextRecall,
)
from ragas.dataset_schema import SingleTurnSample
from ragas import evaluate, EvaluationDataset

client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Build metric scorers
faithfulness_scorer = Faithfulness(llm=llm)
relevancy_scorer = ResponseRelevancy(llm=llm)

# Score a single sample directly
sample = SingleTurnSample(
    user_input="What is RAG?",
    response="RAG stands for Retrieval-Augmented Generation...",
    retrieved_contexts=["RAG is a technique that combines retrieval with generation..."],
    reference="RAG is a technique that retrieves documents to ground LLM responses.",
)

score = asyncio.run(faithfulness_scorer.ascore(sample))
print(f"Faithfulness: {score.value}")

# Or evaluate a full dataset
dataset = EvaluationDataset(samples=[sample])
results = evaluate(
    dataset,
    metrics=[faithfulness_scorer, relevancy_scorer],
)
print(results)

The four core RAG metrics are:

| Metric | What it measures | Requires ground truth? |
| --- | --- | --- |
| Faithfulness | Claims in response supported by context | No |
| Response Relevancy | Response addresses the question | No |
| Context Precision | Retrieved docs ranked by relevance | Yes |
| Context Recall | Relevant info present in context | Yes |

TruLens

TruLens 2.7+ (April 2026) introduced a unified Metric class that replaces the older Feedback and TruLlama APIs. Use it for tracing and evaluating LangChain or LlamaIndex apps at runtime:

import numpy as np
from trulens.core import Metric, Selector, TruSession
from trulens.providers.openai import OpenAI
from trulens.apps.langchain import TruChain

provider = OpenAI()

# Define metrics with explicit selectors (new API)
answer_relevance = Metric(
    implementation=provider.relevance_with_cot_reasons,
    name="Answer Relevance",
    selectors={
        "prompt": Selector.select_record_input(),
        "response": Selector.select_record_output(),
    },
)

context_relevance = Metric(
    implementation=provider.context_relevance_with_cot_reasons,
    name="Context Relevance",
    selectors={
        "question": Selector.select_record_input(),
        "context": Selector.select_context(collect_list=False),
    },
    agg=np.mean,
)

groundedness = Metric(
    implementation=provider.groundedness_measure_with_cot_reasons,
    name="Groundedness",
    selectors={
        "source": Selector.select_context(),
        "statement": Selector.select_record_output(),
    },
    agg=np.mean,
)

# Instrument your app
session = TruSession()
tru_app = TruChain(
    rag_chain,
    app_name="my-rag",
    metrics=[answer_relevance, context_relevance, groundedness],
)

# Run and record
with tru_app:
    response = rag_chain.invoke("What is RAG?")

session.get_leaderboard()

TruLens 2.7 also supports MLflow integration — you can pass TruLens metrics directly to mlflow.genai.evaluate without adapter code.

LangSmith

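from langsmith import Client, evaluate

client = Client()

# Create the dataset, then add examples as inputs plus reference outputs
dataset = client.create_dataset(dataset_name="RAG Evaluation")
client.create_examples(
    dataset_id=dataset.id,
    inputs=[{"question": "What is RAG?"}],
    outputs=[{"answer": "RAG combines retrieval...", "contexts": ["..."]}],
)

# Target function wrapping the pipeline (rag_pipeline is assumed to be defined elsewhere)
def target(inputs: dict) -> dict:
    return {"answer": rag_pipeline.invoke(inputs["question"])}

# LangSmith takes evaluator callables rather than metric-name strings.
# Exact match is only a placeholder; swap in an LLM-judged check
# for faithfulness or answer relevancy.
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

# Run evaluation
results = evaluate(
    target,
    data="RAG Evaluation",
    evaluators=[exact_match],
)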

Building Test Datasets

Creating Ground Truth

def create_test_dataset(documents: list, questions_per_doc: int = 5) -> list:
    """Create test dataset from documents"""
    test_cases = []

    for doc in documents:
        questions = generate_questions(doc, n=questions_per_doc)

        for question in questions:
            answer = generate_answer(doc, question)
            test_cases.append({
                "question": question,
                "answer": answer,
                "contexts": [doc],
                "ground_truth": answer,
            })

    return test_cases

Question Types to Cover

A strong test set covers multiple question categories:

| Type | Description | Example |
| --- | --- | --- |
| Factoid | Specific facts from documents | “What year was X founded?” |
| Definition | Explanations of concepts | “What is RAG?” |
| Process | How something works | “How does vector indexing work?” |
| Comparison | Differences between things | “What’s the difference between X and Y?” |
| Edge case | Boundary or out-of-scope inputs | “What happens if no documents match?” |

Aim for at least 50 questions per category; with fewer, per-category scores are too noisy to compare reliably.
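
To see why sample size matters, the sketch below estimates how wide a 95% confidence interval is for a per-category pass rate. It assumes each question is scored pass or fail and uses the normal approximation for a proportion; the 0.8 pass rate and the helper name `ci_half_width` are illustrative assumptions.

```python
import math

def ci_half_width(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (10, 50, 200):
    print(n, round(ci_half_width(0.8, n), 3))
# 10 0.248
# 50 0.111
# 200 0.055
```

With 10 questions an observed 80% pass rate is compatible with anything from roughly 55% to 100%, so small differences between systems are indistinguishable from noise. At 50 questions the band narrows to about 11 points either side.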


Benchmarking Strategies

A/B Testing

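from scipy import stats

def ab_test(rag_v1, rag_v2, test_dataset: list) -> dict:
    """Compare two RAG systems on the same test set."""
    # evaluate_system is assumed to return per-question scores in dataset order
    results_v1 = evaluate_system(rag_v1, test_dataset)
    results_v2 = evaluate_system(rag_v2, test_dataset)

    # Both systems answer the same questions, so the scores are paired:
    # use a paired t-test rather than an independent-samples test
    t_stat, p_value = stats.ttest_rel(
        results_v1.scores,
        results_v2.scores,
    )

    return {
        "v1_score": results_v1.avg_score,
        "v2_score": results_v2.avg_score,
        "improvement": results_v2.avg_score - results_v1.avg_score,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }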
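
A t-test assumes the scores are approximately normally distributed, which bounded metrics in the 0 to 1 range do not always satisfy. As an alternative technique, not the method used above, the sketch below runs a paired sign-flip permutation test using only the standard library. The function name, resample count, and seed are assumptions for illustration.

```python
import random
from statistics import mean

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-question score differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired: one score per question per system")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-question scores for two systems on the same ten questions
v1 = [0.60, 0.70, 0.50, 0.65, 0.70, 0.55, 0.60, 0.62, 0.58, 0.66]
v2 = [0.70, 0.80, 0.60, 0.75, 0.80, 0.65, 0.70, 0.72, 0.68, 0.76]
p_value = paired_permutation_test(v1, v2)
print(p_value < 0.05)  # True
```

The test only asks how often a random relabeling of "which system produced which score" yields a mean difference as large as the observed one, so it needs no distributional assumption beyond paired questions.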

Regression Testing

import pandas as pd
from datetime import datetime

def track_metrics(system, test_dataset: list, runs: int = 10) -> dict:
    """Track metrics across multiple evaluation runs"""
    results = []

    for i in range(runs):
        score = evaluate_system(system, test_dataset)
        results.append({
            "run": i,
            "faithfulness": score.faithfulness,
            "relevancy": score.relevancy,
            "timestamp": datetime.now(),
        })

    df = pd.DataFrame(results)
    return {
        "mean": df["faithfulness"].mean(),
        "std": df["faithfulness"].std(),
        "data": df,
    }

Common Issues and Fixes

Issue 1: Retrieved docs not relevant

A common cause is fixed-size chunking that splits semantic units across boundaries. Switching to semantic chunking often helps:

# Before: fixed-size chunks
from langchain_text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=500)

# After: semantic chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)

Issue 2: Poor answer quality

A structured prompt with a worked example can noticeably reduce vague or hallucinated answers:

# Before: minimal prompt
prompt = "Answer based on context: {context}\nQuestion: {question}"

# After: structured prompt with example
prompt = """You are a helpful assistant. Use only the provided context to answer.

Example:
Context: The company was founded in 2020.
Question: When was it founded?
Answer: The company was founded in 2020.

Now answer:
Context: {context}
Question: {question}
Answer:"""
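
To use a structured template like this one, the retrieved chunks have to be joined into the `{context}` slot. A minimal sketch follows; the shortened template, the `build_prompt` helper, and the bracketed chunk numbering are assumptions for illustration rather than part of the prompt above.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Use only the provided context to answer.

Context: {context}
Question: {question}
Answer:"""

def build_prompt(contexts: list, question: str) -> str:
    """Number the retrieved chunks, join them, and fill the template."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(contexts, 1))
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["The company was founded in 2020.", "It is based in Berlin."],
    "When was it founded?",
)
print(prompt)
```

Numbering the chunks is optional, but it gives the model a handle for citing which passage supports the answer, which in turn makes faithfulness checks easier to audit.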

Issue 3: High latency

Cache embeddings and parallelize retrieval for independent sub-queries:

import asyncio
from functools import lru_cache

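@lru_cache(maxsize=1000)
def get_cached_embedding(text: str):
    # embedding_model is assumed to be defined elsewhere; repeated texts skip re-encoding
    return embedding_model.encode(text)

async def async_retrieve(query):
    # get_relevant_docs must be a coroutine function so gather() can run the lookups concurrently
    tasks = [get_relevant_docs(chunk) for chunk in query.chunks]
    results = await asyncio.gather(*tasks)
    return combine_results(results)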
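
The snippet above depends on objects defined elsewhere, so here is a self-contained sketch of the same two ideas that runs as written. The embedding and the retriever are stand-ins: `cached_embedding` returns a made-up vector, and `fake_retrieve` simulates I/O latency with a sleep. All names are hypothetical.

```python
import asyncio
from functools import lru_cache

CALLS = {"embed": 0}  # counts real (uncached) embedding computations

@lru_cache(maxsize=1000)
def cached_embedding(text: str) -> tuple:
    CALLS["embed"] += 1
    # Placeholder "embedding"; a tuple is immutable, so callers cannot corrupt the cached value
    return tuple(float(ord(ch)) for ch in text[:4])

async def fake_retrieve(sub_query: str) -> list:
    await asyncio.sleep(0.05)  # stands in for a vector-store round trip
    return [f"doc-for-{sub_query}"]

async def retrieve_all(sub_queries: list) -> list:
    # gather() runs the lookups concurrently and returns results in input order
    results = await asyncio.gather(*(fake_retrieve(q) for q in sub_queries))
    return [doc for docs in results for doc in docs]

docs = asyncio.run(retrieve_all(["pricing", "refunds", "shipping"]))
print(docs)  # ['doc-for-pricing', 'doc-for-refunds', 'doc-for-shipping']

cached_embedding("What is RAG?")
cached_embedding("What is RAG?")  # served from the cache
print(CALLS["embed"])  # 1
```

The three lookups overlap instead of running back to back, so total wait time is close to the slowest single lookup. This only helps when the sub-queries are independent and the retriever is I/O-bound.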

Production Monitoring

Key Metrics Dashboard

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Retrieval latency P95 | < 200ms | > 500ms |
| Generation latency P95 | < 2s | > 5s |
| Faithfulness | > 0.85 | < 0.75 |
| Answer relevancy | > 0.80 | < 0.70 |
| Error rate | < 1% | > 5% |
| User satisfaction (thumbs up) | > 70% | < 50% |
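
The latency rows are P95 values, meaning 95% of requests finish at or below that time. One simple way to compute it from raw samples is the nearest-rank method, sketched below; the helper name and the sample latencies are illustrative assumptions, and monitoring backends may interpolate differently.

```python
import math

def percentile_nearest_rank(samples: list, pct: int) -> float:
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical retrieval latencies in milliseconds
latencies_ms = [120, 135, 150, 160, 180, 190, 210, 240, 480, 950]
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p95)        # 950
print(p95 > 500)  # True: exceeds the retrieval alert threshold
```

The median of these samples is under 200 ms, yet the P95 is dominated by a single slow request. That gap is why tail percentiles, not averages, are the usual alerting signal.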

Alerting

def check_metrics(current: dict, baseline: dict) -> list[str]:
    """Return list of alert messages for metric regressions."""
    alerts = []

    if current["faithfulness"] < baseline["faithfulness"] * 0.9:
        alerts.append("Faithfulness dropped by >10%")

    if current["p95_latency"] > baseline["p95_latency"] * 1.5:
        alerts.append("P95 latency increased by >50%")

    if current["error_rate"] > 0.05:
        alerts.append("Error rate above 5%")

    return alerts

Evaluation Tools Comparison

| Aspect | RAGAs | TruLens | LangSmith |
| --- | --- | --- | --- |
| Primary use | Offline batch evaluation | Runtime tracing + eval | Experiment tracking + eval |
| Setup complexity | Low | Medium | Medium |
| Ground truth required | Optional (2 of 4 metrics) | No | Optional |
| Production monitoring | No | Yes | Yes |
| LLM framework support | Framework-agnostic | LangChain, LlamaIndex, LangGraph | LangChain, custom |
| Cost tracking | No | Yes (TruLens 2.7+) | Yes |
| Open source | Yes | Yes | No (hosted) |

Choose RAGAs for fast offline evaluation during development, TruLens for runtime monitoring and iteration, and LangSmith if you are already in the LangChain ecosystem and want unified experiment tracking.


Conclusion

RAG evaluation is essential for production systems. Use the right metrics, build comprehensive test sets, and monitor in production.

Key takeaways:

  1. Measure both retrieval and generation — end-to-end metrics matter more than either alone
  2. Build diverse test sets — cover factoid, definition, process, comparison, and edge-case questions
  3. Monitor in production — catch regressions before users report them
  4. Iterate based on data — a low context-recall score points to the retriever; a low faithfulness score points to the generator
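
The last takeaway can be turned into a small triage helper. This is a sketch under assumed thresholds: the faithfulness floor reuses the 0.75 alert threshold from the dashboard table, while the context-recall floor and the `diagnose` function itself are hypothetical.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    recall_floor: float = 0.75,        # assumed value, tune for your data
    faithfulness_floor: float = 0.75,  # matches the dashboard alert threshold
) -> list:
    """Return the components to investigate first, based on which metric is low."""
    suspects = []
    if context_recall < recall_floor:
        suspects.append("retriever")   # relevant information never reached the prompt
    if faithfulness < faithfulness_floor:
        suspects.append("generator")   # the answer goes beyond the retrieved context
    return suspects

print(diagnose(context_recall=0.52, faithfulness=0.91))  # ['retriever']
print(diagnose(context_recall=0.88, faithfulness=0.60))  # ['generator']
print(diagnose(context_recall=0.88, faithfulness=0.91))  # []
```

When both metrics are low, fixing the retriever first is usually the cheaper experiment, since the generator cannot stay grounded in context it never receives.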
