Introduction
How do you know if your AI agent is actually good? Unlike traditional software where you can write tests for specific outputs, AI agents are probabilistic and can produce wildly different responses. This makes evaluation challenging but critical.
This guide covers the essentials of evaluating AI agents: benchmarks, metrics, testing frameworks, and how to build robust evaluation systems.
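Because the same prompt can yield different outputs, a single run tells you very little; the baseline tool is repeated sampling. A minimal sketch of measuring agreement across runs (the `agent_answer` stub is hypothetical, standing in for a real sampled LLM call):

```python
# Hypothetical stub standing in for a stochastic agent sampled at temperature > 0.
def agent_answer(prompt: str, run: int) -> str:
    return "4" if run % 5 else "5"  # wrong on 2 of 10 runs, like a flaky agent

def consistency(outputs: list) -> float:
    """Fraction of runs agreeing with the most common output."""
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / len(outputs)

runs = [agent_answer("What is 2 + 2?", i) for i in range(10)]
print(consistency(runs))  # → 0.8
```

A deterministic function would score 1.0 here; anything lower means pass/fail on one run is partly luck.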
Why Agent Evaluation Matters
| Traditional Software | AI Agents |
|---|---|
| Input → Output | Input → Agent → Output |
| Deterministic | Probabilistic |
| Testable | Hard to test |
| Binary pass/fail | Grayscale quality |

Evaluation is critical for:
- Deployment decisions
- Model selection
- Performance monitoring
- Bug detection
- Safety assurance
Evaluation Dimensions
Key Metrics
| Dimension | Metrics | What It Measures |
|---|---|---|
| Accuracy | Task success, Correctness | Is the output right? |
| Efficiency | Latency, Token count | How fast/cheap? |
| Reliability | Consistency, Error rate | Does it work reliably? |
| Safety | Harmful content, Jailbreaks | Is it safe? |
| Helpfulness | User satisfaction | Does it help users? |
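In practice these dimensions get combined into a single headline number. One simple sketch (the dimension names and weights are illustrative, not a standard):

```python
def overall_score(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, each in [0, 1].
    Dimensions missing from `weights` default to weight 1.0."""
    total_weight = sum(weights.get(d, 1.0) for d in dimension_scores)
    return sum(score * weights.get(dim, 1.0)
               for dim, score in dimension_scores.items()) / total_weight

scores = {"accuracy": 0.9, "efficiency": 0.7, "reliability": 0.8,
          "safety": 1.0, "helpfulness": 0.85}
weights = {"accuracy": 2.0, "safety": 2.0}  # weight critical dimensions higher
print(round(overall_score(scores, weights), 3))  # → 0.879
```

The weighting is a product decision: a coding agent might double-weight accuracy, a customer-facing agent safety and helpfulness.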
Benchmark Frameworks
1. AgentBench
# AgentBench-style evaluation (illustrative API; see the AgentBench repo for the real harness)
import agentbench

# Define environment
env = agentbench.Environment(
    name="os",
    config={"task": "file_operations"}
)

# Run agent
agent = YourAgent()
results = []
for task in agentbench.tasks:
    # Execute task
    result = await agent.execute(task.prompt)
    # Evaluate
    score = task.evaluate(result)
    results.append({
        "task": task.name,
        "success": score.success,
        "score": score.value,
        "metrics": score.metrics,
    })

# Aggregate results
print(f"AgentBench Score: {sum(r['score'] for r in results) / len(results)}")
2. WebArena
# WebArena-style evaluation for web agents (illustrative API)
from webarena import WebArenaEnv, Task

env = WebArenaEnv(
    sites=["amazon", "reddit", "github"],
    docker_image="webarena/eval"
)

tasks = [
    Task("Find the price of iPhone on Amazon", site="amazon"),
    Task("Post a comment on Reddit", site="reddit"),
    Task("Create a repo on GitHub", site="github"),
]

results = []
for task in tasks:
    env.reset(target_site=task.site)
    # Execute
    result = await agent.execute(task.description)
    # Verify against environment state, not just the text response
    success = task.verify(env, result)
    results.append({"task": task.description, "success": success})

print(f"WebArena Success Rate: {sum(r['success'] for r in results) / len(results)}")
3. GAIA
# GAIA benchmark for general AI assistants (illustrative API;
# the official dataset is distributed via Hugging Face)
from gaia import GAIABenchmark
benchmark = GAIABenchmark(level="all")
results = await benchmark.evaluate(agent)
print(f"""
GAIA Results:
- Level 1: {results.level1_accuracy}%
- Level 2: {results.level2_accuracy}%
- Level 3: {results.level3_accuracy}%
- Overall: {results.overall_score}%
""")
Building Evaluation Systems
1. Custom Evaluation Framework
from dataclasses import dataclass
from typing import Any, Callable, Dict, List
import asyncio
import time

@dataclass
class EvaluationResult:
    task: str
    success: bool
    score: float
    metrics: Dict[str, float]
    errors: List[str]

class AgentEvaluator:
    def __init__(self, agent, evaluators: List[Callable]):
        self.agent = agent
        self.evaluators = evaluators

    async def evaluate_task(self, task: Task) -> EvaluationResult:
        # Run agent
        start = time.time()
        result = await self.agent.execute(task.input)
        duration = time.time() - start

        # Run evaluators
        eval_results = []
        for evaluator in self.evaluators:
            eval_results.append(await evaluator(task, result, duration))

        # Aggregate
        return EvaluationResult(
            task=task.name,
            success=all(e.success for e in eval_results),
            score=sum(e.score for e in eval_results) / len(eval_results),
            metrics={
                "duration": duration,
                "tokens": result.token_count,
                **{e.name: e.score for e in eval_results},
            },
            errors=[e.error for e in eval_results if e.error],
        )

    async def evaluate_dataset(self, tasks: List[Task]) -> Dict[str, Any]:
        # Run tasks concurrently; gather returns a coroutine and must be awaited
        results = await asyncio.gather(*[
            self.evaluate_task(t) for t in tasks
        ])

        # Compute aggregate metrics
        return {
            "total_tasks": len(tasks),
            "success_rate": sum(1 for r in results if r.success) / len(results),
            "avg_score": sum(r.score for r in results) / len(results),
            "avg_duration": sum(r.metrics["duration"] for r in results) / len(results),
            "avg_tokens": sum(r.metrics["tokens"] for r in results) / len(results),
            "error_types": self._categorize_errors(results),
        }
2. Task-Specific Evaluators
# Correctness evaluator
# (EvalResult is the per-evaluator result type: name, score, success, optional error/details)
import difflib

class CorrectnessEvaluator:
    def __init__(self, expected_output: Any):
        self.expected = expected_output

    async def evaluate(self, task: Task, result: Any, duration: float) -> EvalResult:
        # Exact match
        if self.expected == result.output:
            return EvalResult(name="correctness", score=1.0, success=True)

        # Partial match (for text)
        if isinstance(result.output, str):
            similarity = self._string_similarity(result.output, self.expected)
            return EvalResult(
                name="correctness",
                score=similarity,
                success=similarity > 0.8,
            )

        return EvalResult(name="correctness", score=0.0, success=False)

    def _string_similarity(self, a: str, b: str) -> float:
        # Character-level ratio; swap in embedding similarity for semantic matching
        return difflib.SequenceMatcher(None, a, b).ratio()
# Code execution evaluator
class CodeExecutionEvaluator:
    def __init__(self, test_cases: List[Dict]):
        self.test_cases = test_cases

    async def evaluate(self, task: Task, result: Any, duration: float) -> EvalResult:
        if not result.code:
            return EvalResult(name="code_execution", score=0.0, success=False)

        passed = 0
        for test in self.test_cases:
            try:
                # execute() should run generated code in a sandbox, never in-process
                output = execute(result.code, test["input"])
                if output == test["expected"]:
                    passed += 1
            except Exception:
                pass  # a crashing test case counts as a failure

        score = passed / len(self.test_cases)
        return EvalResult(
            name="code_execution",
            score=score,
            success=score == 1.0,
        )
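When you sample several candidate solutions per problem, report pass@k rather than a single-shot rate. The unbiased estimator from the HumanEval paper, given n samples of which c pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval):
    probability that at least one of k draws from n samples passes,
    when c of the n samples pass."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 of which passed the tests
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3 (matches the raw pass rate)
print(round(pass_at_k(10, 3, 5), 3))  # → 0.917
```

pass@1 matches the per-sample pass rate; higher k rewards agents that find a correct solution somewhere in the sample set.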
3. LLM-as-Judge
# Use an LLM to evaluate responses
import json

class LLMJudge:
    def __init__(self, judge_llm):
        self.llm = judge_llm

    async def evaluate(self, task: Task, result: Any) -> EvalResult:
        prompt = f"""
Evaluate this AI agent's response to the following task:

Task: {task.input}
Agent Response: {result.output}

Evaluate on a scale of 1-10 for:
1. Correctness - Is the response factually correct?
2. Completeness - Does it fully address the task?
3. Clarity - Is it clear and well-structured?
4. Helpfulness - Does it provide useful information?

Return your evaluation as JSON:
{{
    "correctness": 8,
    "completeness": 7,
    "clarity": 9,
    "helpfulness": 8,
    "overall": 8,
    "reasoning": "..."
}}
"""
        response = await self.llm.generate(prompt)
        evaluation = json.loads(response)  # assumes the judge returns valid JSON
        return EvalResult(
            name="llm_judge",
            score=evaluation["overall"] / 10,
            success=evaluation["overall"] >= 7,
            details=evaluation,
        )
Metrics Deep Dive
1. Task Success Rate
# Calculate success rate
def calculate_success_rate(results: List[EvalResult]) -> float:
    successful = sum(1 for r in results if r.success)
    return successful / len(results) if results else 0.0

# Weighted success rate
def weighted_success_rate(results: List[EvalResult], weights: Dict) -> float:
    weighted = 0.0
    total_weight = 0.0
    for result in results:
        weight = weights.get(result.task, 1)
        weighted += (1 if result.success else 0) * weight
        total_weight += weight
    return weighted / total_weight if total_weight else 0.0
2. Efficiency Metrics
# Calculate efficiency
class EfficiencyMetrics:
    def __init__(self, results: List[EvalResult]):
        self.results = results

    @property
    def avg_latency(self) -> float:
        return sum(r.metrics.get("duration", 0) for r in self.results) / len(self.results)

    @property
    def p95_latency(self) -> float:
        latencies = sorted(r.metrics.get("duration", 0) for r in self.results)
        return latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)]

    @property
    def avg_tokens(self) -> float:
        return sum(r.metrics.get("tokens", 0) for r in self.results) / len(self.results)

    @property
    def cost_per_task(self) -> float:
        # Assuming $0.01 per 1K tokens; substitute your model's actual pricing
        tokens = sum(r.metrics.get("tokens", 0) for r in self.results)
        return (tokens / 1000) * 0.01 / len(self.results)
3. Reliability Metrics
# Calculate reliability
class ReliabilityMetrics:
    def __init__(self, results: List[EvalResult]):
        self.results = results

    @property
    def consistency_score(self) -> float:
        # Run the same task multiple times and measure variance in scores
        task_scores = {}
        for r in self.results:
            task_scores.setdefault(r.task, []).append(r.score)

        # Coefficient of variation (lower = more consistent)
        variations = []
        for scores in task_scores.values():
            if len(scores) > 1:
                mean = sum(scores) / len(scores)
                std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
                variations.append(std / mean if mean else 0)
        return 1 - (sum(variations) / len(variations)) if variations else 1.0

    @property
    def error_rate(self) -> float:
        errors = sum(len(r.errors) for r in self.results)
        return errors / len(self.results)
Testing Strategies
1. Unit Testing Agents
# Test individual components
import pytest

class TestAgentTools:
    @pytest.mark.asyncio
    async def test_search_tool(self):
        tool = SearchTool()
        result = await tool.execute("AI agents")
        assert result is not None
        assert len(result) > 0
        assert "agent" in result[0].lower()

    @pytest.mark.asyncio
    async def test_file_tool(self):
        tool = FileTool()
        # Write
        await tool.write("test.txt", "hello")
        # Read
        content = await tool.read("test.txt")
        assert content == "hello"

class TestAgentLogic:
    @pytest.mark.asyncio
    async def test_task_decomposition(self):
        agent = PlanningAgent()
        task = "Write a Python function to sort a list"
        plan = await agent.decompose(task)
        assert len(plan.steps) > 0
        assert any("function" in step.lower() for step in plan.steps)
2. Integration Testing
# Test full agent workflows
@pytest.mark.asyncio
async def test_customer_support_workflow():
    agent = CustomerSupportAgent()

    # Simulate conversation
    responses = []
    response1 = await agent.handle("I can't login")
    responses.append(response1)
    assert "password" in response1.text.lower() or "reset" in response1.text.lower()

    response2 = await agent.handle("Yes, I've tried that")
    responses.append(response2)
    assert "account" in response2.text.lower() or "support" in response2.text.lower()

    # Verify resolution
    assert len(responses) <= 5  # should resolve within 5 turns
3. Regression Testing
# Compare against baseline
class RegressionTest:
    def __init__(self, baseline_results: Dict):
        self.baseline = baseline_results

    async def run(self, current_results: Dict) -> RegressionReport:
        report = RegressionReport()
        for task, baseline in self.baseline.items():
            current = current_results.get(task)
            if current:
                delta = current.score - baseline.score
                report.add_task(task, baseline.score, current.score, delta)
        return report

    def has_regression(self, report: RegressionReport, threshold: float = 0.1) -> bool:
        return any(r.delta < -threshold for r in report.results)
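A fixed score threshold can flag noise as a regression when tasks are flaky. A sketch of a one-sided two-proportion z-test (normal approximation) for checking whether a drop in pass rate is statistically meaningful:

```python
from math import erf, sqrt

def regression_p_value(base_pass: int, base_n: int,
                       cur_pass: int, cur_n: int) -> float:
    """One-sided two-proportion z-test: p-value for the hypothesis that
    the current success rate dropped below baseline only by chance."""
    p1, p2 = base_pass / base_n, cur_pass / cur_n
    pooled = (base_pass + cur_pass) / (base_n + cur_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    z = (p1 - p2) / se
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(drop this large under H0)

# Baseline passed 90/100 tasks; the new build passes 80/100
print(regression_p_value(90, 100, 80, 100))  # small p-value: likely a real regression
```

With only tens of tasks, even a 10-point drop can be inconclusive, which is an argument for larger eval sets or repeated runs.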
Evaluation Datasets
Public Benchmarks
| Benchmark | Focus | Tasks | URL |
|---|---|---|---|
| AgentBench | General agents | 7 domains | agentbench.github.io |
| WebArena | Web interaction | 6 sites | webarena.dev |
| GAIA | General assistance | 466 | gaia-benchmark.github.io |
| APPS | Code generation | 10,000 | github.com/hendrycks/apps |
| HumanEval | Code generation | 164 | github.com/openai/human-eval |
| MMLU | Knowledge | 57 subjects | github.com/hendrycks/test |
Building Custom Datasets
import json

class DatasetBuilder:
    def __init__(self):
        self.tasks = []

    def add_task(
        self,
        name: str,
        input: str,
        expected_output: Any,
        constraints: List[str] = None,
        metadata: Dict = None,
    ):
        self.tasks.append(Task(
            name=name,
            input=input,
            expected_output=expected_output,
            constraints=constraints or [],
            metadata=metadata or {},
        ))

    def export(self, path: str):
        with open(path, "w") as f:
            json.dump([t.to_dict() for t in self.tasks], f, indent=2)

    def load(self, path: str):
        with open(path, "r") as f:
            data = json.load(f)
        self.tasks = [Task.from_dict(d) for d in data]
Continuous Evaluation
Production Monitoring
# Monitor agent in production
class ProductionMonitor:
    def __init__(self, agent):
        self.agent = agent
        self.metrics = MetricsCollector()

    async def track_request(self, request: Request) -> Response:
        start = time.time()
        try:
            response = await self.agent.handle(request)
            success = True
            error = None
        except Exception as e:
            response = None
            success = False
            error = str(e)
        duration = time.time() - start

        # Record metrics
        await self.metrics.record(
            request=request.input,
            response=response,
            duration=duration,
            success=success,
            error=error,
        )
        return response

    async def generate_report(self) -> Report:
        return await self.metrics.generate_report()
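Beyond periodic reports, production monitoring usually includes live alerting. A minimal sketch of a rolling-window error-rate alert (window size and threshold are illustrative):

```python
from collections import deque

class ErrorRateAlert:
    """Fire an alert when the failure rate over the last `window`
    requests exceeds `threshold`. Waits for a full window first."""
    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) > self.threshold

alert = ErrorRateAlert(window=10, threshold=0.2)
fired = [alert.record(ok) for ok in [True] * 7 + [False] * 3]
print(fired[-1])  # → True: 3/10 failures exceeds the 20% threshold
```

In a real system `record` would be called from `track_request` and route to a pager or dashboard rather than returning a bool.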
Best Practices
Good: Comprehensive Evaluation
# Good: Test multiple dimensions
async def evaluate_agent(agent):
    results = {
        "correctness": await evaluate_correctness(agent),
        "efficiency": await evaluate_efficiency(agent),
        "safety": await evaluate_safety(agent),
        "helpfulness": await evaluate_helpfulness(agent),
    }
    return Results(
        overall=weighted_average(results),
        dimensions=results,
    )
Bad: Single Metric
# Bad: Only measuring accuracy
accuracy = sum(1 for r in results if r.success) / len(results)
# Misses: cost, latency, safety, user satisfaction
Good: A/B Testing
# Compare agents in production
from random import random

class ABTester:
    async def test(self, agent_a, agent_b, traffic_split=0.5):
        results_a = []
        results_b = []
        for request in stream_requests():
            agent = agent_a if random() < traffic_split else agent_b
            result = await agent.handle(request)
            if agent is agent_a:
                results_a.append(result)
            else:
                results_b.append(result)
        return {
            "agent_a": summarize(results_a),
            "agent_b": summarize(results_b),
            "winner": "a" if avg_score(results_a) > avg_score(results_b) else "b",
        }
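Per-request `random()` assignment means the same user can bounce between variants mid-conversation. A common fix is sticky assignment by hashing a stable ID; a sketch (the `user-N` IDs are illustrative):

```python
import hashlib

def assign_variant(request_id: str, split: float = 0.5) -> str:
    """Sticky A/B assignment: hash a stable ID so the same user
    always sees the same variant, unlike per-request random()."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "a" if bucket < split else "b"

assignments = [assign_variant(f"user-{i}") for i in range(1000)]
print(assignments.count("a"))  # roughly half, and deterministic across runs
```

Stickiness also makes results easier to analyze, since each user contributes to exactly one variant.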
Conclusion
Agent evaluation requires a multi-dimensional approach:
- Use established benchmarks - AgentBench, WebArena, GAIA
- Build custom evaluators - Task-specific correctness checks
- Measure comprehensively - Accuracy, efficiency, reliability, safety
- Monitor in production - Track real-world performance
- Iterate continuously - A/B test and improve
The right evaluation strategy depends on your use case. Start with existing benchmarks, then build custom evaluation for your specific requirements.
Related Articles
- Building Production AI Agents
- AI Agent Memory Systems
- AI Agent Trends 2026
- Introduction to Agentic AI