Evaluating Agent Systems Beyond Accuracy
Ask “how accurate is your agent?” and you’ll get a number that tells you almost nothing useful. For most use cases, an agent that’s 95% accurate but costs $2 per task and takes 30 seconds is worse than one that’s 90% accurate at $0.10 and 3 seconds.
The Evaluation Matrix
Agent evaluation needs at least four dimensions:
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Correctness | Did the agent produce the right result? | Table stakes — but insufficient alone |
| Cost | Tokens consumed, API calls made | Determines unit economics |
| Latency | Time from request to response | Determines user experience |
| Reliability | Consistency across similar inputs | Determines trust |
A single accuracy number collapses all of this into one dimension. Don’t let it.
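One way to keep all four dimensions visible is to record them per task rather than per benchmark. Here is a minimal sketch of such a record; the field names other than `correct` and `cost` (which match the scoring function in the next section) are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One evaluated task, recorded along all four dimensions.
    Field names are illustrative, not a standard schema."""
    task_id: str
    correct: bool      # correctness: did the agent produce the right result?
    cost: float        # cost in dollars: tokens consumed plus per-call charges
    latency: float     # latency in seconds: request to final response
    run_id: int = 0    # repeat index for the same input, used for reliability checks
```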
Cost-Normalized Accuracy
Instead of raw accuracy, track cost-normalized metrics:
```python
def cost_normalized_score(results):
    """results: a list of per-task records with `correct` (bool) and `cost` (float) attributes."""
    correct = sum(1 for r in results if r.correct)
    total_cost = sum(r.cost for r in results)
    accuracy = correct / len(results)
    cost_per_correct = total_cost / max(correct, 1)
    return {
        "accuracy": accuracy,
        "cost_per_task": total_cost / len(results),
        "cost_per_correct_task": cost_per_correct,
        "efficiency_score": accuracy / cost_per_correct,  # Higher is better
    }
```
This reveals the actual trade-off: is a 5% accuracy improvement worth a 3x cost increase?
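As a usage sketch with made-up numbers (mirroring the hypothetical 95%/$2 and 90%/$0.10 agents from the introduction), the efficiency score makes that trade-off concrete:

```python
from collections import namedtuple

# Illustrative numbers only, not real benchmark results.
Result = namedtuple("Result", ["correct", "cost"])
variant_a = [Result(True, 2.00)] * 95 + [Result(False, 2.00)] * 5
variant_b = [Result(True, 0.10)] * 90 + [Result(False, 0.10)] * 10

print(cost_normalized_score(variant_a)["efficiency_score"])  # ~0.45
print(cost_normalized_score(variant_b)["efficiency_score"])  # ~8.1
```

Variant B wins on efficiency despite lower raw accuracy, which is exactly what a single accuracy number hides.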
Reliability Over Accuracy
A model that scores 92% on a benchmark but occasionally produces catastrophically wrong answers is less useful than one that scores 88% but fails predictably and gracefully.
Measure reliability along these axes (a code sketch of the first two follows the list):
- Variance across runs on the same input
- Failure mode distribution — are failures minor or catastrophic?
- Recovery rate — when the agent fails, can it self-correct?
- Degradation curve — how does performance change under load, with longer contexts, or with adversarial inputs?
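A minimal sketch of the first two checks, assuming each run is logged as a dict with an input id, a numeric score, and a failure severity. Those field names are assumptions for illustration, not a standard format:

```python
from collections import Counter, defaultdict
from statistics import pvariance

def reliability_report(runs):
    """runs: list of dicts like {"input_id": ..., "score": float,
    "severity": "none" | "minor" | "catastrophic"} (illustrative fields)."""
    scores_by_input = defaultdict(list)
    for r in runs:
        scores_by_input[r["input_id"]].append(r["score"])

    # Variance across repeated runs of the same input (lower = more consistent).
    per_input_variance = {
        input_id: pvariance(scores) if len(scores) > 1 else 0.0
        for input_id, scores in scores_by_input.items()
    }

    # Failure mode distribution: are failures minor or catastrophic?
    failure_modes = Counter(r["severity"] for r in runs if r["severity"] != "none")

    return {
        "mean_variance": sum(per_input_variance.values()) / max(len(per_input_variance), 1),
        "failure_modes": dict(failure_modes),
    }
```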
Building Evaluation Datasets
The hardest part of agent evaluation is building representative test cases:
- Start with production logs — real user requests are the best test cases
- Categorize by difficulty — trivial, standard, complex, edge case
- Include failure modes — ambiguous requests, missing information, contradictory instructions
- Version your dataset — as the agent evolves, so should the tests
Avoid the temptation to only test the happy path. The edge cases are where agents differ most.
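One lightweight way to handle the categorizing and versioning, assuming the dataset lives in a versioned JSONL file with difficulty tags (the path and field names are illustrative, not a fixed convention):

```python
import json
from collections import defaultdict
from pathlib import Path

def load_eval_dataset(path="evals/agent_eval_v3.jsonl"):
    """Each line is a JSON object, e.g.:
    {"id": "case-017", "difficulty": "edge_case", "source": "production_log",
     "input": "...", "expected": "..."}
    Path and field names are illustrative."""
    cases = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    grouped = defaultdict(list)
    for case in cases:
        grouped[case["difficulty"]].append(case)
    return grouped
```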
Continuous Evaluation
Evaluation isn’t a one-time activity. Run it continuously:
- Pre-deploy: full test suite against the evaluation dataset
- Shadow mode: run new versions alongside production, compare outputs
- Production: sample live requests for automated quality scoring
- Regression: track metrics over time, alert on degradation
The goal is catching regressions before users do — and having the data to understand why performance changed.
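A minimal regression gate, assuming each evaluation run produces a metrics dict like the output of cost_normalized_score above plus a stored baseline to compare against. The thresholds here are illustrative; tune them to your own tolerance:

```python
def check_regressions(current, baseline, max_accuracy_drop=0.02, max_cost_increase=0.25):
    """Compare current metrics against a baseline and flag degradations.
    `current` and `baseline` are dicts with "accuracy" and "cost_per_task" keys."""
    alerts = []
    if baseline["accuracy"] - current["accuracy"] > max_accuracy_drop:
        alerts.append(f"accuracy dropped: {baseline['accuracy']:.3f} -> {current['accuracy']:.3f}")
    if current["cost_per_task"] > baseline["cost_per_task"] * (1 + max_cost_increase):
        alerts.append(f"cost per task rose: {baseline['cost_per_task']:.4f} -> {current['cost_per_task']:.4f}")
    return alerts  # an empty list means no regression detected
```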
The Human Element
Some aspects of agent quality can only be evaluated by humans:
- Was the response helpful, or just correct?
- Did the agent ask the right clarifying questions?
- Would the user trust this agent with the same task again?
Build human evaluation into your pipeline. Even a small sample (50-100 interactions per week) provides signal that automated metrics miss entirely.
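Feeding that weekly sample can be as simple as random selection from logged interactions; a sketch, assuming the records are whatever your logging already produces:

```python
import random

def sample_for_human_review(interactions, n=75, seed=None):
    """Pick 50-100 logged interactions per week for human review."""
    rng = random.Random(seed)
    return rng.sample(interactions, min(n, len(interactions)))
```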