Evaluating Agent Systems Beyond Accuracy
Ask “how accurate is your agent?” and you’ll get a number that tells you almost nothing useful. For most use cases, an agent that’s 95% accurate but costs $2 per task and takes 30 seconds is worse than one that’s 90% accurate at $0.10 and 3 seconds.
The Evaluation Matrix
Agent evaluation needs at least four dimensions:
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Correctness | Did the agent produce the right result? | Table stakes — but insufficient alone |
| Cost | Tokens consumed, API calls made | Determines unit economics |
| Latency | Time from request to response | Determines user experience |
| Reliability | Consistency across similar inputs | Determines trust |
A single accuracy number collapses all of this into one dimension. Don’t let it.
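One way to keep all four dimensions visible is to record them per task rather than per benchmark. Here is a minimal sketch of such a record; the field names other than `correct` and `cost` (which match the scoring function in the next section) are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One evaluated task, recorded along all four dimensions.
    Field names are illustrative, not a standard schema."""
    task_id: str
    correct: bool      # correctness: did the agent produce the right result?
    cost: float        # cost in dollars: tokens consumed plus per-call charges
    latency: float     # latency in seconds: request to final response
    run_id: int = 0    # repeat index for the same input, used for reliability checks
```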
Cost-Normalized Accuracy
Instead of raw accuracy, track cost-normalized metrics:
```python
def cost_normalized_score(results):
    """results: a list of per-task records with `correct` (bool) and `cost` (float) attributes."""
    correct = sum(1 for r in results if r.correct)
    total_cost = sum(r.cost for r in results)
    accuracy = correct / len(results)
    cost_per_correct = total_cost / max(correct, 1)
    return {
        "accuracy": accuracy,
        "cost_per_task": total_cost / len(results),
        "cost_per_correct_task": cost_per_correct,
        "efficiency_score": accuracy / cost_per_correct,  # Higher is better
    }
```
This reveals the actual trade-off: is a 5% accuracy improvement worth a 3x cost increase?
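As a usage sketch with made-up numbers (mirroring the hypothetical 95%/$2 and 90%/$0.10 agents from the introduction), the efficiency score makes that trade-off concrete:

```python
from collections import namedtuple

# Illustrative numbers only, not real benchmark results.
Result = namedtuple("Result", ["correct", "cost"])
variant_a = [Result(True, 2.00)] * 95 + [Result(False, 2.00)] * 5
variant_b = [Result(True, 0.10)] * 90 + [Result(False, 0.10)] * 10

print(cost_normalized_score(variant_a)["efficiency_score"])  # ~0.45
print(cost_normalized_score(variant_b)["efficiency_score"])  # ~8.1
```

Variant B wins on efficiency despite lower raw accuracy, which is exactly what a single accuracy number hides.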
Reliability Over Accuracy
A model that scores 92% on a benchmark but occasionally produces catastrophically wrong answers is less useful than one that scores 88% but fails predictably and gracefully.
Measure reliability along these axes (a code sketch of the first two follows the list):
- Variance across runs on the same input
- Failure mode distribution — are failures minor or catastrophic?
- Recovery rate — when the agent fails, can it self-correct?
- Degradation curve — how does performance change under load, with longer contexts, or with adversarial inputs?
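A minimal sketch of the first two checks, assuming each run is logged as a dict with an input id, a numeric score, and a failure severity. Those field names are assumptions for illustration, not a standard format:

```python
from collections import Counter, defaultdict
from statistics import pvariance

def reliability_report(runs):
    """runs: list of dicts like {"input_id": ..., "score": float,
    "severity": "none" | "minor" | "catastrophic"} (illustrative fields)."""
    scores_by_input = defaultdict(list)
    for r in runs:
        scores_by_input[r["input_id"]].append(r["score"])

    # Variance across repeated runs of the same input (lower = more consistent).
    per_input_variance = {
        input_id: pvariance(scores) if len(scores) > 1 else 0.0
        for input_id, scores in scores_by_input.items()
    }

    # Failure mode distribution: are failures minor or catastrophic?
    failure_modes = Counter(r["severity"] for r in runs if r["severity"] != "none")

    return {
        "mean_variance": sum(per_input_variance.values()) / max(len(per_input_variance), 1),
        "failure_modes": dict(failure_modes),
    }
```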
Building Evaluation Datasets
The hardest part of agent evaluation is building representative test cases:
- Start with production logs — real user requests are the best test cases
- Categorize by difficulty — trivial, standard, complex, edge case
- Include failure modes — ambiguous requests, missing information, contradictory instructions
- Version your dataset — as the agent evolves, so should the tests
Avoid the temptation to only test the happy path. The edge cases are where agents differ most.
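One lightweight way to handle the categorizing and versioning, assuming the dataset lives in a versioned JSONL file with difficulty tags (the path and field names are illustrative, not a fixed convention):

```python
import json
from collections import defaultdict
from pathlib import Path

def load_eval_dataset(path="evals/agent_eval_v3.jsonl"):
    """Each line is a JSON object, e.g.:
    {"id": "case-017", "difficulty": "edge_case", "source": "production_log",
     "input": "...", "expected": "..."}
    Path and field names are illustrative."""
    cases = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    grouped = defaultdict(list)
    for case in cases:
        grouped[case["difficulty"]].append(case)
    return grouped
```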
Continuous Evaluation
Evaluation isn’t a one-time activity. Run it continuously:
- Pre-deploy: full test suite against the evaluation dataset
- Shadow mode: run new versions alongside production, compare outputs
- Production: sample live requests for automated quality scoring
- Regression: track metrics over time, alert on degradation
The goal is catching regressions before users do — and having the data to understand why performance changed.
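A minimal regression gate, assuming each evaluation run produces a metrics dict like the output of cost_normalized_score above plus a stored baseline to compare against. The thresholds here are illustrative; tune them to your own tolerance:

```python
def check_regressions(current, baseline, max_accuracy_drop=0.02, max_cost_increase=0.25):
    """Compare current metrics against a baseline and flag degradations.
    `current` and `baseline` are dicts with "accuracy" and "cost_per_task" keys."""
    alerts = []
    if baseline["accuracy"] - current["accuracy"] > max_accuracy_drop:
        alerts.append(f"accuracy dropped: {baseline['accuracy']:.3f} -> {current['accuracy']:.3f}")
    if current["cost_per_task"] > baseline["cost_per_task"] * (1 + max_cost_increase):
        alerts.append(f"cost per task rose: {baseline['cost_per_task']:.4f} -> {current['cost_per_task']:.4f}")
    return alerts  # an empty list means no regression detected
```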
The Human Element
Some aspects of agent quality can only be evaluated by humans:
- Was the response helpful, or just correct?
- Did the agent ask the right clarifying questions?
- Would the user trust this agent with the same task again?
Build human evaluation into your pipeline. Even a small sample (50-100 interactions per week) provides signal that automated metrics miss entirely.
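Feeding that weekly sample can be as simple as random selection from logged interactions; a sketch, assuming the records are whatever your logging already produces:

```python
import random

def sample_for_human_review(interactions, n=75, seed=None):
    """Pick 50-100 logged interactions per week for human review."""
    rng = random.Random(seed)
    return rng.sample(interactions, min(n, len(interactions)))
```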