
Evaluating Agent Systems Beyond Accuracy

Tags: evaluation, agents, metrics

Ask “how accurate is your agent?” and you’ll get a number that tells you almost nothing useful. An agent that’s 95% accurate but costs $2 per task and takes 30 seconds is worse than one that’s 90% accurate at $0.10 in 3 seconds — for most use cases.
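Back-of-the-envelope, using those hypothetical figures, the comparison comes down to what each correct answer costs (a sketch, not a benchmark):

# Hypothetical figures from the comparison above.
cost_per_correct_a = 2.00 / 0.95   # ~$2.11 per correct task
cost_per_correct_b = 0.10 / 0.90   # ~$0.11 per correct task
print(cost_per_correct_a / cost_per_correct_b)  # roughly 19x more expensive per correct answer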

The Evaluation Matrix

Agent evaluation needs at least four dimensions:

Dimension   | What It Measures                         | Why It Matters
Correctness | Did the agent produce the right result?  | Table stakes, but insufficient alone
Cost        | Tokens consumed, API calls made          | Determines unit economics
Latency     | Time from request to response            | Determines user experience
Reliability | Consistency across similar inputs        | Determines trust

A single accuracy number collapses all of this into one dimension. Don’t let it.
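One way to keep those dimensions from collapsing is to record them separately for every task. A minimal sketch, assuming a simple per-task record (the TaskResult name and fields are illustrative; the code below only relies on .correct and .cost):

from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskResult:
    """One evaluated task, with each dimension kept separate (illustrative fields)."""
    task_id: str
    correct: bool                        # correctness: did the agent get it right?
    cost: float                          # cost: dollars spent on tokens and API calls
    latency_s: float                     # latency: seconds from request to response
    failure_kind: Optional[str] = None   # reliability: how it failed, if it failed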

Cost-Normalized Accuracy

Instead of raw accuracy, track cost-normalized metrics:

def cost_normalized_score(results):
    """Summarize accuracy alongside what each correct answer actually costs."""
    if not results:
        raise ValueError("results must not be empty")

    correct = sum(1 for r in results if r.correct)
    total_cost = sum(r.cost for r in results)

    accuracy = correct / len(results)
    # max() guards against division by zero when nothing was correct.
    cost_per_correct = total_cost / max(correct, 1)

    return {
        "accuracy": accuracy,
        "cost_per_task": total_cost / len(results),
        "cost_per_correct_task": cost_per_correct,
        # Higher is better; undefined if the run cost nothing at all.
        "efficiency_score": accuracy / cost_per_correct if cost_per_correct else None,
    }

This reveals the actual trade-off: is a 5% accuracy improvement worth a 3x cost increase?
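As a purely hypothetical illustration, reusing the TaskResult records sketched earlier (the numbers are made up):

# Agent A: 2 of 3 correct at $0.10 per task. Agent B: 3 of 3 correct at $0.30 per task.
results_a = [TaskResult("a1", True, 0.10, 3.0), TaskResult("a2", True, 0.10, 2.9),
             TaskResult("a3", False, 0.10, 3.2)]
results_b = [TaskResult("b1", True, 0.30, 9.0), TaskResult("b2", True, 0.30, 9.4),
             TaskResult("b3", True, 0.30, 8.8)]

print(cost_normalized_score(results_a)["cost_per_correct_task"])  # ~0.15
print(cost_normalized_score(results_b)["cost_per_correct_task"])  # ~0.30

Toy numbers, but the shape of the question is the point: per correct answer, agent B costs twice as much, and whether its extra accuracy justifies that depends on the workload.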

Reliability Over Accuracy

A model that scores 92% on a benchmark but occasionally produces catastrophically wrong answers is less useful than one that scores 88% but fails predictably and gracefully.

Measure reliability as:

  • Variance across runs on the same input (a sketch of this check follows the list)
  • Failure mode distribution — are failures minor or catastrophic?
  • Recovery rate — when the agent fails, can it self-correct?
  • Degradation curve — how does performance change under load, with longer contexts, or with adversarial inputs?
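A minimal sketch of the variance check, assuming the agent is a callable whose result exposes .answer and a numeric .score (both assumptions, not part of the original post):

import statistics

def variance_across_runs(agent, task, n_runs=10):
    """Run the same input repeatedly and report how consistent the outputs are."""
    outputs = [agent(task) for _ in range(n_runs)]
    scores = [o.score for o in outputs]            # assumes a numeric quality score per run
    distinct_answers = len({o.answer for o in outputs})
    return {
        "score_variance": statistics.pvariance(scores),
        "distinct_answers": distinct_answers,      # 1 means perfectly consistent output
        "runs": n_runs,
    }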

Building Evaluation Datasets

The hardest part of agent evaluation is building representative test cases (a minimal test-case format is sketched after the list):

  1. Start with production logs — real user requests are the best test cases
  2. Categorize by difficulty — trivial, standard, complex, edge case
  3. Include failure modes — ambiguous requests, missing information, contradictory instructions
  4. Version your dataset — as the agent evolves, so should the tests
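As one possible shape for such a dataset, here is a sketch of a versioned JSONL file of test cases; the field names, difficulty labels, and layout are assumptions, not a prescribed format:

import json

# Hypothetical test cases: drawn from production logs and labeled by difficulty.
EXAMPLE_CASES = [
    {"id": "case-001", "source": "production-log", "difficulty": "standard",
     "input": "Cancel my subscription but keep my data.",
     "expected": {"action": "cancel_subscription", "retain_data": True}},
    {"id": "case-002", "source": "synthetic", "difficulty": "edge-case",
     "input": "Cancel it.",   # deliberately ambiguous
     "expected": {"action": "clarify"}},
]

def write_dataset(path, cases, version):
    """Write cases as JSONL, with the dataset version recorded on the first line."""
    with open(path, "w") as f:
        f.write(json.dumps({"dataset_version": version}) + "\n")
        for case in cases:
            f.write(json.dumps(case) + "\n")

write_dataset("eval_cases_v3.jsonl", EXAMPLE_CASES, version="3")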

Avoid the temptation to only test the happy path. The edge cases are where agents differ most.

Continuous Evaluation

Evaluation isn’t a one-time activity. Run it continuously (a shadow-mode sketch follows the list):

  • Pre-deploy: full test suite against the evaluation dataset
  • Shadow mode: run new versions alongside production, compare outputs
  • Production: sample live requests for automated quality scoring
  • Regression: track metrics over time, alert on degradation
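The shadow-mode step might look roughly like this; the agent call signatures, the log structure, and the equality check are all assumptions:

def shadow_compare(request, production_agent, candidate_agent, log):
    """Serve the production answer; run the candidate on the side and log disagreements."""
    prod_out = production_agent(request)
    try:
        cand_out = candidate_agent(request)    # never allowed to touch the user-facing path
        log.append({
            "request": request,
            "production": prod_out,
            "candidate": cand_out,
            "agree": prod_out == cand_out,
        })
    except Exception as exc:                   # candidate failures are data, not outages
        log.append({"request": request, "candidate_error": repr(exc)})
    return prod_out                            # users always receive the production output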

The goal is catching regressions before users do — and having the data to understand why performance changed.

The Human Element

Some aspects of agent quality can only be evaluated by humans:

  • Was the response helpful, or just correct?
  • Did the agent ask the right clarifying questions?
  • Would the user trust this agent with the same task again?

Build human evaluation into your pipeline. Even a small sample (50-100 interactions per week) provides signal that automated metrics miss entirely.
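A rough sketch of drawing that weekly sample, assuming interactions are available as a list and that 75 reviews per week fits your capacity (both assumptions):

import random

def weekly_human_review_sample(interactions, k=75, seed=None):
    """Pick a small random slice of the week's interactions for human review."""
    rng = random.Random(seed)
    return rng.sample(interactions, min(k, len(interactions)))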
