Observability for AI Agent Systems
You can’t debug what you can’t see. Agent systems are notoriously difficult to observe because the interesting behavior happens inside the model — and you only see inputs and outputs.
Why Traditional APM Fails
Standard application monitoring tracks request latency, error rates, and throughput. For agent systems, these metrics miss the point entirely:
- A “successful” request might have taken 15 LLM calls internally
- Latency varies by 10x depending on the model’s reasoning path
- Error rates don’t capture “the agent did the wrong thing correctly”
You need metrics that understand agent-specific concerns: token spend per task, tool call success rates, reasoning loop depth, and goal completion rates.
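As a rough sketch of what that can look like in practice, the snippet below exposes those signals as Prometheus metrics. The metric names, buckets, and the trace fields it reads are illustrative assumptions, not part of any particular framework:

from prometheus_client import Counter, Histogram

# Illustrative metric names and buckets -- adapt to your own conventions.
TOKENS_PER_TASK = Histogram(
    "agent_tokens_per_task", "Total tokens consumed per completed task",
    buckets=(1_000, 5_000, 10_000, 50_000, 100_000, 500_000),
)
REASONING_DEPTH = Histogram("agent_reasoning_steps", "Reasoning-loop iterations per task")
TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls made by the agent", ["tool", "status"])
TASK_OUTCOMES = Counter("agent_task_outcomes_total", "Final task outcomes", ["outcome"])

def record_task(trace):
    # `trace` is assumed to carry the fields of the trace model described below.
    TOKENS_PER_TASK.observe(trace.total_tokens)
    REASONING_DEPTH.observe(trace.reasoning_steps)
    TASK_OUTCOMES.labels(outcome=trace.outcome).inc()
    for call in trace.tool_calls:
        TOOL_CALLS.labels(tool=call.tool_name, status=call.status).inc()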
The Trace Model
Every agent invocation should produce a structured trace:
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class AgentTrace:
    trace_id: str
    task: str
    started_at: datetime
    completed_at: datetime
    outcome: Literal["success", "failure", "timeout", "escalated"]

    # Agent-specific metrics
    total_tokens: int
    total_cost: float
    llm_calls: int
    tool_calls: list[ToolCallTrace]   # ToolCallTrace and Decision are defined separately
    reasoning_steps: int

    # Decision audit trail
    decisions: list[Decision]
    context_at_each_step: list[str]
The decisions field is crucial. It captures why the agent did what it did — not just what it did. This turns debugging from guesswork into analysis.
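The ToolCallTrace and Decision types referenced above aren't spelled out in this post; here is a minimal sketch of what they might carry, with field names as assumptions rather than a fixed schema:

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class ToolCallTrace:
    tool_name: str
    arguments: dict
    status: Literal["success", "error", "timeout"]
    latency_ms: float
    result_size: int

@dataclass
class Decision:
    step: int
    options_considered: list[str]
    chosen: str
    rationale: str       # the model's stated reason, captured verbatim
    timestamp: datetime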
Token Economics Dashboard
Token usage is to an agent system what the cloud bill is to traditional infrastructure. Track it ruthlessly:
from collections import defaultdict

# MODEL_PRICING maps model name -> {"input": ..., "output": ...} prices per 1,000 tokens.
class TokenTracker:
    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0, "cost": 0.0})

    def record(self, model: str, input_tokens: int, output_tokens: int):
        pricing = MODEL_PRICING[model]
        cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1000
        self.usage[model]["input"] += input_tokens
        self.usage[model]["output"] += output_tokens
        self.usage[model]["cost"] += cost

    def report(self) -> dict:
        return {
            "total_cost": sum(m["cost"] for m in self.usage.values()),
            "by_model": dict(self.usage),
            "input_output_ratio": self._io_ratio(),
        }

    def _io_ratio(self) -> float:
        # Ratio of input tokens to output tokens across all models
        total_in = sum(m["input"] for m in self.usage.values())
        total_out = sum(m["output"] for m in self.usage.values())
        return total_in / total_out if total_out else 0.0
A high input-to-output ratio often signals context bloat — the agent is sending too much history with each request.
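As a usage sketch, the check below flags a suspicious ratio after recording a couple of calls. The model names, prices, and the 20:1 threshold are all placeholders:

# Illustrative per-1K-token prices; substitute your provider's real price sheet.
MODEL_PRICING = {
    "frontier-model": {"input": 0.005, "output": 0.015},
    "small-model": {"input": 0.0005, "output": 0.0015},
}

tracker = TokenTracker()
tracker.record("frontier-model", input_tokens=12_000, output_tokens=400)
tracker.record("small-model", input_tokens=3_000, output_tokens=250)

report = tracker.report()
print(f"Total spend so far: ${report['total_cost']:.2f}")

# Placeholder threshold: flag tasks sending ~20x more tokens than they get back.
if report["input_output_ratio"] > 20:
    print("Likely context bloat -- trim history or summarize before the next call.")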
Tool Call Observability
Every tool call should be instrumented with:
- Latency — how long did the tool take?
- Success/failure — did it return a usable result?
- Input sanitization — what did the agent actually pass to the tool?
- Output size — how much data came back (and how much got truncated)?
import asyncio

async def instrumented_tool_call(tool, args, tracer):
    span = tracer.start_span(f"tool.{tool.name}")
    # Record the sanitized arguments so the trace never stores raw secrets
    span.set_attribute("tool.args", sanitize(args))
    try:
        result = await asyncio.wait_for(tool.execute(args), timeout=30)
        span.set_attribute("tool.status", "success")
        span.set_attribute("tool.result_size", len(str(result)))
        return result
    except asyncio.TimeoutError:
        span.set_attribute("tool.status", "timeout")
        raise
    except Exception as e:
        span.set_attribute("tool.status", "error")
        span.set_attribute("tool.error", str(e))
        raise
    finally:
        span.end()
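The sanitize helper above is deliberately left undefined in the snippet; one plausible version, purely as a sketch, truncates oversized values and masks keys that look like secrets:

import re

SECRET_KEY_PATTERN = re.compile(r"(api[_-]?key|token|password|secret)", re.IGNORECASE)
MAX_VALUE_LEN = 200  # arbitrary cap so span attributes stay small

def sanitize(args: dict) -> dict:
    cleaned = {}
    for key, value in args.items():
        if SECRET_KEY_PATTERN.search(key):
            cleaned[key] = "[REDACTED]"
            continue
        text = str(value)
        cleaned[key] = text if len(text) <= MAX_VALUE_LEN else text[:MAX_VALUE_LEN] + "...[truncated]"
    return cleaned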
Alerting That Makes Sense
Traditional alerts (error rate > 5%) don’t work for agents. Instead, alert on:
- Cost per task exceeding baseline by 3x
- Reasoning loop depth exceeding maximum (agent is stuck)
- Tool failure rate for a specific tool crossing threshold
- Goal completion rate dropping below historical average
- Token usage approaching budget limits
These alerts catch the problems that matter: runaway costs, stuck agents, and degrading capabilities.
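Here is a minimal sketch of how those checks could run over a window of recent traces. Every threshold is a placeholder to tune against your own baselines, and alert() is a stand-in for whatever notification hook you use:

from collections import defaultdict

def alert(message: str) -> None:
    print(f"[ALERT] {message}")  # stand-in; wire this to Slack, PagerDuty, etc.

def check_agent_alerts(traces: list[AgentTrace], baseline_cost: float,
                       max_reasoning_steps: int = 25, token_budget: int = 5_000_000) -> None:
    for trace in traces:
        if trace.total_cost > 3 * baseline_cost:
            alert(f"Cost anomaly: task {trace.trace_id} cost ${trace.total_cost:.2f}")
        if trace.reasoning_steps > max_reasoning_steps:
            alert(f"Possible stuck loop: {trace.trace_id} took {trace.reasoning_steps} steps")

    # Per-tool failure rate over the window (minimum sample size to avoid noise).
    tool_stats = defaultdict(lambda: [0, 0])  # tool name -> [failures, total]
    for trace in traces:
        for call in trace.tool_calls:
            tool_stats[call.tool_name][1] += 1
            if call.status != "success":
                tool_stats[call.tool_name][0] += 1
    for tool, (failed, total) in tool_stats.items():
        if total >= 10 and failed / total > 0.2:
            alert(f"Tool {tool} failed {failed}/{total} recent calls")

    # Goal completion rate and token budget across the whole window.
    if traces:
        completion_rate = sum(t.outcome == "success" for t in traces) / len(traces)
        if completion_rate < 0.8:  # placeholder for your historical average
            alert(f"Goal completion rate dropped to {completion_rate:.0%}")
    if sum(t.total_tokens for t in traces) > 0.9 * token_budget:
        alert("Token usage has reached 90% of budget")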
Takeaways
- Trace every LLM call, tool call, and decision point
- Track token economics as a first-class metric
- Build dashboards around agent-specific concerns, not generic APM
- Alert on cost anomalies and stuck loops, not just errors
- Keep decision audit trails — you’ll need them when the agent does something unexpected