Observability for AI Agent Systems
You can’t debug what you can’t see. Agent systems are notoriously difficult to observe because the interesting behavior happens inside the model — and you only see inputs and outputs.
Why Traditional APM Fails
Standard application monitoring tracks request latency, error rates, and throughput. For agent systems, these metrics miss the point entirely:
- A “successful” request might have taken 15 LLM calls internally
- Latency varies by 10x depending on the model’s reasoning path
- Error rates don’t capture “the agent did the wrong thing correctly”
You need metrics that understand agent-specific concerns: token spend per task, tool call success rates, reasoning loop depth, and goal completion rates.
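As a rough sketch of what that can look like in practice, the snippet below exposes those signals as Prometheus metrics. The metric names, buckets, and the trace fields it reads are illustrative assumptions, not part of any particular framework:

from prometheus_client import Counter, Histogram

# Illustrative metric names and buckets -- adapt to your own conventions.
TOKENS_PER_TASK = Histogram(
    "agent_tokens_per_task", "Total tokens consumed per completed task",
    buckets=(1_000, 5_000, 10_000, 50_000, 100_000, 500_000),
)
REASONING_DEPTH = Histogram("agent_reasoning_steps", "Reasoning-loop iterations per task")
TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls made by the agent", ["tool", "status"])
TASK_OUTCOMES = Counter("agent_task_outcomes_total", "Final task outcomes", ["outcome"])

def record_task(trace):
    # `trace` is assumed to carry the fields of the trace model described below.
    TOKENS_PER_TASK.observe(trace.total_tokens)
    REASONING_DEPTH.observe(trace.reasoning_steps)
    TASK_OUTCOMES.labels(outcome=trace.outcome).inc()
    for call in trace.tool_calls:
        TOOL_CALLS.labels(tool=call.tool_name, status=call.status).inc()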
The Trace Model
Every agent invocation should produce a structured trace:
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class AgentTrace:
    trace_id: str
    task: str
    started_at: datetime
    completed_at: datetime
    outcome: Literal["success", "failure", "timeout", "escalated"]

    # Agent-specific metrics
    total_tokens: int
    total_cost: float
    llm_calls: int
    tool_calls: list[ToolCallTrace]   # ToolCallTrace and Decision are defined separately
    reasoning_steps: int

    # Decision audit trail
    decisions: list[Decision]
    context_at_each_step: list[str]
The decisions field is crucial. It captures why the agent did what it did — not just what it did. This turns debugging from guesswork into analysis.
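The ToolCallTrace and Decision types referenced above aren't spelled out in this post; here is a minimal sketch of what they might carry, with field names as assumptions rather than a fixed schema:

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class ToolCallTrace:
    tool_name: str
    arguments: dict
    status: Literal["success", "error", "timeout"]
    latency_ms: float
    result_size: int

@dataclass
class Decision:
    step: int
    options_considered: list[str]
    chosen: str
    rationale: str       # the model's stated reason, captured verbatim
    timestamp: datetime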
Token Economics Dashboard
Token usage is to an agent system what the cloud bill is to traditional infrastructure. Track it ruthlessly:
from collections import defaultdict

# MODEL_PRICING maps model name -> {"input": ..., "output": ...} prices per 1,000 tokens.
class TokenTracker:
    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0, "cost": 0.0})

    def record(self, model: str, input_tokens: int, output_tokens: int):
        pricing = MODEL_PRICING[model]
        cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1000
        self.usage[model]["input"] += input_tokens
        self.usage[model]["output"] += output_tokens
        self.usage[model]["cost"] += cost

    def report(self) -> dict:
        return {
            "total_cost": sum(m["cost"] for m in self.usage.values()),
            "by_model": dict(self.usage),
            "input_output_ratio": self._io_ratio(),
        }

    def _io_ratio(self) -> float:
        # Ratio of input tokens to output tokens across all models
        total_in = sum(m["input"] for m in self.usage.values())
        total_out = sum(m["output"] for m in self.usage.values())
        return total_in / total_out if total_out else 0.0
A high input-to-output ratio often signals context bloat — the agent is sending too much history with each request.
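As a usage sketch, the check below flags a suspicious ratio after recording a couple of calls. The model names, prices, and the 20:1 threshold are all placeholders:

# Illustrative per-1K-token prices; substitute your provider's real price sheet.
MODEL_PRICING = {
    "frontier-model": {"input": 0.005, "output": 0.015},
    "small-model": {"input": 0.0005, "output": 0.0015},
}

tracker = TokenTracker()
tracker.record("frontier-model", input_tokens=12_000, output_tokens=400)
tracker.record("small-model", input_tokens=3_000, output_tokens=250)

report = tracker.report()
print(f"Total spend so far: ${report['total_cost']:.2f}")

# Placeholder threshold: flag tasks sending ~20x more tokens than they get back.
if report["input_output_ratio"] > 20:
    print("Likely context bloat -- trim history or summarize before the next call.")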
Tool Call Observability
Every tool call should be instrumented with:
- Latency — how long did the tool take?
- Success/failure — did it return a usable result?
- Input sanitization — what did the agent actually pass to the tool?
- Output size — how much data came back (and how much got truncated)?
import asyncio

async def instrumented_tool_call(tool, args, tracer):
    span = tracer.start_span(f"tool.{tool.name}")
    # Record the sanitized arguments so the trace never stores raw secrets
    span.set_attribute("tool.args", sanitize(args))
    try:
        result = await asyncio.wait_for(tool.execute(args), timeout=30)
        span.set_attribute("tool.status", "success")
        span.set_attribute("tool.result_size", len(str(result)))
        return result
    except asyncio.TimeoutError:
        span.set_attribute("tool.status", "timeout")
        raise
    except Exception as e:
        span.set_attribute("tool.status", "error")
        span.set_attribute("tool.error", str(e))
        raise
    finally:
        span.end()
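The sanitize helper above is deliberately left undefined in the snippet; one plausible version, purely as a sketch, truncates oversized values and masks keys that look like secrets:

import re

SECRET_KEY_PATTERN = re.compile(r"(api[_-]?key|token|password|secret)", re.IGNORECASE)
MAX_VALUE_LEN = 200  # arbitrary cap so span attributes stay small

def sanitize(args: dict) -> dict:
    cleaned = {}
    for key, value in args.items():
        if SECRET_KEY_PATTERN.search(key):
            cleaned[key] = "[REDACTED]"
            continue
        text = str(value)
        cleaned[key] = text if len(text) <= MAX_VALUE_LEN else text[:MAX_VALUE_LEN] + "...[truncated]"
    return cleaned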
Alerting That Makes Sense
Traditional alerts (error rate > 5%) don’t work for agents. Instead, alert on:
- Cost per task exceeding baseline by 3x
- Reasoning loop depth exceeding maximum (agent is stuck)
- Tool failure rate for a specific tool crossing threshold
- Goal completion rate dropping below historical average
- Token usage approaching budget limits
These alerts catch the problems that matter: runaway costs, stuck agents, and degrading capabilities.
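Here is a minimal sketch of how those checks could run over a window of recent traces. Every threshold is a placeholder to tune against your own baselines, and alert() is a stand-in for whatever notification hook you use:

from collections import defaultdict

def alert(message: str) -> None:
    print(f"[ALERT] {message}")  # stand-in; wire this to Slack, PagerDuty, etc.

def check_agent_alerts(traces: list[AgentTrace], baseline_cost: float,
                       max_reasoning_steps: int = 25, token_budget: int = 5_000_000) -> None:
    for trace in traces:
        if trace.total_cost > 3 * baseline_cost:
            alert(f"Cost anomaly: task {trace.trace_id} cost ${trace.total_cost:.2f}")
        if trace.reasoning_steps > max_reasoning_steps:
            alert(f"Possible stuck loop: {trace.trace_id} took {trace.reasoning_steps} steps")

    # Per-tool failure rate over the window (minimum sample size to avoid noise).
    tool_stats = defaultdict(lambda: [0, 0])  # tool name -> [failures, total]
    for trace in traces:
        for call in trace.tool_calls:
            tool_stats[call.tool_name][1] += 1
            if call.status != "success":
                tool_stats[call.tool_name][0] += 1
    for tool, (failed, total) in tool_stats.items():
        if total >= 10 and failed / total > 0.2:
            alert(f"Tool {tool} failed {failed}/{total} recent calls")

    # Goal completion rate and token budget across the whole window.
    if traces:
        completion_rate = sum(t.outcome == "success" for t in traces) / len(traces)
        if completion_rate < 0.8:  # placeholder for your historical average
            alert(f"Goal completion rate dropped to {completion_rate:.0%}")
    if sum(t.total_tokens for t in traces) > 0.9 * token_budget:
        alert("Token usage has reached 90% of budget")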
Takeaways
- Trace every LLM call, tool call, and decision point
- Track token economics as a first-class metric
- Build dashboards around agent-specific concerns, not generic APM
- Alert on cost anomalies and stuck loops, not just errors
- Keep decision audit trails — you’ll need them when the agent does something unexpected