Building Reliable Agent Loops: Retry, Backoff, and Circuit Breakers

Most agent frameworks focus on the happy path — prompt in, action out, repeat. But production agent systems spend most of their complexity budget on what happens when things go wrong.

The Problem with Naive Loops

A bare while loop works fine in demos. In production, it falls apart the moment an LLM API returns a 429 (rate limited), a tool call times out, or the model hallucinates an action that doesn’t exist.

# This will hurt you in production
while not done:
    response = llm.call(prompt)
    result = execute_action(response.action)
    prompt = update_context(result)

The failure modes are predictable:

  • Rate limits: immediate retries all hit the same 429, so the retry budget is gone instantly
  • Timeouts cascade into the next iteration
  • Invalid actions create infinite correction loops

Retry with Exponential Backoff

The first layer of defense is structured retry logic: cap the number of attempts and wait longer before each one.

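import asyncio
import random

# RateLimitError is assumed to come from your LLM client library;
# MaxRetriesExceeded is assumed to be a custom exception defined elsewhere.
async def resilient_call(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # out of attempts: skip the pointless final sleep
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
        # Exponential backoff plus jitter before the next attempt
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        await asyncio.sleep(delay)
    raise MaxRetriesExceeded()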

The jitter (random.uniform) spreads retries out, so agents that were rate-limited at the same moment don’t all come back in lockstep and trigger a thundering herd.
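
To make the schedule concrete, here is a minimal sketch of the delay calculation on its own. The helper name backoff_delay and the max_delay cap are assumptions added for illustration; the growth formula is the same one used in resilient_call.

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_jitter=0.5, max_delay=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: exponential growth plus uniform jitter,
    capped at max_delay so late retries don't wait unboundedly.
    """
    exponential = base_delay * (2 ** attempt)
    jitter = random.uniform(0, max_jitter)
    return min(exponential + jitter, max_delay)

# With jitter switched off, the first four delays simply double.
schedule = [backoff_delay(a, max_jitter=0.0) for a in range(4)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0]
```

With the default jitter, each value lands somewhere in a half-second window above those numbers. The cap is worth considering once max_retries grows, since attempt 10 would otherwise wait over 17 minutes.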

Circuit Breakers for Tool Calls

When a tool consistently fails, continuing to call it wastes tokens and time. A circuit breaker stops calling the tool after a threshold of consecutive failures, then lets a trial call through once a cooldown has passed:

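import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class ToolCircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0  # consecutive failures since the last success
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before probing again
        self.last_failure = None
        # closed = normal, open = blocking, half-open = trial call after cooldown
        self.state = "closed"

    async def call(self, tool_fn, *args, **kwargs):
        if self.state == "open":
            # monotonic() is unaffected by wall-clock adjustments
            elapsed = time.monotonic() - self.last_failure
            if elapsed > self.reset_timeout:
                self.state = "half-open"
            else:
                remaining = self.reset_timeout - elapsed
                raise CircuitOpenError(f"Tool circuit open, retry after {remaining:.0f}s")

        try:
            result = await tool_fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            # A failed half-open probe reopens the circuit immediately
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        self.state = "closed"
        return result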

When the circuit opens, the agent can fall back to alternative tools or ask the user for help — instead of burning through rate limits on a broken dependency.

Graceful Degradation

The final piece is teaching the agent what to do when capabilities are unavailable. This means maintaining a fallback chain:

  1. Primary tool → direct execution
  2. Alternative tool → different implementation, same capability
  3. Human escalation → ask the user
  4. Graceful skip → acknowledge limitation, continue with reduced capability
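
The chain above can be sketched as an ordered list of handlers that is walked until one succeeds. Everything named here (ToolUnavailable, run_with_fallbacks, and the three stand-in handlers) is hypothetical and exists only for illustration; a real agent would plug in its own tools and its own way of asking the user.

```python
import asyncio

class ToolUnavailable(Exception):
    """Hypothetical error a tool raises when it cannot serve the request."""

async def run_with_fallbacks(request, handlers):
    """Try each (name, handler) pair in order and return the first success.

    Returns (name, result). If every handler fails, degrade gracefully
    by returning ("skipped", None) so the agent can carry on.
    """
    for name, handler in handlers:
        try:
            return name, await handler(request)
        except ToolUnavailable as exc:
            # Record the failure and move down the chain.
            print(f"{name} unavailable: {exc}")
    return "skipped", None

# Stand-in handlers for illustration only.
async def primary_search(query):
    raise ToolUnavailable("circuit open")

async def backup_search(query):
    return f"results for {query!r}"

async def ask_user(query):
    return f"user-provided answer for {query!r}"

chain = [
    ("primary", primary_search),      # 1. primary tool
    ("alternative", backup_search),   # 2. alternative tool
    ("human", ask_user),              # 3. human escalation
]

used, result = asyncio.run(run_with_fallbacks("agent retries", chain))
print(used, result)
```

Here the primary tool fails, so the call is served by the alternative. Returning the handler name alongside the result lets the agent tell the user when it answered with reduced capability.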

The key insight: reliability in agent systems isn’t about preventing failures. It’s about making failures boring — predictable, contained, and recoverable.

Takeaways

  • Always add jitter to retry delays
  • Use circuit breakers for external tool calls
  • Design fallback chains before you need them
  • Log everything — debugging agent loops without traces is painful
  • Set hard limits on total iterations to prevent runaway costs
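
The last two takeaways fit in a few lines. This is a minimal sketch, assuming a step callable that stands in for one prompt, act, update cycle; run_agent and IterationLimitExceeded are hypothetical names, not part of any framework.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

class IterationLimitExceeded(Exception):
    """Hypothetical error raised when the loop hits its hard cap."""

def run_agent(step, max_iterations=10):
    """Call step(i) until it returns True, or fail after max_iterations.

    Returns the number of iterations used.
    """
    for i in range(max_iterations):
        log.info("iteration %d starting", i)
        if step(i):
            log.info("finished after %d iterations", i + 1)
            return i + 1
    raise IterationLimitExceeded(f"gave up after {max_iterations} iterations")

# A task that finishes on its third step.
print(run_agent(lambda i: i == 2))  # 3
```

Raising instead of returning silently makes a runaway loop show up in traces as a distinct failure instead of looking like a slow success.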
