Building Reliable Agent Loops: Retry, Backoff, and Circuit Breakers

Most agent frameworks focus on the happy path — prompt in, action out, repeat. But production agent systems spend most of their complexity budget on what happens when things go wrong.

The Problem with Naive Loops

A simple while True agent loop works fine in demos. In production, it falls apart the moment an LLM API returns a 429, a tool call times out, or the model hallucinates an action that doesn’t exist.

# This will hurt you in production
while not done:
    response = llm.call(prompt)
    result = execute_action(response.action)
    prompt = update_context(result)

The failure modes are predictable:

Rate limits burn through retries instantly
Timeouts cascade into the next iteration
Invalid actions create infinite correction loops

Retry with Exponential Backoff

The first layer of defense is structured retry logic. Don’t just retry — retry intelligently:

async def resilient_call(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            continue
    raise MaxRetriesExceeded()

The jitter (random.uniform) prevents thundering herd problems when multiple agents hit the same API simultaneously.

Circuit Breakers for Tool Calls

When a tool consistently fails, continuing to call it wastes tokens and time. A circuit breaker pattern short-circuits after a threshold:

class ToolCircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = None
        self.state = "closed"  # closed = normal, open = blocking

    async def call(self, tool_fn, *args):
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError(f"Tool circuit open, retry after {self.reset_timeout}s")

        try:
            result = await tool_fn(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception as e:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise

When the circuit opens, the agent can fall back to alternative tools or ask the user for help — instead of burning through rate limits on a broken dependency.

Graceful Degradation

The final piece is teaching the agent what to do when capabilities are unavailable. This means maintaining a fallback chain:

Primary tool → direct execution
Alternative tool → different implementation, same capability
Human escalation → ask the user
Graceful skip → acknowledge limitation, continue with reduced capability

The key insight: reliability in agent systems isn’t about preventing failures. It’s about making failures boring — predictable, contained, and recoverable.

Takeaways

Always add jitter to retry delays
Use circuit breakers for external tool calls
Design fallback chains before you need them
Log everything — debugging agent loops without traces is painful
Set hard limits on total iterations to prevent runaway costs