Building Reliable Agent Loops: Retry, Backoff, and Circuit Breakers
Most agent frameworks focus on the happy path — prompt in, action out, repeat. But production agent systems spend most of their complexity budget on what happens when things go wrong.
The Problem with Naive Loops
A simple while True agent loop works fine in demos. In production, it falls apart the moment an LLM API returns a 429, a tool call times out, or the model hallucinates an action that doesn’t exist.
    # This will hurt you in production
    while not done:
        response = llm.call(prompt)
        result = execute_action(response.action)
        prompt = update_context(result)
The failure modes are predictable:
- Rate limits burn through retries instantly
- Timeouts cascade into the next iteration
- Invalid actions create infinite correction loops
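The third failure mode can be contained by checking the model's chosen action against the set of registered tools and capping how many corrections it gets. A minimal sketch, assuming a hypothetical `TOOLS` registry and a `propose_action` callable that stands in for the LLM call (neither name comes from the loop above):

```python
from typing import Callable, Dict, Optional

# Hypothetical tool registry: action name -> callable. Names are illustrative.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda arg: f"results for {arg}",
    "echo": lambda arg: arg,
}


class InvalidActionError(Exception):
    """Raised when the model keeps proposing actions that are not registered."""


def run_validated_action(
    propose_action: Callable[[Optional[str]], str],
    arg: str,
    max_corrections: int = 2,
) -> str:
    """Ask for an action, re-prompting with feedback at most max_corrections times."""
    feedback: Optional[str] = None
    for _ in range(max_corrections + 1):
        name = propose_action(feedback)
        if name in TOOLS:
            return TOOLS[name](arg)
        # Tell the model what went wrong instead of silently retrying.
        feedback = f"Unknown action {name!r}. Valid actions: {sorted(TOOLS)}"
    # The correction budget is spent: fail loudly rather than loop forever.
    raise InvalidActionError(feedback)
```

Because the correction budget is finite, a model that keeps inventing tools produces one clear exception instead of an open-ended loop.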
Retry with Exponential Backoff
The first layer of defense is structured retry logic. Don’t just retry — retry intelligently:
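    import asyncio
    import random

    # RateLimitError is assumed to come from your API client, and
    # MaxRetriesExceeded from your own application code.
    # Before Python 3.11, asyncio timeouts raise asyncio.TimeoutError instead.
    async def resilient_call(fn, max_retries=3, base_delay=1.0):
        for attempt in range(max_retries):
            try:
                return await fn()
            except RateLimitError:
                if attempt == max_retries - 1:
                    break  # no point sleeping after the final attempt
                # Exponential backoff plus a little random jitter
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                await asyncio.sleep(delay)
            except TimeoutError:
                if attempt == max_retries - 1:
                    raise
                continue  # timeouts are retried immediately
        raise MaxRetriesExceeded()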
The jitter (random.uniform) spreads retries out so that multiple agents hitting the same API do not all retry at the same instant, which is the thundering herd problem.
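To make the schedule concrete, here is the delay calculation on its own. This is a sketch: the `max_delay` cap and the `max_jitter` parameter are assumptions added for illustration and are not part of the retry function above.

```python
import random


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_jitter: float = 0.5,
    max_delay: float = 30.0,
) -> float:
    """Delay before retry number `attempt` (0-based).

    The exponential term is capped first and the jitter is added afterwards,
    so callers that hit the cap still end up with slightly different delays.
    """
    exponential = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, max_jitter)
    return exponential + jitter


if __name__ == "__main__":
    for attempt in range(5):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")
```

With the defaults, the base delays are 1, 2, 4, 8, and 16 seconds, each plus up to half a second of jitter. A fixed half-second spread becomes small relative to the later delays, so some systems scale the jitter with the delay instead.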
Circuit Breakers for Tool Calls
When a tool consistently fails, continuing to call it wastes tokens and time. A circuit breaker stops calling the tool after a threshold of consecutive failures and rejects further calls until a reset timeout has passed:
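    import time

    class CircuitOpenError(Exception):
        """Raised when a call is rejected because the circuit is open."""

    class ToolCircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=60):
            self.failures = 0
            self.threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.last_failure = None
            # closed = normal, open = blocking, half-open = probing after the timeout
            self.state = "closed"

        async def call(self, tool_fn, *args):
            if self.state == "open":
                # monotonic() is unaffected by system clock adjustments
                elapsed = time.monotonic() - self.last_failure
                if elapsed > self.reset_timeout:
                    self.state = "half-open"  # let a trial call through
                else:
                    remaining = self.reset_timeout - elapsed
                    raise CircuitOpenError(f"Tool circuit open, retry in {remaining:.0f}s")
            try:
                result = await tool_fn(*args)
            except Exception:
                self.failures += 1
                self.last_failure = time.monotonic()
                # A failed trial call in half-open reopens the circuit right away.
                if self.state == "half-open" or self.failures >= self.threshold:
                    self.state = "open"
                raise
            # Success: reset the failure count and close the circuit.
            self.failures = 0
            self.state = "closed"
            return result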
When the circuit opens, the agent can fall back to alternative tools or ask the user for help — instead of burning through rate limits on a broken dependency.
Graceful Degradation
The final piece is teaching the agent what to do when capabilities are unavailable. This means maintaining a fallback chain:
- Primary tool → direct execution
- Alternative tool → different implementation, same capability
- Human escalation → ask the user
- Graceful skip → acknowledge limitation, continue with reduced capability
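The chain above can be sketched as an ordered list of handlers that are tried in turn. This is a minimal illustration, not a definitive implementation: `run_with_fallbacks`, `primary_search`, and `backup_search` are hypothetical names, and human escalation would simply be one more handler in the list.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, Tuple

# Each link in the chain is a (label, async callable) pair, tried in order.
Handler = Tuple[str, Callable[[str], Awaitable[str]]]


async def run_with_fallbacks(
    task: str, handlers: Sequence[Handler]
) -> Tuple[Optional[str], str]:
    """Try each handler in order and return (label, result) from the first success.

    If every handler fails, return (None, message) so the caller can
    acknowledge the limitation and continue with reduced capability.
    """
    errors = []
    for label, handler in handlers:
        try:
            return label, await handler(task)
        except Exception as exc:  # broad on purpose: any failure moves to the next link
            errors.append(f"{label}: {exc}")
    return None, f"Skipped {task!r}; all handlers failed ({'; '.join(errors)})"


# Hypothetical handlers, for illustration only.
async def primary_search(task: str) -> str:
    raise ConnectionError("primary tool unavailable")


async def backup_search(task: str) -> str:
    return f"backup results for {task}"


if __name__ == "__main__":
    chain = [("primary", primary_search), ("alternative", backup_search)]
    print(asyncio.run(run_with_fallbacks("find docs", chain)))
```

Returning the label alongside the result lets the agent record which link in the chain actually served the request, which is useful when reviewing traces later.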
The key insight: reliability in agent systems isn’t about preventing failures. It’s about making failures boring — predictable, contained, and recoverable.
Takeaways
- Always add jitter to retry delays
- Use circuit breakers for external tool calls
- Design fallback chains before you need them
- Log everything — debugging agent loops without traces is painful
- Set hard limits on total iterations to prevent runaway costs
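The last two takeaways fit in a few lines. A sketch, assuming a hypothetical `step` callable that wraps one LLM call plus action execution and returns the next prompt together with a done flag:

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("agent_loop")


class IterationLimitExceeded(Exception):
    """Raised when the loop hits its hard cap without finishing."""


def run_agent(
    step: Callable[[str], Tuple[str, bool]],
    prompt: str,
    max_iterations: int = 20,
) -> str:
    """Run `step` until it reports done, or fail loudly at max_iterations.

    `step` is a hypothetical stand-in for one LLM call plus action
    execution; it returns (next_prompt, done).
    """
    for iteration in range(1, max_iterations + 1):
        # One log line per iteration gives a trace to debug from.
        logger.info("iteration=%d prompt=%r", iteration, prompt)
        prompt, done = step(prompt)
        if done:
            logger.info("finished after %d iterations", iteration)
            return prompt
    raise IterationLimitExceeded(f"no result after {max_iterations} iterations")
```

The default of 20 is arbitrary. Pick a cap based on how many steps a legitimate task needs and what a runaway loop would cost.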