Resilience Patterns: Retry, Circuit Breaker, and Timeouts

2 May 2026

A few months ago I was reviewing an incident postmortem where a payment service took down the entire checkout flow for 40 minutes. The root cause wasn't a bug. The downstream payment provider had a brief hiccup, maybe 15 seconds of high latency, and the checkout service kept retrying every request in a tight loop. Thread pool exhausted. Connections piled up. The payment provider recovered but the checkout service didn't, because it had already spent all its resources hammering a service that was already struggling.

The team had implemented retries. They hadn't implemented the rest.

Resilience patterns are not a menu you can pick from. They compose. Retry without a circuit breaker is a liability during an outage. A circuit breaker without timeouts may never open, because a call that hangs forever never registers as a failure. Timeouts without retry just fail fast with no recovery. Used together, they give your system the ability to degrade gracefully instead of collapsing.

Start with timeouts

Every network call needs a timeout. This isn't optional and it isn't subtle. If you're making an HTTP request to a downstream service with no timeout configured, you're one slow endpoint away from blocking all your threads indefinitely.

What surprises engineers is how low timeouts should be. For a service that normally responds in 50ms, a 5-second timeout feels safely conservative. It isn't. A 5-second timeout means every slow request ties up a thread for 5 seconds. At 200 requests per second, that is up to 1,000 requests in flight against a single slow dependency, which is more than enough to exhaust a typical thread or connection pool.

The right timeout is derived from your SLA, not from "what feels safe." If your endpoint needs to respond in 200ms, your downstream timeout budget might be 80ms. That leaves room for your own processing and for the occasional retry. Work backwards from the number you're accountable for.
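
Working backwards can be as simple as the arithmetic below. This is a minimal sketch, not a formula from any library: the function name is mine, the 40ms reserved for local processing is an assumed figure, and it ignores any backoff delay between attempts.

```python
def per_attempt_timeout_ms(sla_ms: float, own_processing_ms: float, max_attempts: int) -> float:
    """Split what is left of the SLA evenly across the attempts we allow."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    remaining = sla_ms - own_processing_ms
    if remaining <= 0:
        raise ValueError("own processing already uses the whole SLA")
    return remaining / max_attempts


# 200 ms SLA, 40 ms assumed for our own work, one call plus one retry.
budget = per_attempt_timeout_ms(sla_ms=200, own_processing_ms=40, max_attempts=2)
print(budget)  # 80.0
```

If the result looks uncomfortably small, that is usually a sign the retry count or the SLA needs revisiting, not the arithmetic.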

There's also the question of which timeout to set. Connection timeout (how long to wait to establish the connection) and read timeout (how long to wait for data after connecting) are different. Both matter. An overloaded service may accept the TCP connection but stall before sending a byte. Without a read timeout, you'd never know.
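
To make the difference concrete, here is a small sketch using only Python's standard `socket` module. The helper names and the timeout values are illustrative assumptions. The demo opens a local listener that completes the TCP handshake but never sends anything, which is roughly what a stalled service looks like from the client side.

```python
import socket


def fetch_first_bytes(host: str, port: int, connect_timeout: float, read_timeout: float) -> bytes:
    """Open a TCP connection and read once, with separate connect and read timeouts."""
    # Connect timeout: bounds the TCP handshake only.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: bounds each wait for data once connected.
        sock.settimeout(read_timeout)
        return sock.recv(1024)
    finally:
        sock.close()


def demo_stalled_server() -> str:
    """Connect to a local listener that accepts connections but never sends a byte."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    try:
        fetch_first_bytes("127.0.0.1", port, connect_timeout=1.0, read_timeout=0.2)
        return "got data"
    except socket.timeout:
        return "read timed out"
    finally:
        listener.close()
```

Higher-level clients usually expose the same split. The `requests` library, for example, accepts a `(connect, read)` tuple for its `timeout` argument.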

Retry: the naive version and why it fails

The instinct after adding timeouts is to wrap failed calls in a retry loop. Three attempts with a fixed delay. This works fine on a quiet staging environment and falls apart in production during the exact moment you need it.

Fixed-delay retry under load creates a synchronised wave of requests. If 1,000 clients all fail at 10:00:00 and retry at 10:00:01, the downstream service gets a second spike exactly one second later. If that spike fails too, it retries at 10:00:02. You've turned a brief outage into a sustained thundering herd that prevents the downstream service from recovering.

Exponential backoff eases the wave problem. Instead of retrying after 1 second each time, you retry after 1 second, then 2 seconds, then 4 seconds. The gaps grow exponentially, which spreads the load over time and gives the failing service room to recover. Many HTTP client and resilience libraries support this out of the box, so check before writing your own.
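
The schedule itself is a one-liner. The sketch below assumes a `cap` parameter, which the description above doesn't mention, so that the delay can't keep doubling forever.

```python
def backoff_delays(base_delay: float, max_retries: int, cap: float) -> list[float]:
    """Delay before each retry: base, 2*base, 4*base, ... never exceeding cap."""
    return [min(cap, base_delay * 2 ** attempt) for attempt in range(max_retries)]


print(backoff_delays(base_delay=1.0, max_retries=3, cap=30.0))  # [1.0, 2.0, 4.0]
print(backoff_delays(base_delay=1.0, max_retries=6, cap=10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```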

Jitter solves the remaining problem. Even with exponential backoff, if 1,000 clients all start at exactly the same moment, they'll all back off to exactly the same times. Jitter randomises each delay so the retries spread out across a window rather than landing as a simultaneous burst. AWS has a good write-up on this that compares several variants, but the simple version, which that write-up calls full jitter, is: delay = random_between(0, base_delay * 2^attempt). In practice you also cap the exponential term so the delay can't grow without bound.
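
Here is that formula as a runnable sketch, with a cap added as my own assumption. The simulation models 1,000 clients on the same retry attempt; the seed is there only to make the run repeatable.

```python
import random


def full_jitter_delay(attempt: int, base_delay: float, cap: float, rng: random.Random) -> float:
    """Pick a delay uniformly between 0 and the capped exponential backoff for this attempt."""
    ceiling = min(cap, base_delay * 2 ** attempt)
    return rng.uniform(0, ceiling)


# 1,000 simulated clients that all failed at the same instant, on attempt index 2.
rng = random.Random(42)  # seeded only so the sketch is repeatable
delays = [full_jitter_delay(attempt=2, base_delay=1.0, cap=30.0, rng=rng) for _ in range(1000)]
print(min(delays) >= 0.0, max(delays) <= 4.0)  # True True
```

Instead of 1,000 retries landing at the 4-second mark, they are scattered across the whole 0 to 4 second window.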

One more thing: not everything is worth retrying. HTTP 429 (rate limited) and 503 (service unavailable) are good candidates. HTTP 400 (bad request) is not: the request is malformed, and retrying it will produce the same 400 every time. Be explicit about which status codes or exception types trigger a retry.
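
Being explicit can look like the sketch below. `HttpError` and `call_with_retry` are hypothetical names for illustration, and the set of retryable codes is just the two mentioned above. Extend it deliberately, not by default.

```python
import random
import time

RETRYABLE_STATUS = frozenset({429, 503})  # rate limited, service unavailable


class HttpError(Exception):
    """Hypothetical error type carrying the response status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(send, max_attempts=3, base_delay=0.1, cap=2.0, sleep=time.sleep):
    """Call send(); retry only on retryable statuses, with capped full-jitter backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return send()
        except HttpError as err:
            out_of_attempts = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or out_of_attempts:
                raise  # 400s and exhausted retries go straight back to the caller
            sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```

The `sleep` parameter is injected so the loop can be tested without actually waiting.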

Circuit breaker: the pattern that makes retries safe

The checkout incident I described happened because retries kept hitting a service that was already under stress. The correct behaviour is: after a certain number of failures, stop trying and return an error immediately. Give the downstream service time to recover. Then try again carefully. This is what a circuit breaker does.

The pattern is a state machine with three states. In the closed state, requests flow through normally. Failures are counted. When failures exceed a threshold within a window, the breaker opens. In the open state, requests are rejected immediately without hitting the downstream service. After a configured timeout, the breaker moves to half-open. In this state, a limited number of probe requests are allowed through. If they succeed, the breaker closes again. If they fail, it goes back to open.

The key parameter choices are: failure threshold (how many failures trigger opening), time window (over what period), open duration (how long before trying again), and half-open probe count (how many requests to test with before fully closing). These need to be tuned per dependency. A payment provider that costs real money to call has different tolerances than an internal recommendation service.
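
The state machine and those four parameters fit in a small class. Treat this as a teaching sketch and not a replacement for a real library: it is not thread-safe, it opens when failures reach the threshold, and in half-open it simply lets calls through one after another until enough probes succeed. The default values are placeholders, not recommendations.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_seconds=10.0, open_seconds=30.0,
                 half_open_probes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.half_open_probes = half_open_probes
        self.clock = clock
        self.state = "closed"
        self.failures = deque()  # timestamps of recent failures while closed
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # open duration elapsed: allow probes
            self.probe_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure(self.clock())
            raise
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state == "half_open":
            self._open(now)  # a failed probe sends us straight back to open
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()  # forget failures outside the window
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_probes:
                self.state = "closed"
                self.failures.clear()

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

Passing the clock in as a parameter makes the transitions easy to test without sleeping.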

Libraries like Resilience4j (Java/Kotlin), Polly (.NET), and Hystrix (in maintenance mode and no longer actively developed, but still running in plenty of production systems) implement this pattern. Service meshes such as Istio can also enforce it at the network layer, so your application code doesn't have to. But even if the mesh handles it, you need to know the semantics so you can configure it correctly.

Circuit breakers also give you a natural place to implement fallback behaviour. When the breaker is open, instead of propagating an error, you can return a cached result, a default value, or a degraded response. A product page that shows a "price unavailable" message is better than a 500 error. Whether a fallback makes sense depends on whether the missing data is critical to the user's task.
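
A fallback for the price example might look like this. It's a sketch under assumptions: `CircuitOpenError` stands in for whatever your breaker library raises when open, and `fetch_price` and the plain dict cache are hypothetical.

```python
class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises while it is open."""


def price_label(fetch_price, cache: dict, product_id: str) -> str:
    """Return a display price, degrading to cached or placeholder text when the breaker is open."""
    try:
        price = fetch_price(product_id)
    except CircuitOpenError:
        if product_id in cache:
            # Stale data is acceptable here, so say so instead of failing the page.
            return f"{cache[product_id]} (may be out of date)"
        return "price unavailable"
    cache[product_id] = price  # remember the last good value for future fallbacks
    return price
```

Note that only the breaker-open case is caught. Other errors still propagate, so a genuine bug doesn't get papered over by the fallback.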

How these patterns compose

In a real service, they layer. A typical outbound call might look like this:

circuit breaker (fail fast if open)
  → retry with exponential backoff + jitter
    → timeout per attempt
      → actual HTTP call

The circuit breaker wraps everything. If the breaker is open, you never even attempt the retry loop. If it's closed, each attempt has its own timeout. Failures from timeouts count toward the breaker's failure threshold. After enough failures, the breaker opens and future callers fail fast until the service recovers.
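
Here is the same layering as a compact sketch. Everything in it is an assumption made for illustration: the breaker is count-based with no time window, `send` is a hypothetical callable that takes a timeout and raises the built-in `TimeoutError` when it expires (real clients raise their own exception types), and every failed attempt counts toward the breaker.

```python
import random
import time


class CircuitOpenError(Exception):
    pass


class SimpleBreaker:
    """Count-based breaker, deliberately minimal: no time window, one probe after the open period."""

    def __init__(self, failure_threshold, open_seconds, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the open period has elapsed, let a probe through (half-open).
        return self.clock() - self.opened_at >= self.open_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def resilient_call(send, breaker, attempt_timeout, max_attempts=3,
                   base_delay=0.1, cap=2.0, sleep=time.sleep):
    if not breaker.allow():  # outermost layer: fail fast if open
        raise CircuitOpenError("circuit open, not attempting call")
    for attempt in range(max_attempts):  # middle layer: retry loop
        try:
            result = send(attempt_timeout)  # innermost layers: timeout, then the call
        except TimeoutError:
            breaker.record_failure()  # each timed-out attempt counts toward the threshold
            if attempt == max_attempts - 1 or not breaker.allow():
                raise  # out of attempts, or the breaker opened mid-loop
            sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
        else:
            breaker.record_success()
            return result
```

The check inside the loop matters: once the breaker opens, the remaining retries are abandoned instead of continuing to hit the struggling service.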

The relationship between retry count and circuit breaker threshold matters. If the breaker counts every failed attempt, opens after 10 failures, and your retry logic tries 5 times per request, then two requests that both fail all their retries within the window are enough to trip it. That might be fine, or it might be more aggressive than you intended. Think it through explicitly.
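
The arithmetic is worth writing down when you pick the two numbers. This small helper is my own illustration and assumes every attempt fails, every failed attempt is counted, and all of them land inside the breaker's window.

```python
import math


def requests_to_trip(failure_threshold: int, attempts_per_request: int) -> int:
    """How many fully failing requests it takes to reach the breaker's threshold."""
    return math.ceil(failure_threshold / attempts_per_request)


print(requests_to_trip(failure_threshold=10, attempts_per_request=5))  # 2
print(requests_to_trip(failure_threshold=10, attempts_per_request=3))  # 4
```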

What these patterns don't solve

These patterns are not a substitute for fixing the actual problem. If a dependency is slow because your query is unindexed, adding a circuit breaker hides the degradation. If a service is failing because it's under-provisioned, retry with jitter distributes the load but doesn't create more capacity. Resilience patterns buy you time and prevent cascade failures. They do not fix the root cause.

They also don't help with correctness. If a request is non-idempotent (it writes data, sends an email, charges a card), retrying on failure can produce duplicate effects. Before adding retry logic to a write path, verify the endpoint is idempotent or use an idempotency key. A retry that causes a user to be charged twice is worse than no retry at all.
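
An idempotency key works roughly like the sketch below. `PaymentService` is a hypothetical in-memory stand-in for a provider that honours such keys. A real one would persist the keys and expire them, and its API will differ.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory stand-in for a provider that honours idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> receipt of the first successful charge
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Seen this key before: return the original receipt, charge nothing.
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
key = str(uuid.uuid4())  # generated once per logical operation, reused on every retry
first = service.charge(key, 4999)
retry = service.charge(key, 4999)  # e.g. the first response was lost to a timeout
print(first == retry, service.charges_made)  # True 1
```

The important detail is on the client side: the key is created once, before the first attempt, and every retry sends the same one.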

I've seen teams add all three patterns, check them off a list, and declare the system resilient. What they've built is a system that fails more slowly and more gracefully, which is valuable. But resilience is really about whether you can detect problems quickly, understand what's happening, and recover without human intervention. The patterns are tools. The discipline of instrumenting them, alerting on circuit breaker state changes, and reviewing timeout thresholds regularly is what makes them effective.

Most services that have been running for a few years have at least one timeout set to "whatever the default was" and a retry that nobody has thought about since it was first written. That's usually fine, until it isn't.


