System Design/senior/freq 5/5

Resilience: Circuit Breakers, Retries & Timeouts

Timeouts always. Retries with jitter and a budget. Circuit breakers to fail fast when a dependency is down. Skip any of them and you've built a cascade-failure machine.

resilience, circuit-breaker, retry

Deep dive

Timeouts

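Set one on every network call, including database calls. Java's built-in HTTP clients (HttpURLConnection and java.net.http.HttpClient) wait indefinitely unless you configure a timeout, and that default has caused real outages. Tune the timeout to sit just above the dependency's p99 latency.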
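
As a minimal sketch of explicit timeouts with the JDK's java.net.http client: the class name, the URL passed in, and the two durations are assumptions for illustration (picture a dependency whose p99 is around 800 ms), not values from the original.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TimeoutExample {
    // Hypothetical numbers: chosen for a dependency with a p99 of roughly 800 ms.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(500);
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(1);

    // Bounds how long establishing the connection may take.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    // Bounds how long we wait for the response to this particular request.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(REQUEST_TIMEOUT)
                .GET()
                .build();
    }
}
```

Both settings matter: without the connect timeout a dead host can hang the caller during connection setup, and without the request timeout a slow response can hang it afterwards.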

Retries

Only on idempotent operations. Use exponential backoff with full jitter. Budget: cap both the number of attempts and the total time spent retrying, so retries don't amplify load on a dependency that is already failing.
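
The retry rules above could be sketched roughly as follows. The Retry class, its parameter names, and the choice to retry on any exception are assumptions made for brevity; a real caller would retry only errors it knows to be transient.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class Retry {
    /** Full jitter: a uniform random delay in [0, min(cap, base * 2^attempt)]. */
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long exp = baseMillis << Math.min(attempt, 20); // clamp the shift so small bases cannot overflow
        long ceiling = Math.min(capMillis, exp);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    /** Calls op until it succeeds, attempts run out, or the time budget is spent. */
    static <T> T call(Callable<T> op, int maxAttempts, Duration budget,
                      long baseMillis, long capMillis) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        long deadline = System.nanoTime() + budget.toNanos();
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call(); // assumed idempotent by the caller
            } catch (Exception e) {
                last = e;
            }
            if (attempt == maxAttempts - 1) break; // attempt cap reached
            long sleep = backoffMillis(attempt, baseMillis, capMillis);
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000;
            if (sleep >= remainingMs) break; // the next wait would exceed the time budget
            Thread.sleep(sleep);
        }
        throw last;
    }
}
```

The jitter spreads clients out so they do not all retry at the same instant, while the attempt cap and time budget bound how much extra load a single caller can add.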

Circuit breaker

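When failures cross a threshold (N consecutive failures in the simplest form, or a failure rate over a sliding window), open the circuit: fail fast for a cooldown period, then go half-open and let probe requests through. Resilience4j is the standard JVM choice.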
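
To make the state machine concrete, here is a deliberately small sketch using the consecutive-failure rule. SimpleCircuitBreaker is a hypothetical class, not Resilience4j's API; it injects the clock so the cooldown can be tested, and it does not limit how many probes run concurrently while half-open, which a production breaker would.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long cooldownMillis;
    private final LongSupplier clock; // returns current time in millis; injected for testability
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    /** Current state; an open circuit becomes half-open once the cooldown has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= cooldownMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Callable<T> op) throws Exception {
        if (state() == State.OPEN) {
            // Fail fast without touching the dependency.
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = op.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // a successful probe closes the circuit
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // a failed probe re-opens immediately
            openedAt = clock.getAsLong();
        }
    }
}
```

In real code the clock would typically be System::currentTimeMillis, and a caller would catch the fail-fast exception and serve a fallback.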

Bulkhead

Isolate resource pools per dependency so a slow dependency can't exhaust the shared pool.
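
One lightweight way to sketch a bulkhead is a semaphore per dependency that rejects calls once a concurrency limit is reached. The Bulkhead class and its limit are illustrative assumptions; separate thread pools or connection pools per dependency are another common form.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs op if a permit is free; otherwise rejects immediately instead of queueing. */
    <T> T call(Callable<T> op) throws Exception {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full: rejecting call");
        }
        try {
            return op.call();
        } finally {
            permits.release(); // always return the permit, even when op throws
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```

With one Bulkhead instance per dependency, a slow dependency can occupy at most its own permits, and callers of other dependencies keep their capacity.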

Real-world example

From production

Our payment service had no circuit breaker for the fraud-check vendor. The vendor hit a 30-second slowdown; every thread in our pool blocked on it, and we returned 503 for unrelated traffic for 45 minutes. We added Resilience4j with a 5s timeout, a breaker that opens after 10 failures, and a fallback to a "manual review" queue. Next vendor incident: a blip of under 30 seconds, with degraded but functional service.
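
The timeout-plus-fallback part of that fix can be approximated with plain JDK futures. This is not the Resilience4j configuration the team used; FraudCheck, Decision, and the vendorCall parameter are hypothetical names, and the sketch only shows the shape of "time out, then degrade to manual review".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class FraudCheck {
    enum Decision { APPROVE, REJECT, MANUAL_REVIEW }

    /**
     * Starts the vendor call, gives it timeoutMillis to answer, and falls back
     * to MANUAL_REVIEW on timeout or on any other failure of the call.
     */
    static CompletableFuture<Decision> check(
            Supplier<CompletableFuture<Decision>> vendorCall, long timeoutMillis) {
        return vendorCall.get()
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS) // completes exceptionally on timeout
                .exceptionally(err -> Decision.MANUAL_REVIEW);   // degraded but functional
    }
}
```

In the incident described, timeoutMillis would correspond to the 5s timeout; a circuit breaker in front of this call would then stop sending requests at all once failures pile up.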

Interview questions

2 senior-level
Q1. Should you always retry failed network calls?

No. Retry only idempotent operations. Add jitter to avoid retry storms. Have a global budget so retries don't amplify a failing downstream. For non-idempotent operations, retry only after applying an idempotency key.
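
As a rough illustration of the idempotency-key idea on the receiving side: the handler below runs an operation at most once per key and returns the stored result to any retry that reuses the key. IdempotentHandler is a hypothetical class, and the in-memory map is an assumption for the sketch; real systems usually persist keys and expire them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotentHandler<R> {
    // Results keyed by the client-supplied idempotency key (unbounded here for simplicity).
    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    /** Runs operation at most once per key; a retry with the same key gets the stored result. */
    R handle(String idempotencyKey, Supplier<R> operation) {
        return resultsByKey.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

The client generates the key once per logical request and sends the same key on every retry, which is what makes retrying a charge or an order safe.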

Q2. How does a circuit breaker work?

Tracks failure rate in a sliding window. Above threshold it opens, failing fast and protecting the downstream. After a cooldown it half-opens, allowing trial requests. On success it closes; on failure it re-opens. Pair with timeouts and fallback behavior.

Common mistakes

  • Default infinite HTTP timeouts.

  • Retrying on 4xx (the request was wrong; retrying won't help). The usual exceptions are 408 and 429.

  • Per-method circuit breakers with no aggregate view in observability.
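
The 4xx point can be encoded as a small status classifier. The exact set below is one common convention and an assumption of this sketch, not a rule from the original: 408 and 429 are treated as retryable client-side statuses, along with the server-side statuses that usually signal a transient problem.

```java
import java.util.Set;

final class RetryPolicy {
    // Assumed convention: request timeout, rate limiting, and transient server errors.
    private static final Set<Integer> RETRYABLE = Set.of(408, 429, 500, 502, 503, 504);

    /** True only for statuses where sending the same request again may succeed. */
    static boolean isRetryableStatus(int status) {
        return RETRYABLE.contains(status);
    }
}
```

Everything else, including 400, 401, 403, and 404, is returned to the caller without a retry, since resending an invalid request only adds load.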

Trade-offs

  • Aggressive breakers cause false trips during transient blips.

  • Bulkheading wastes some resources but prevents pool exhaustion.
