Resilience: Circuit Breakers, Retries & Timeouts
Timeouts always. Retries with jitter and a budget. Circuit breakers to fail fast when a dependency is down. Skip any of them and you've built a cascade-failure machine.
Deep dive
Timeouts
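Set one on every network call, including database queries. Java's built-in HTTP clients (HttpURLConnection and java.net.http.HttpClient) apply no timeout by default, so a hung dependency can block a thread indefinitely; that has caused real outages. Tune each timeout to sit just above the dependency's p99 latency.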
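As a rough illustration of setting both timeouts explicitly, here is a minimal sketch using the JDK's java.net.http client (Java 11+). The class name TimeoutExample and the 2s/5s values are assumptions made for the example, not recommendations; real values should come from the dependency's measured p99.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch only: the timeout values below are placeholders.
final class TimeoutExample {
    // Connect timeout bounds how long establishing the connection may take.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    // Request timeout bounds how long we wait for the response;
    // HttpClient.send() throws HttpTimeoutException if it elapses.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }
}
```

Without these two calls, both timeouts are unset and a call can wait indefinitely.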
Retries
Only on idempotent operations. Exponential backoff with full jitter. Budget: cap both the number of attempts and the total time spent retrying, so retries cannot multiply load on a dependency that is already failing.
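The backoff, jitter, and budget rules can be sketched in plain Java roughly as follows. RetryWithJitter and its parameters are hypothetical names for illustration. This version retries on any exception, whereas a real one would retry only on failures known to be transient.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: call() is safe only when op is idempotent.
final class RetryWithJitter {
    // Full jitter: a uniformly random delay in [0, min(cap, base * 2^attempt)].
    static long backoffMillis(int attempt, long baseMs, long capMs) {
        long ceiling = Math.min(capMs, baseMs << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    // Retries op until it succeeds, attempts run out, or the time budget is spent.
    static <T> T call(Callable<T> op, int maxAttempts, long baseMs, long capMs, long budgetMs)
            throws Exception {
        long deadline = System.nanoTime() + budgetMs * 1_000_000L;
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                long sleepMs = backoffMillis(attempt, baseMs, capMs);
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                boolean outOfAttempts = attempt + 1 >= maxAttempts;
                if (outOfAttempts || sleepMs >= remainingMs) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(sleepMs);
            }
        }
    }
}
```

Randomizing over the whole interval, rather than adding a small random offset to a fixed delay, spreads retries from many clients so they do not arrive in synchronized waves.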
Circuit breaker
After N consecutive failures (or a failure rate above a threshold), open the circuit: fail fast for a cooldown period, then half-open to let a probe request through. Resilience4j is the standard JVM choice.
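To make the closed / open / half-open cycle concrete, here is a hand-rolled sketch of the consecutive-failure variant. It is not Resilience4j's implementation or API. SimpleCircuitBreaker and its methods are hypothetical, and the clock is passed in as a parameter purely so the cooldown is easy to exercise.

```java
import java.util.function.Supplier;

// Sketch only: a production breaker would also cap concurrent probes in HALF_OPEN.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // N consecutive failures before opening
    private final long cooldownMillis;  // how long to fail fast before probing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    synchronized State state() { return state; }

    <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (!allowRequest(nowMillis)) {
            return fallback.get(); // fail fast: the dependency is not called
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure(nowMillis);
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAtMillis < cooldownMillis) return false;
            state = State.HALF_OPEN; // cooldown over: let a probe through
        }
        return true;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        // A failed probe re-opens immediately; otherwise open at the threshold.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMillis = nowMillis;
        }
    }
}
```

A caller would typically pass System.currentTimeMillis() as nowMillis and a fallback that returns a cached or degraded result.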
Bulkhead
Isolate resource pools per dependency so a slow dependency can't exhaust the shared pool.
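One simple way to approximate a bulkhead is a semaphore per dependency that caps concurrent calls and rejects the overflow instead of queueing it. The Bulkhead class below is a hypothetical sketch along those lines. Dedicated thread pools per dependency are another common form.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch only: create one instance per dependency, each with its own permit count.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs action if a slot is free; otherwise rejects immediately via fallback
    // rather than letting callers pile up behind a slow dependency.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always free the slot, even if action throws
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}
```

With a cap of, say, 10 slots for the fraud-check vendor, a vendor slowdown can tie up at most 10 threads while the rest of the pool keeps serving unrelated traffic.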
Real-world example
From production: a payment service had no circuit breaker for the fraud-check vendor. The vendor had a 30-second slowdown; every thread in our pool blocked on it, and we returned 503 for unrelated traffic for 45 minutes. We added Resilience4j with a 5s timeout, a breaker that opens after 10 failures, and a fallback to a "manual review" queue. Next vendor incident: a <30s blip with degraded but functional service.
Interview questions
2 senior-level
Q1. Should you always retry failed network calls?
No. Retry only idempotent operations. Add jitter to avoid retry storms. Have a global budget so retries don't amplify a failing downstream. For non-idempotent operations, retry only after applying an idempotency key.
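The idempotency-key idea can be sketched on the receiving side as a lookup keyed by a client-supplied token. IdempotentHandler is a hypothetical name, and the in-memory map is an assumption for brevity. A real service would persist keys with an expiry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch only: results live in memory and are never evicted.
final class IdempotentHandler<R> {
    private final ConcurrentHashMap<String, R> resultsByKey = new ConcurrentHashMap<>();

    // Executes operation at most once per key; a retry carrying the same key
    // gets the stored result instead of repeating the side effect.
    // If operation throws, nothing is stored, so a later retry runs it again.
    R handle(String idempotencyKey, Supplier<R> operation) {
        return resultsByKey.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

The client generates the key once per logical operation and reuses it on every retry of that operation.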
Q2. How does a circuit breaker work?
Tracks failure rate in a sliding window. Above threshold it opens, failing fast and protecting the downstream. After a cooldown it half-opens, allowing trial requests. On success it closes; on failure it re-opens. Pair with timeouts and fallback behavior.
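The failure-rate tracking described in this answer could look roughly like the count-based ring buffer below. FailureRateWindow and the "only trip when the window is full" rule are assumptions for the sketch, not a description of any particular library.

```java
// Sketch only: a count-based sliding window over the most recent call outcomes.
final class FailureRateWindow {
    private final boolean[] failed; // ring buffer: true means that call failed
    private int next = 0;           // slot that the next outcome overwrites
    private int recorded = 0;       // number of outcomes currently in the window
    private int failures = 0;       // number of failed outcomes in the window

    FailureRateWindow(int size) {
        this.failed = new boolean[size];
    }

    synchronized void record(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) failures--; // evict the oldest outcome
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) failures++;
        next = (next + 1) % failed.length;
    }

    // Fraction of failed calls among those currently in the window.
    synchronized double failureRate() {
        return recorded == 0 ? 0.0 : (double) failures / recorded;
    }

    // Trip only once the window is full, so one early failure cannot open the circuit.
    synchronized boolean shouldOpen(double threshold) {
        return recorded == failed.length && failureRate() >= threshold;
    }
}
```

Compared with counting consecutive failures, a rate over a window still trips when failures are interleaved with occasional successes.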
Common mistakes
Default infinite HTTP timeouts.
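Retrying on 4xx (the request was wrong; retrying won't help). The usual exceptions are 408 Request Timeout and 429 Too Many Requests, which can be retried after backing off.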
Circuit breaker per-method with no aggregate view in observability.
Trade-offs
Aggressive breakers cause false trips during transient blips.
Bulkheading wastes some resources but prevents pool exhaustion.