Liveness vs Readiness Probes
A failed liveness probe restarts the container; a failed readiness probe removes the pod from Service endpoints. Confusing the two is a common cause of self-inflicted outages during deploys.
The rules
- Liveness fails → kubelet kills and restarts the container. Use for true unrecoverable state (deadlock, exhausted thread pool).
- Readiness fails → endpoint controller removes the pod from Services. Use for "I'm temporarily not ready" (DB reconnect, warmup, draining).
- Startup probe → disables liveness/readiness until the app is up. Use for slow starters; protects against premature liveness kills.
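One way to picture the three rules is as three separate HTTP endpoints, each answering a different question. The following is a minimal sketch using only Python's standard library. The paths /startupz, /livez and /readyz and the AppState fields are hypothetical names chosen for illustration; Kubernetes does not mandate them. It relies on the fact that an HTTP probe treats any status from 200 to 399 as success.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler


@dataclass
class AppState:
    """Hypothetical in-process flags that the application updates itself."""
    started: bool = False         # initialization has finished
    deadlocked: bool = False      # unrecoverable state; only a restart helps
    dependencies_ok: bool = True  # e.g. the DB connection is currently usable
    draining: bool = False        # shutting down, finishing in-flight work


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status a probe hitting `path` should receive."""
    if path == "/startupz":
        return 200 if state.started else 503
    if path == "/livez":
        # Process health only: no dependency checks here.
        return 503 if state.deadlocked else 200
    if path == "/readyz":
        ready = state.started and state.dependencies_ok and not state.draining
        return 200 if ready else 503
    return 404


def make_handler(state: AppState):
    """Build a request handler class bound to one AppState instance."""
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(self.path, state))
            self.end_headers()

        def log_message(self, *args):
            pass  # keep probe traffic out of the access log

    return ProbeHandler
```

To serve it, pass make_handler(state) to http.server.ThreadingHTTPServer and point each probe at its own path. Keeping probe_status a pure function makes the liveness/readiness split easy to unit test without opening a socket.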
Common anti-pattern
Pointing liveness at a /health endpoint that checks DB connectivity. When the DB hiccups, every pod restarts simultaneously, stampedes the DB on reconnect, and turns a 10-second blip into a 10-minute outage. Liveness should check process health, not dependency health.
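The lock-step restart can be shown with a toy simulation. This is a rough sketch under simplifying assumptions: every pod probes on the same schedule, the process itself stays healthy, and the restart counter resets after each restart. The function name count_restarts and the timeline are hypothetical, not taken from any real cluster.

```python
from typing import Sequence


def count_restarts(pods: int, db_up: Sequence[bool], liveness_checks_db: bool,
                   failure_threshold: int = 3) -> int:
    """Count container restarts across `pods` identical pods.

    db_up[i] says whether the database answered during probe period i.
    Every pod sees the same database, so all pods fail together.
    """
    restarts = 0
    for _ in range(pods):
        consecutive_failures = 0
        for up in db_up:
            # A process-only check passes regardless of the database.
            probe_ok = up if liveness_checks_db else True
            consecutive_failures = 0 if probe_ok else consecutive_failures + 1
            if consecutive_failures >= failure_threshold:
                restarts += 1             # kubelet restarts the container
                consecutive_failures = 0  # counting starts over after restart
    return restarts


# Hypothetical timeline: the DB is unreachable for three probe periods.
blip = [True, False, False, False, True]
coupled = count_restarts(20, blip, liveness_checks_db=True)
decoupled = count_restarts(20, blip, liveness_checks_db=False)
```

With these inputs, the DB-coupled liveness check restarts all 20 pods while the process-only check restarts none, which is the difference between a blip and a stampede.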
Real-world example
A retail platform suffered a 40-minute outage during Black Friday. The Postgres primary failed over (15 s blip). Every app pod's liveness probe checked the DB, all failed, kubelet restarted them in lock-step. On restart they hammered the still-recovering replica. Fix: liveness now only checks "JVM is responsive"; readiness checks dependencies. Next failover: 30 s of degraded reads, zero restarts.
Interview questions
Q1. What's the difference between liveness and readiness?
Liveness answers 'is this process alive?' — failure means restart. Readiness answers 'should I get traffic right now?' — failure means temporary removal from the Service. Liveness should be a cheap process check; readiness can check dependencies.
Q2. Why use a startup probe?
To prevent liveness/readiness from running until the app has finished initializing. Without it, a 60-second JVM startup would be killed by a liveness probe with a 30-second initialDelaySeconds. The startup probe disables the others until it succeeds.
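The arithmetic behind that answer can be worked through directly. The sketch below assumes the Kubernetes defaults of periodSeconds=10 and failureThreshold=3 and ignores timeoutSeconds and kubelet scheduling jitter, so the numbers are approximations. The function names are hypothetical.

```python
def earliest_liveness_kill(initial_delay: int, period: int = 10,
                           failure_threshold: int = 3) -> int:
    """Approximate seconds after container start at which the liveness probe
    records its final consecutive failure, assuming every probe fails."""
    # First probe runs at initial_delay; each further failure is one period later.
    return initial_delay + (failure_threshold - 1) * period


def startup_budget(period: int, failure_threshold: int) -> int:
    """Upper bound, in seconds, that a startup probe gives the app to come up."""
    return period * failure_threshold


jvm_startup = 60                          # seconds, from the example above
kill_at = earliest_liveness_kill(30)      # 30 + 2 * 10 = 50
killed_too_early = kill_at < jvm_startup  # the JVM never finishes starting

budget = startup_budget(period=10, failure_threshold=30)  # 10 * 30 = 300
fits = jvm_startup <= budget
```

Under these assumptions the liveness probe gives up at about 50 seconds, before the 60-second startup completes. A startup probe with a period of 10 seconds and a threshold of 30 allows up to 300 seconds without loosening liveness afterward.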
Common mistakes
- A liveness probe that checks downstream dependencies.
- The same endpoint for liveness and readiness.
- An aggressive (too low) failureThreshold that causes flapping during transient network blips.
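The flapping effect of an aggressive failureThreshold can be illustrated by replaying a sequence of probe results. This is a simplified model, not kubelet's implementation: it assumes the pod starts ready and applies only the consecutive-failure and consecutive-success counters. The function name and the sample sequences are hypothetical.

```python
from typing import Sequence


def not_ready_transitions(results: Sequence[bool], failure_threshold: int,
                          success_threshold: int = 1) -> int:
    """Count how many times a pod flips from ready to not-ready."""
    ready = True
    failures = successes = 0
    flips = 0
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if ready and failures >= failure_threshold:
                ready = False
                flips += 1
    return flips


# Isolated single-probe failures, as a flaky network might produce.
blips = [True, False, True, True, False, True, False, True]
flaps_strict = not_ready_transitions(blips, failure_threshold=1)
flaps_tolerant = not_ready_transitions(blips, failure_threshold=3)

# A sustained outage should still be detected with the higher threshold.
outage = [True, False, False, False, False, True]
detected = not_ready_transitions(outage, failure_threshold=3)
```

With a threshold of 1, the pod leaves the Service three times on isolated blips. With a threshold of 3, it rides through them and still reacts once to the sustained outage.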
Trade-offs
Permissive probes hide real failures; strict probes cause restart storms. Tune based on actual MTTR data.