Kubernetes & DevOps

Liveness vs Readiness Probes

Liveness restarts a broken container; readiness removes the pod from the Service's endpoints. Confusing the two is a frequent cause of self-inflicted outages during deploys.

kubernetes, probes, reliability

Deep dive

The rules

  • Liveness fails → kubelet kills and restarts the container. Use for truly unrecoverable state (deadlock, exhausted thread pool).
  • Readiness fails → endpoint controller removes the pod from Services. Use for "I'm temporarily not ready" (DB reconnect, warmup, draining).
  • Startup probe → disables liveness/readiness until the app is up. Use for slow starters; protects against premature liveness kills.
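
The three rules can be read as a small decision function. The sketch below is a simplified model and not kubelet's actual logic: it assumes each flag is the probe's verdict after failureThreshold has already been applied, and the names ProbeState and kubelet_action are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    startup_passed: bool  # startup probe has succeeded at least once
    liveness_ok: bool     # current liveness verdict
    readiness_ok: bool    # current readiness verdict


def kubelet_action(state: ProbeState) -> str:
    """Return the outcome the rules above describe for one container."""
    if not state.startup_passed:
        # Liveness and readiness are held off until startup succeeds.
        return "wait: liveness/readiness disabled"
    if not state.liveness_ok:
        return "restart container"
    if not state.readiness_ok:
        return "remove pod from Service endpoints"
    return "serve traffic"


# A pod that is alive but temporarily not ready is not restarted.
print(kubelet_action(ProbeState(True, True, False)))
# prints "remove pod from Service endpoints"
```

Note the ordering: a failing readiness check never leads to a restart, which is what makes it the safe place for dependency checks.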

Common anti-pattern

Pointing liveness at a /health endpoint that checks DB connectivity. When the DB hiccups, every pod restarts simultaneously, stampedes the DB on reconnect, and turns a 10-second blip into a 10-minute outage. Liveness should check process health, not dependency health.
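
One way to keep the two concerns apart is to serve them from separate endpoints. The sketch below uses only the Python standard library; the paths /livez and /readyz, the port, and the db_ping helper are assumptions chosen for illustration, not something the original setup prescribes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def db_ping() -> bool:
    """Hypothetical dependency check; swap in a real query with a short timeout."""
    return True


def check_liveness() -> Tuple[int, str]:
    # Process-only check: if this code runs, the process is responsive.
    return 200, "alive"


def check_readiness(ping: Callable[[], bool] = db_ping) -> Tuple[int, str]:
    # Dependency checks belong here: a failure sheds traffic, it does not restart.
    try:
        ok = bool(ping())
    except Exception:
        ok = False
    return (200, "ready") if ok else (503, "database unavailable")


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/livez":
            status, body = check_liveness()
        elif self.path == "/readyz":
            status, body = check_readiness()
        else:
            status, body = 404, "not found"
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the probe server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling serve() starts the listener. With this split, a DB blip makes /readyz return 503 while /livez keeps returning 200, so pods drop out of rotation instead of restarting together.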

Real-world example

From production

A retail platform suffered a 40-minute outage during Black Friday. The Postgres primary failed over (15 s blip). Every app pod's liveness probe checked the DB, all failed, kubelet restarted them in lock-step. On restart they hammered the still-recovering replica. Fix: liveness now only checks "JVM is responsive"; readiness checks dependencies. Next failover: 30 s of degraded reads, zero restarts.

Interview questions

2 senior-level
Q1. What's the difference between liveness and readiness?

Liveness answers 'is this process alive?' — failure means restart. Readiness answers 'should I get traffic right now?' — failure means temporary removal from the Service. Liveness should be a cheap process check; readiness can check dependencies.

Q2. Why use a startup probe?

To keep liveness and readiness probes from running until the app has finished initializing. Without it, a JVM that needs 60 seconds to start can be killed by a liveness probe whose initialDelaySeconds is 30, as soon as failureThreshold consecutive checks fail. The startup probe holds the other probes off until it succeeds.
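
The arithmetic behind that answer can be sketched as follows. The 60-second startup and the 30-second initial delay come from the answer above; periodSeconds=10 and failureThreshold=3 are assumed values (they match the Kubernetes defaults), and the startup probe settings are one illustrative choice. Both formulas are approximations that ignore probe timeouts and scheduling jitter.

```python
def first_possible_liveness_kill(initial_delay_s: int, period_s: int,
                                 failure_threshold: int) -> int:
    """Approximate seconds after container start at which consecutive
    liveness failures reach failureThreshold, if every check fails."""
    return initial_delay_s + (failure_threshold - 1) * period_s


def startup_budget(period_s: int, failure_threshold: int) -> int:
    """Approximate time a startup probe allows before the container is killed."""
    return period_s * failure_threshold


APP_STARTUP_S = 60  # startup time from the answer above

# Liveness only: checks fail at roughly 30 s, 40 s and 50 s.
kill_at = first_possible_liveness_kill(initial_delay_s=30, period_s=10,
                                       failure_threshold=3)
print(kill_at, kill_at < APP_STARTUP_S)  # prints "50 True": killed before it is up

# With a startup probe (assumed periodSeconds=5, failureThreshold=30).
budget = startup_budget(period_s=5, failure_threshold=30)
print(budget, budget > APP_STARTUP_S)  # prints "150 True": enough headroom
```

Sizing the budget as periodSeconds × failureThreshold lets the liveness probe itself stay tight once the app is running.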

Common mistakes

  • Liveness probe that checks downstream dependencies.

  • Same endpoint for liveness and readiness.

  • An overly aggressive (too low) failureThreshold, causing flapping during transient network blips.
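
To see why a low failureThreshold flaps, the sketch below replays a made-up probe history against two thresholds and counts how often the pod would be marked unready. It is a simplified model: it assumes the pod starts ready, and both the history and the function name count_unready_transitions are hypothetical.

```python
from typing import Iterable


def count_unready_transitions(results: Iterable[bool], failure_threshold: int,
                              success_threshold: int = 1) -> int:
    """Count ready -> unready flips for a sequence of probe results."""
    ready = True  # assumption: the pod starts out ready
    fails = passes = 0
    transitions = 0
    for ok in results:
        if ok:
            passes += 1
            fails = 0
            if not ready and passes >= success_threshold:
                ready = True
        else:
            fails += 1
            passes = 0
            if ready and fails >= failure_threshold:
                ready = False
                transitions += 1
    return transitions


# Hypothetical history: short failures caused by transient network blips.
history = [True, False, True, True, False, True, False, False, True, True]

print(count_unready_transitions(history, failure_threshold=1))  # prints 3
print(count_unready_transitions(history, failure_threshold=3))  # prints 0
```

The same blips that flip the pod three times at a threshold of 1 are absorbed entirely at a threshold of 3, at the cost of slower detection of a real failure.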

Trade-offs

  • Permissive probes hide real failures; strict probes cause restart storms. Tune based on actual MTTR data.
