STAR: Handling a Production Incident
Show structured thinking under pressure: calm triage, clear comms, blameless retro. Senior signal is ownership without heroics.
Deep dive
Structure
Situation — Context: scope, severity, who was impacted, blast radius. Task — Your specific role and what success looked like. Action — Decisions, comms, mitigation, rollback. Highlight tradeoffs you weighed. Result — Time to mitigate, customer impact, what changed permanently.
What interviewers listen for
- Did you stabilize before investigating?
- Were comms proportional (status page, exec brief)?
- Did the retro produce a systemic fix, not just "we'll be more careful"?
- Could you articulate what you'd do differently?
Real-world example
From production"During a deploy at 14:00, our checkout service started returning 503 to ~40% of traffic. I declared a Sev-2, rolled back the deploy within 3 minutes — that was the call: stabilize first, root-cause later. I assigned an incident commander role to a peer so I could focus on the technical mitigation. We posted a status update at 14:08. Postmortem found the new build had a misconfigured connection pool. We added: pre-deploy contract tests for pool sizing, automated canary analysis on 503 rate, and a runbook for connection-pool incidents. Three months later, an almost identical pattern was caught automatically in canary, never reached users."
Interview questions
2 senior-levelQ1Tell me about a production incident you led.▾
Use STAR. Lead with what you did, not what 'the team did'. Be specific about timing, decisions, and tradeoffs. End with the systemic improvement that came out of the retro — that's the senior signal.
Q2How do you write a postmortem?▾
Blameless framing. Sections: timeline (UTC), impact (users/duration/$), root cause, contributing factors, what went well, what didn't, action items with owners and dates. Publish broadly; track action items to completion.
Common mistakes
Heroic 'I worked 36 hours straight' — that's a process failure, not a brag.
Vague about what you specifically did.
Action items that are 'be more careful' instead of systemic.
Trade-offs
Fast rollback vs. preserving state for diagnosis — usually rollback first, snapshot logs and metrics.