Behavioral / STAR
Behavioral / STAR/senior/freq 5/5

STAR: Handling a Production Incident

Show structured thinking under pressure: calm triage, clear comms, blameless retro. Senior signal is ownership without heroics.

starincidentleadership

Deep dive

Structure

Situation — Context: scope, severity, who was impacted, blast radius. Task — Your specific role and what success looked like. Action — Decisions, comms, mitigation, rollback. Highlight tradeoffs you weighed. Result — Time to mitigate, customer impact, what changed permanently.

What interviewers listen for

  • Did you stabilize before investigating?
  • Were comms proportional (status page, exec brief)?
  • Did the retro produce a systemic fix, not just "we'll be more careful"?
  • Could you articulate what you'd do differently?

Real-world example

From production

"During a deploy at 14:00, our checkout service started returning 503 to ~40% of traffic. I declared a Sev-2, rolled back the deploy within 3 minutes — that was the call: stabilize first, root-cause later. I assigned an incident commander role to a peer so I could focus on the technical mitigation. We posted a status update at 14:08. Postmortem found the new build had a misconfigured connection pool. We added: pre-deploy contract tests for pool sizing, automated canary analysis on 503 rate, and a runbook for connection-pool incidents. Three months later, an almost identical pattern was caught automatically in canary, never reached users."

Interview questions

2 senior-level
Q1Tell me about a production incident you led.

Use STAR. Lead with what you did, not what 'the team did'. Be specific about timing, decisions, and tradeoffs. End with the systemic improvement that came out of the retro — that's the senior signal.

Q2How do you write a postmortem?

Blameless framing. Sections: timeline (UTC), impact (users/duration/$), root cause, contributing factors, what went well, what didn't, action items with owners and dates. Publish broadly; track action items to completion.

Common mistakes

  • Heroic 'I worked 36 hours straight' — that's a process failure, not a brag.

  • Vague about what you specifically did.

  • Action items that are 'be more careful' instead of systemic.

Trade-offs

  • Fast rollback vs. preserving state for diagnosis — usually rollback first, snapshot logs and metrics.

Related