Behavioral / STAR/senior/freq 5/5

STAR: Handling a Production Incident

Show structured thinking under pressure: calm triage, clear comms, blameless retro. Senior signal is ownership without heroics.

starincidentleadership

Deep dive

Structure

Situation — Context: scope, severity, who was impacted, blast radius. Task — Your specific role and what success looked like. Action — Decisions, comms, mitigation, rollback. Highlight tradeoffs you weighed. Result — Time to mitigate, customer impact, what changed permanently.

What interviewers listen for

Did you stabilize before investigating?
Were comms proportional (status page, exec brief)?
Did the retro produce a systemic fix, not just "we'll be more careful"?
Could you articulate what you'd do differently?

Real-world example

From production

"During a deploy at 14:00, our checkout service started returning 503 to ~40% of traffic. I declared a Sev-2, rolled back the deploy within 3 minutes — that was the call: stabilize first, root-cause later. I assigned an incident commander role to a peer so I could focus on the technical mitigation. We posted a status update at 14:08. Postmortem found the new build had a misconfigured connection pool. We added: pre-deploy contract tests for pool sizing, automated canary analysis on 503 rate, and a runbook for connection-pool incidents. Three months later, an almost identical pattern was caught automatically in canary, never reached users."

Interview questions

2 senior-level

Q1Tell me about a production incident you led.▾

Use STAR. Lead with what you did, not what 'the team did'. Be specific about timing, decisions, and tradeoffs. End with the systemic improvement that came out of the retro — that's the senior signal.

Q2How do you write a postmortem?▾

Blameless framing. Sections: timeline (UTC), impact (users/duration/$), root cause, contributing factors, what went well, what didn't, action items with owners and dates. Publish broadly; track action items to completion.

Common mistakes

Heroic 'I worked 36 hours straight' — that's a process failure, not a brag.
Vague about what you specifically did.
Action items that are 'be more careful' instead of systemic.

Trade-offs

Fast rollback vs. preserving state for diagnosis — usually rollback first, snapshot logs and metrics.

STAR: Modernizing a Legacy System

Behavioral / STAR

Observability: Logs, Metrics, Traces

Kubernetes & DevOps