Kubernetes & DevOps/senior/freq 5/5

Observability: Logs, Metrics, Traces

Metrics for what, logs for why, traces for where. Cardinality kills your bill — design label schemas like you design DB schemas.

observability, monitoring, sre

Deep dive

The three pillars + one

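Metrics (Prometheus): cheap aggregates, ideal for SLOs and alerts. Beware high-cardinality labels (user_id, request_id): every distinct label value creates a new time series, so unbounded labels blow up storage and memory.
Logs (Loki, ELK): structured JSON, sampled at high volume. Logs are for human investigation; drive alerts from metrics rather than from log queries.
Traces (OTel + Tempo/Jaeger): request-scoped causality. Sample smartly: head sampling decides when a trace starts, before the outcome is known, and drops most traffic; tail sampling decides after the trace completes, so it can keep errors and slow requests.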
+ Profiles (Pyroscope, Parca): always-on CPU/heap profiles let you ask "why is this slow?" weeks after the fact.
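
To make the cardinality warning concrete, here is a rough back-of-the-envelope sketch in Python. The label names and value counts are invented for illustration; the only claim is that the worst-case number of time series for one metric is the product of the distinct values of each label.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label schema with bounded value sets.
bounded = {"method": 5, "status_class": 5, "route": 40}

# Same schema plus one unbounded label (assumed 100,000 active users).
unbounded = {**bounded, "user_id": 100_000}

print(series_count(bounded))    # 1000
print(series_count(unbounded))  # 100000000
```

One unbounded label turns a thousand series into a hundred million, which is why per-user or per-request identifiers belong in logs and traces rather than in metric labels.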

SLOs over alerts

Define SLOs (e.g., 99.9% of requests < 300 ms). Alert on error budget burn rate, not raw thresholds. A spike that doesn't burn budget shouldn't page anyone.
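
The burn-rate idea can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function names are made up, the request counts are invented, and a 30-day SLO window is assumed. Burn rate is the observed bad-request ratio divided by the error budget, so 1.0 means the budget lasts exactly the full window.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to miss the objective (0.001 for 99.9%)."""
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Time to spend the whole budget if the current rate holds (assumed 30-day window)."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# "bad" = requests that took 300 ms or longer, matching the example SLO.
rate = burn_rate(bad=50, total=10_000, slo_target=0.999)
print(f"burn rate {rate:.1f}x, budget gone in {hours_until_exhausted(rate):.0f} h")
```

With these invented numbers, 0.5% slow requests against a 0.1% budget is a burn rate of about 5, which would exhaust a 30-day budget in roughly six days if sustained.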

Real-world example

From production

An alerting rule fired on "any 5xx in 1 minute". On-call burned out from 30 pages/week, mostly noise. The rule was replaced with a multi-window burn-rate alert against a 99.5% / 30-day SLO. Pages dropped to 2/week, all of them real incidents.
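
A multi-window burn-rate check might look like the sketch below. The source does not say which windows or thresholds were used, so the 1-hour / 5-minute pair and the 14.4x threshold are assumptions borrowed from the commonly cited Google SRE Workbook recipe (14.4x sustained for one hour consumes 2% of a 30-day budget). The error ratios are invented.

```python
def should_page(
    long_window_error_ratio: float,   # e.g. 5xx ratio over the last 1 h (assumed window)
    short_window_error_ratio: float,  # e.g. 5xx ratio over the last 5 min (assumed window)
    slo_target: float = 0.995,        # the 99.5% SLO from the example
    threshold: float = 14.4,          # assumed burn-rate threshold
) -> bool:
    """Page only if BOTH windows burn budget faster than the threshold."""
    budget = 1.0 - slo_target
    long_burn = long_window_error_ratio / budget
    short_burn = short_window_error_ratio / budget
    # Long window: the problem is significant. Short window: it is still happening.
    return long_burn >= threshold and short_burn >= threshold


scenarios = {
    "brief spike":     (0.01, 0.20),  # loud for 5 min, negligible over the hour
    "sustained burn":  (0.10, 0.12),  # both windows well above 14.4x
    "already over":    (0.10, 0.00),  # hour looks bad, but errors have stopped
}
for name, (long_ratio, short_ratio) in scenarios.items():
    print(f"{name}: page={should_page(long_ratio, short_ratio)}")
```

Only the sustained case pages. The brief spike, which is exactly what "any 5xx in 1 minute" would have fired on, stays quiet because it barely dents the budget over the long window.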

Interview questions

2 senior-level

Q1. When would you use a trace vs a log?

Traces show the path of one request across services with timing — best for 'where is the latency?'. Logs explain what one service did — best for 'why did this fail?'. Correlate them via trace ID in every log line.
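
One way to get a trace ID into every log line, sketched with only the Python standard library. The `current_trace_id` context variable is a hypothetical stand-in for whatever the tracing SDK exposes (in an OpenTelemetry-instrumented service the ID would come from the active span context), and the logger name and message are invented.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the tracing SDK's "current trace" lookup.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log backend can index trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", "-"),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # set once per request
log.info("payment declined")
print(stream.getvalue(), end="")
```

Because the filter reads the ID at log time, application code never passes it explicitly; searching the log backend for a trace ID taken from a slow span then returns every line written while handling that request.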

Q2. What's wrong with alerting on raw error rate?

It pages on noise that doesn't impact users. Alert on SLO burn rate instead — that ties pages directly to user-visible impact and gives you an error budget to spend on deploys/experiments.

Common mistakes

Putting request_id as a Prometheus label.
Logging at INFO in hot loops: log volume drives up the bill and the signal gets lost in noise.
Alerts without runbooks.
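
For the hot-loop case, one possible mitigation is to sample low-severity records while always keeping warnings and errors. The sketch below uses a plain `logging.Filter`; the class names, the 1-in-100 ratio, and the messages are illustrative assumptions, and the counter is not thread-safe.

```python
import logging


class EveryNthFilter(logging.Filter):
    """Pass every WARNING+ record, but only 1 in n lower-severity records."""

    def __init__(self, n: int) -> None:
        super().__init__()
        self.n = n
        self.seen = 0  # simple counter; not thread-safe in this sketch

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        self.seen += 1
        return (self.seen - 1) % self.n == 0  # keep the 1st, (n+1)th, ...


class ListHandler(logging.Handler):
    """Collects messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(EveryNthFilter(n=100))
logger = logging.getLogger("hot_loop_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()
logger.addHandler(handler)

for i in range(1000):
    logger.info("processed item %d", i)  # hot-loop INFO line
logger.warning("queue depth high")        # never sampled away

print(len(handler.messages))  # 11
```

A thousand INFO calls produce ten stored lines plus the one warning. If the per-item count matters, a counter metric is usually a better home for it than a log line.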

Trade-offs

Full tracing is expensive; head + tail sampling is the pragmatic middle.
OpenTelemetry standardizes instrumentation at the cost of an extra collector hop.
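
The difference between the two sampling decisions can be sketched as follows. This only models the decision logic: real tail sampling runs in a collector that buffers spans until a trace finishes, and the hashing scheme, the 300 ms slow threshold, and the 1% baseline rate are assumptions for illustration rather than how any particular tracer implements it.

```python
import hashlib
from dataclasses import dataclass


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start from the ID alone.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same keep/drop verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < rate


@dataclass
class FinishedTrace:
    trace_id: str
    had_error: bool
    duration_ms: float


def tail_sample(
    trace: FinishedTrace,
    slow_ms: float = 300.0,        # assumed "slow" threshold
    baseline_rate: float = 0.01,   # assumed share of healthy traces to keep
) -> bool:
    """Decide after the trace completes, when the outcome is known."""
    if trace.had_error or trace.duration_ms >= slow_ms:
        return True  # always keep the interesting ones
    return head_sample(trace.trace_id, baseline_rate)


traces = [
    FinishedTrace("trace-a", had_error=True, duration_ms=40.0),
    FinishedTrace("trace-b", had_error=False, duration_ms=950.0),
    FinishedTrace("trace-c", had_error=False, duration_ms=35.0),
]
for t in traces:
    print(t.trace_id, "kept by tail sampling:", tail_sample(t, baseline_rate=0.0))
```

With the baseline set to zero, the failed and the slow trace are kept and the healthy one is dropped. Head sampling alone could not make that distinction, because it commits before the error or latency is known.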
