Kubernetes & DevOps
Kubernetes & DevOps/senior/freq 5/5

Observability: Logs, Metrics, Traces

Metrics for what, logs for why, traces for where. Cardinality kills your bill — design label schemas like you design DB schemas.

observabilitymonitoringsre

Deep dive

The three pillars + one

  • Metrics (Prometheus): cheap aggregates, ideal for SLOs/alerts. Beware high-cardinality labels (user_id, request_id) — they explode storage.
  • Logs (Loki, ELK): structured JSON, sampled at high volume. Logs are for human investigation; don't query them in alerts.
  • Traces (OTel + Tempo/Jaeger): request-scoped causality. Sample smartly — head sampling drops most, tail sampling keeps errors.
  • + Profiles (Pyroscope, Parca): always-on CPU/heap profiles let you ask "why is this slow?" weeks after the fact.

SLOs over alerts

Define SLOs (e.g., 99.9% of requests < 300 ms). Alert on error budget burn rate, not raw thresholds. A spike that doesn't burn budget shouldn't page anyone.

Real-world example

From production

An alerting rule fired on "any 5xx in 1 minute". On-call burned out from 30 pages/week, mostly noise. Replaced with a multi-window burn-rate alert against a 99.5% / 30-day SLO. Pages dropped to 2/week and were all real incidents.

Interview questions

2 senior-level
Q1When would you use a trace vs a log?

Traces show the path of one request across services with timing — best for 'where is the latency?'. Logs explain what one service did — best for 'why did this fail?'. Correlate them via trace ID in every log line.

Q2What's wrong with alerting on raw error rate?

It pages on noise that doesn't impact users. Alert on SLO burn rate instead — that ties pages directly to user-visible impact and gives you an error budget to spend on deploys/experiments.

Common mistakes

  • Putting request_id as a Prometheus label.

  • Logging at INFO in hot loops — log volume bill, signal lost in noise.

  • Alerts without runbooks.

Trade-offs

  • Full tracing is expensive; head + tail sampling is the pragmatic middle.

  • OpenTelemetry standardizes instrumentation at the cost of an extra collector hop.

Related