Observability: Logs, Metrics, Traces
Metrics for what, logs for why, traces for where. Cardinality kills your bill — design label schemas like you design DB schemas.
Deep dive
The three pillars + one
- Metrics (Prometheus): cheap aggregates, ideal for SLOs/alerts. Beware high-cardinality labels (user_id, request_id); they explode storage.
- Logs (Loki, ELK): structured JSON, sampled at high volume. Logs are for human investigation; don't query them in alerts.
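- Traces (OTel + Tempo/Jaeger): request-scoped causality. Sample smartly: head sampling decides when a request starts and drops most traces blindly, while tail sampling decides after the trace completes, so it can keep the errors.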
- + Profiles (Pyroscope, Parca): always-on CPU/heap profiles let you ask "why is this slow?" weeks after the fact.
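The cardinality warning in the Metrics bullet comes down to simple arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. A rough sketch, assuming a hypothetical HTTP request counter whose label names and value counts are invented for illustration:

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each distinct combination of label values is its own series, so the
    worst case is the product of the per-label cardinalities.
    """
    return prod(label_cardinalities.values())


# Hypothetical label schema: every label has a small, bounded value set.
bounded = {"method": 5, "status_class": 5, "route": 40}

# Same schema plus an unbounded identifier (assumed 100k active users).
with_user_id = {**bounded, "user_id": 100_000}

print(estimated_series(bounded))       # 1000
print(estimated_series(with_user_id))  # 100000000
```

One unbounded label multiplies everything else, which is why identifiers such as user_id and request_id belong in logs or trace attributes rather than metric labels.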
SLOs over alerts
Define SLOs (e.g., 99.9% of requests < 300 ms). Alert on error budget burn rate, not raw thresholds. A spike that doesn't burn budget shouldn't page anyone.
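To make "burn rate" concrete: the error budget is the fraction of requests the SLO allows to be bad, and the burn rate is how many times faster than that allowance the service is currently failing. A minimal sketch, assuming a "bad" request is one that failed or took longer than the latency target; the function names and request counts are illustrative, not taken from any library:

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to be bad, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed bad-request ratio divided by the allowed ratio.

    1.0 means the budget lasts exactly the SLO window; above 1.0 it runs
    out early; below 1.0 there is budget to spare.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical window: 50 of 10,000 requests failed or exceeded 300 ms.
rate = burn_rate(bad=50, total=10_000, slo=0.999)
print(round(rate, 2))                      # 5.0
print(round(days_to_exhaustion(rate), 2))  # 6.0
```

A burn rate of 5 means a 30-day budget would be spent in about 6 days, which is worth acting on; a one-off spike barely moves the rate over a longer window and so never pages.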
Real-world example
From production
An alerting rule fired on "any 5xx in 1 minute". On-call burned out from 30 pages/week, mostly noise. Replaced with a multi-window burn-rate alert against a 99.5% / 30-day SLO. Pages dropped to 2/week and were all real incidents.
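The multi-window idea in this example can be sketched as a check that pages only when both a long and a short window are burning budget fast. The source does not say which windows or thresholds were used; the 1-hour/5-minute pair and the 14.4 threshold below are assumptions taken from common SRE practice (14.4 corresponds to spending 2% of a 30-day budget in one hour).

```python
SLO = 0.995  # the 99.5% / 30-day SLO from the example


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    threshold: float = 14.4,  # assumed burn-rate threshold, not from the source
    slo: float = SLO,
) -> bool:
    """Page only if BOTH windows burn budget faster than the threshold.

    The long window (assumed 1 h) proves the problem is significant;
    the short window (assumed 5 min) proves it is still happening.
    """
    budget = 1.0 - slo
    long_burn = long_window_error_ratio / budget
    short_burn = short_window_error_ratio / budget
    return long_burn >= threshold and short_burn >= threshold


# Brief blip: the last 5 minutes look bad, but the hour as a whole is fine.
print(should_page(0.001, 0.20))  # False

# Sustained outage: 10% errors over the hour and still failing now.
print(should_page(0.10, 0.12))   # True

# Already recovered: the hour looks bad but the last 5 minutes are clean.
print(should_page(0.10, 0.0))    # False
```

Under these assumptions a handful of 5xx responses in one minute never pages, which is the difference between the old rule and the new one.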
Interview questions
2 senior-level
Q1. When would you use a trace vs a log?
Traces show the path of one request across services with timing — best for 'where is the latency?'. Logs explain what one service did — best for 'why did this fail?'. Correlate them via trace ID in every log line.
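One way to get a trace ID into every log line is a logging filter that stamps each record from request-scoped context. This is a standard-library-only sketch: the `current_trace_id` context variable, the logger name, and the sample ID are hypothetical stand-ins for whatever a tracing SDK would provide in a real service.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID; a tracing SDK would set this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Render each record as one structured JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "msg": record.getMessage(),
                "trace_id": record.trace_id,
            }
        )


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # illustrative service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


stream = io.StringIO()
log = build_logger(stream)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # made-up trace ID
log.info("payment declined")
print(stream.getvalue().strip())
```

With the ID in both places, a slow span found in the trace view leads straight to the log lines that service wrote for that request.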
Q2. What's wrong with alerting on raw error rate?
It pages on noise that doesn't impact users. Alert on SLO burn rate instead — that ties pages directly to user-visible impact and gives you an error budget to spend on deploys/experiments.
Common mistakes
Putting request_id as a Prometheus label.
Logging at INFO in hot loops: the log volume inflates the bill and the signal gets lost in noise.
Alerts without runbooks.
Trade-offs
Full tracing is expensive; head + tail sampling is the pragmatic middle.
OpenTelemetry standardizes instrumentation at the cost of an extra collector hop.
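The head + tail sampling trade-off above can be illustrated with a toy decision function. This is a simplified sketch, not how any particular collector implements it: the span dictionaries, the 1% baseline rate, and the 300 ms slow threshold are assumptions made for illustration.

```python
import hashlib


def head_keep(trace_id: str, rate: float) -> bool:
    """Head sampling: decide from the trace ID alone, before any span exists.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return value < int(rate * 2**64)


def tail_keep(spans: list[dict], slow_ms: float) -> bool:
    """Tail sampling: decide after the trace is complete.

    Keeps any trace containing an error or a slow span, at the cost of
    buffering every span until the trace finishes.
    """
    return any(s["error"] or s["duration_ms"] >= slow_ms for s in spans)


def keep_trace(
    trace_id: str,
    spans: list[dict],
    rate: float = 0.01,      # assumed baseline: keep about 1% of healthy traces
    slow_ms: float = 300.0,  # assumed latency threshold
) -> bool:
    """Keep all interesting traces plus a small random baseline of the rest."""
    return tail_keep(spans, slow_ms) or head_keep(trace_id, rate)


healthy = [{"error": False, "duration_ms": 12.0}]
failed = [{"error": False, "duration_ms": 12.0}, {"error": True, "duration_ms": 40.0}]

print(keep_trace("trace-a", failed))             # True
print(keep_trace("trace-a", healthy, rate=0.0))  # False
```

The baseline keeps enough healthy traces to show what normal looks like, while the tail policy makes sure the rare failing request is never the one that was dropped.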