▶ What are the three pillars of observability?
Metrics (quantitative data: counters, gauges, histograms for throughput/latency/errors), Logs (events with context: structured JSON for debugging), and Traces (request flow across services: distributed tracing shows latency per span). Together they answer: Is the system healthy? Why is it slow? What failed? Metrics detect anomalies, logs provide context, traces show causality. Focusing on only one pillar (e.g., just metrics) leaves blind spots.
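As an illustrative sketch (all field names and values here are made up, not from any specific vendor or SDK), the same failed request might surface in each pillar like this:

```python
import json
import time

# One failed checkout request, seen through each of the three pillars.

# Metric: a counter dimension you increment and aggregate for anomaly detection.
metric = {"name": "http.requests.total",
          "labels": {"route": "/checkout", "status": "500"}, "value": 1}

# Log: a structured event with enough context to debug this one request.
log = {"ts": time.time(), "level": "error",
       "msg": "payment gateway timeout",
       "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}

# Trace span: where the time went inside the request, joined by the same trace ID.
span = {"trace_id": log["trace_id"], "name": "charge_card",
        "duration_ms": 9800, "status": "error"}

print(json.dumps({"metric": metric, "log": log, "span": span}, indent=2))
```

The shared `trace_id` is what lets you pivot from an anomalous metric to the logs and span that explain it.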
▶ Traces vs logs: what's the difference, and when do I use which?
Logs are point-in-time events (single service, single timestamp). Traces follow a request across multiple services, correlating events via trace IDs. Logs answer 'what happened on this server?' Traces answer 'why did this user's request take 10 seconds?' Use logs for: application errors, security events, state changes. Use traces for: performance investigation, request flow visualization, latency attribution. In practice: logs feed into trace context (add trace ID to every log), and traces contain log references.
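A minimal sketch of the "add the trace ID to every log line" practice, using only the standard library (the trace ID is a made-up value; in a real service it would come from the active span context):

```python
import json
import logging

# Hypothetical trace ID; normally read from the current span context.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

class JsonWithTrace(logging.Formatter):
    """Emit each record as structured JSON carrying the current trace ID."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": TRACE_ID,  # lets log search pivot to the full trace
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonWithTrace())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted")
```

With this in place, any error log can be turned into a trace query, and any slow trace can be joined back to its logs.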
▶ How do I define SLIs (Service Level Indicators) and SLOs (Objectives)?
SLI = a measurable indicator of reliability (e.g., '% of requests returning 200 in <500ms'). SLO = the target for that SLI (e.g., '99.9% of requests'). Start by asking what the user cares about: availability, latency, error rate, freshness? Define the SLI first (measure it), then set a realistic SLO (usually 1-2% below current performance, so it's achievable). Example: SLI = successful_requests / total_requests, SLO = 99.5%. Use SLOs to drive alerting: even if you track ten SLOs, alert only when you're burning the error budget, not on every anomaly.
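The arithmetic can be made concrete. A sketch of error-budget burn for the 99.5% example above (the request counts are illustrative):

```python
def error_budget_burn(successes: int, total: int, slo: float) -> float:
    """Fraction of the error budget consumed so far (1.0 = fully burned)."""
    sli = successes / total   # e.g. successful_requests / total_requests
    budget = 1.0 - slo        # a 99.5% SLO leaves a 0.5% error budget
    return (1.0 - sli) / budget

# 100,000 requests this window, 300 of them failed:
burn = error_budget_burn(successes=99_700, total=100_000, slo=0.995)
print(f"{burn:.0%} of the error budget is burned")  # 60%: worth watching, not yet paging
```

Alerting on the *rate* of burn (e.g., "we'll exhaust the budget within 4 hours at this pace") is what keeps ten SLOs from producing ten noisy alerts.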
▶ How do I instrument a service with OpenTelemetry?
Install the SDK for your language, add auto-instrumentation libraries (they hook into common frameworks automatically), create spans for business logic (start_span('payment_processing')), and set span attributes (user_id, amount, status). Use trace propagators (W3C Trace Context) to pass trace IDs through headers. Ship metrics via the SDK; an agent (the otel-collector) collects them and forwards them to backends (Datadog, Grafana, etc.). The tricky part: auto-instrumentation is free, but custom spans require code changes. Start with auto-instrumentation, then add custom spans for critical business operations.
▶ How do I manage observability costs?
Observability data (especially traces) is expensive at scale. Strategies: (1) Sampling: capture 1% of traces at high traffic, 100% at low traffic. (2) Head-based sampling (decide at the request's entry point) vs. tail-based (decide after the trace completes, so error traces can always be kept). (3) Cardinality controls: limit distinct values per dimension (don't add user ID to every metric). (4) Retention policies: keep raw traces 3 days, aggregates 1 year. (5) Capture detailed traces only on revenue-critical paths. Cost example: 1M requests/day unsampled = ~100GB of traces/month; 1% sampling = ~1GB/month. Budget: start at 1-2% of infrastructure cost and optimize from there.
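The cost example works out as follows (the ~3.5 KB average trace size is an assumption chosen to match the ~100 GB figure):

```python
REQUESTS_PER_DAY = 1_000_000
AVG_TRACE_BYTES = 3_500      # assumed average trace size
DAYS_PER_MONTH = 30

def monthly_trace_gb(sample_rate: float) -> float:
    """Trace volume per month in GB at a given head-sampling rate."""
    traces = REQUESTS_PER_DAY * DAYS_PER_MONTH * sample_rate
    return traces * AVG_TRACE_BYTES / 1e9

print(monthly_trace_gb(1.0))   # unsampled: ~105 GB/month
print(monthly_trace_gb(0.01))  # 1% sampling: ~1 GB/month
```

The same function makes it easy to see why cardinality and retention matter more than raw request rate once sampling is in place.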
▶ What's the difference between monitoring and observability?
Monitoring = checking known failure modes (alert if CPU >80%, disk >90%, response time >1s). Observability = investigating unknown-unknowns (why is latency spiking? why are timeouts increasing?). Monitoring is reactive; observability is exploratory. You need both: monitoring keeps the lights on (alerts), observability helps you debug when they flicker. Observability enables 'ask any question' β metrics, logs, traces are queryable without predefined dashboards.
▶ How do I propagate trace context across services?
Use W3C Trace Context standard headers (traceparent + tracestate). When Service A calls Service B via HTTP, inject traceparent header with trace ID and parent span ID. Service B extracts it and creates child spans under the same trace. Frameworks (Spring, Django, Node) do this automatically with auto-instrumentation. For async/message queues: encode trace ID in message headers (Kafka, RabbitMQ headers). Cost: ~50 bytes per message, negligible. Without propagation: traces break at service boundaries, defeating the entire purpose.
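A sketch of building and parsing the W3C `traceparent` header with only the standard library (real SDKs do this for you via propagators):

```python
import secrets

def make_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)      # 32 hex chars
    span_id = secrets.token_hex(8)        # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # 01 = sampled flag

def parse_traceparent(header: str):
    """Extract (trace_id, parent_span_id) so a child span can join the trace."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id

header = make_traceparent()                    # Service A injects this into the request
trace_id, parent = parse_traceparent(header)   # Service B extracts it
```

Service B then creates its spans with the extracted `trace_id` and uses `parent` as the parent span ID, which is exactly what keeps the trace unbroken across the service boundary.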