▶ What are the three pillars of observability?
Metrics (quantitative data: counters, gauges, histograms for throughput/latency/errors), Logs (events with context: structured JSON for debugging), and Traces (request flow across services: distributed tracing shows latency per span). Together they answer: Is the system healthy? Why is it slow? What failed? Metrics detect anomalies, logs provide context, traces show causality. Focusing on only one pillar (e.g., just metrics) leaves blind spots.
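As an illustrative sketch (all field names and values here are made up, not from any specific vendor or SDK), the same failed request might surface in each pillar like this:

```python
import json
import time

# One failed checkout request, seen through each of the three pillars.

# Metric: a counter dimension you increment and aggregate for anomaly detection.
metric = {"name": "http.requests.total",
          "labels": {"route": "/checkout", "status": "500"}, "value": 1}

# Log: a structured event with enough context to debug this one request.
log = {"ts": time.time(), "level": "error",
       "msg": "payment gateway timeout",
       "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}

# Trace span: where the time went inside the request, joined by the same trace ID.
span = {"trace_id": log["trace_id"], "name": "charge_card",
        "duration_ms": 9800, "status": "error"}

print(json.dumps({"metric": metric, "log": log, "span": span}, indent=2))
```

The shared `trace_id` is what lets you pivot from an anomalous metric to the logs and span that explain it.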
▶ Traces vs logs: what's the difference, and when do I use which?
Logs are point-in-time events (single service, single timestamp). Traces follow a request across multiple services, correlating events via trace IDs. Logs answer 'what happened on this server?' Traces answer 'why did this user's request take 10 seconds?' Use logs for: application errors, security events, state changes. Use traces for: performance investigation, request flow visualization, latency attribution. In practice: logs feed into trace context (add trace ID to every log), and traces contain log references.
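A minimal sketch of the "add the trace ID to every log line" practice, using only the standard library (the trace ID is a made-up value; in a real service it would come from the active span context):

```python
import json
import logging

# Hypothetical trace ID; normally read from the current span context.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

class JsonWithTrace(logging.Formatter):
    """Emit each record as structured JSON carrying the current trace ID."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": TRACE_ID,  # lets log search pivot to the full trace
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonWithTrace())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted")
```

With this in place, any error log can be turned into a trace query, and any slow trace can be joined back to its logs.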
▶ How do I define SLIs (Service Level Indicators) and SLOs (Objectives)?
SLI = a measurable indicator of reliability (e.g., '% of requests returning 200 in <500ms'). SLO = the target for that SLI (e.g., '99.9% of requests'). Start by asking what the user cares about: availability, latency, error rate, freshness? Define the SLI first (measure it), then set a realistic SLO (usually 1-2% below current performance, so it's achievable). Example: SLI = successful_requests / total_requests, SLO = 99.5%. Use SLOs to drive alerting: even if you track ten SLOs, alert only when you're burning the error budget, not on every anomaly.
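The arithmetic can be made concrete. A sketch of error-budget burn for the 99.5% example above (the request counts are illustrative):

```python
def error_budget_burn(successes: int, total: int, slo: float) -> float:
    """Fraction of the error budget consumed so far (1.0 = fully burned)."""
    sli = successes / total   # e.g. successful_requests / total_requests
    budget = 1.0 - slo        # a 99.5% SLO leaves a 0.5% error budget
    return (1.0 - sli) / budget

# 100,000 requests this window, 300 of them failed:
burn = error_budget_burn(successes=99_700, total=100_000, slo=0.995)
print(f"{burn:.0%} of the error budget is burned")  # 60%: worth watching, not yet paging
```

Alerting on the *rate* of burn (e.g., "we'll exhaust the budget within 4 hours at this pace") is what keeps ten SLOs from producing ten noisy alerts.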
▶ How do I instrument a service with OpenTelemetry?
Install the SDK for your language, add auto-instrumentation libraries (they hook into common frameworks automatically), create spans for business logic (start_span('payment_processing')), and set span attributes (user_id, amount, status). Use trace propagators (W3C Trace Context) to pass trace IDs through headers. Ship metrics via the SDK; an agent (the otel-collector) collects them and forwards them to backends (Datadog, Grafana, etc.). The tricky part: auto-instrumentation is free, but custom spans require code changes. Start with auto-instrumentation, then add custom spans for critical business operations.
▶ How do I manage observability costs?
Observability data (especially traces) is expensive at scale. Strategies: (1) Sampling: capture 1% of traces at high traffic, 100% at low traffic. (2) Head-based sampling (decide at the request's entry point) vs. tail-based (decide after the trace completes, so error traces can always be kept). (3) Cardinality controls: limit distinct values per dimension (don't add user ID to every metric). (4) Retention policies: keep raw traces 3 days, aggregates 1 year. (5) Capture detailed traces only on revenue-critical paths. Cost example: 1M requests/day unsampled = ~100GB of traces/month; 1% sampling = ~1GB/month. Budget: start at 1-2% of infrastructure cost and optimize from there.
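The cost example works out as follows (the ~3.5 KB average trace size is an assumption chosen to match the ~100 GB figure):

```python
REQUESTS_PER_DAY = 1_000_000
AVG_TRACE_BYTES = 3_500      # assumed average trace size
DAYS_PER_MONTH = 30

def monthly_trace_gb(sample_rate: float) -> float:
    """Trace volume per month in GB at a given head-sampling rate."""
    traces = REQUESTS_PER_DAY * DAYS_PER_MONTH * sample_rate
    return traces * AVG_TRACE_BYTES / 1e9

print(monthly_trace_gb(1.0))   # unsampled: ~105 GB/month
print(monthly_trace_gb(0.01))  # 1% sampling: ~1 GB/month
```

The same function makes it easy to see why cardinality and retention matter more than raw request rate once sampling is in place.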
▶ What's the difference between monitoring and observability?
Monitoring = checking known failure modes (alert if CPU >80%, disk >90%, response time >1s). Observability = investigating unknown-unknowns (why is latency spiking? why are timeouts increasing?). Monitoring is reactive; observability is exploratory. You need both: monitoring keeps the lights on (alerts), observability helps you debug when they flicker. Observability enables 'ask any question' β metrics, logs, traces are queryable without predefined dashboards.
▶ How do I propagate trace context across services?
Use W3C Trace Context standard headers (traceparent + tracestate). When Service A calls Service B via HTTP, inject traceparent header with trace ID and parent span ID. Service B extracts it and creates child spans under the same trace. Frameworks (Spring, Django, Node) do this automatically with auto-instrumentation. For async/message queues: encode trace ID in message headers (Kafka, RabbitMQ headers). Cost: ~50 bytes per message, negligible. Without propagation: traces break at service boundaries, defeating the entire purpose.
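A sketch of building and parsing the W3C `traceparent` header with only the standard library (real SDKs do this for you via propagators):

```python
import secrets

def make_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)      # 32 hex chars
    span_id = secrets.token_hex(8)        # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # 01 = sampled flag

def parse_traceparent(header: str):
    """Extract (trace_id, parent_span_id) so a child span can join the trace."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id

header = make_traceparent()                    # Service A injects this into the request
trace_id, parent = parse_traceparent(header)   # Service B extracts it
```

Service B then creates its spans with the extracted `trace_id` and uses `parent` as the parent span ID, which is exactly what keeps the trace unbroken across the service boundary.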