
Observability

Understanding system behavior through logs, metrics, and traces

TIER 2 · Tech

Salary impact: +$20k-35k
Time to learn: 6 months
Difficulty: Medium
Careers: 7
AT A GLANCE

Observability is the discipline of understanding system behavior through metrics, logs, and distributed traces (the three pillars). Career path: Practitioner (structured logging, basic metrics, Prometheus, $110-140k) → Specialist (OpenTelemetry instrumentation, SLI/SLO design, trace analysis, $140-180k) → Architect (observability platform design, cardinality management, cost optimization, $180-240k+) over 4-6 months. Used in Site Reliability Engineering (L1+, $120k-195k), DevOps (monitoring infrastructure, $115k-180k), Backend/Platform Engineering (L2+, $115k-195k). Salary premium: $20k-35k above base roles.

What is Observability

Observability is the ability to understand what's happening inside a system by examining its outputs: logs, metrics, and traces (the three pillars). Unlike monitoring (which checks known failure modes), observability enables investigating unknown-unknowns and debugging novel problems in production. Modern distributed systems require observability to understand request flows across services, identify performance bottlenecks, and detect anomalies before users are impacted.

🔧 TOOLS & ECOSYSTEM
OpenTelemetry, Honeycomb, Lightstep, Grafana Tempo, Prometheus, Grafana, Loki, Datadog Observability, Splunk Observability Cloud, Dynatrace, New Relic, tracing libraries (Jaeger, Zipkin)

💰 Salary by region

Region | Junior | Mid | Senior
USA | $110k | $150k | $215k
UK | £68k | £95k | £140k
EU | €72k | €100k | €150k
Canada | C$115k | C$155k | C$225k

❓ FAQ

What are the three pillars of observability?
Metrics (quantitative data: counters, gauges, histograms for throughput/latency/errors), Logs (events with context: structured JSON for debugging), and Traces (request flow across services: distributed tracing shows latency per span). Together they answer: Is the system healthy? Why is it slow? What failed? Metrics detect anomalies, logs provide context, traces show causality. Focusing on only one pillar (e.g., just metrics) leaves blind spots.
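To make the metrics pillar concrete, here is a minimal sketch using the Python prometheus_client library; the metric names, label, and port are illustrative assumptions, not part of any particular stack described above.

```python
# Minimal sketch of the metrics pillar with prometheus_client.
# Metric names, the "status" label, and port 8000 are illustrative.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    start = time.time()
    status = "200" if random.random() > 0.01 else "500"  # simulate a ~1% error rate
    REQUESTS.labels(status=status).inc()
    LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(0.1)
```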
Traces vs. logs: what's the difference, and when do I use which?
Logs are point-in-time events (single service, single timestamp). Traces follow a request across multiple services, correlating events via trace IDs. Logs answer 'what happened on this server?' Traces answer 'why did this user's request take 10 seconds?' Use logs for: application errors, security events, state changes. Use traces for: performance investigation, request flow visualization, latency attribution. In practice: logs feed into trace context (add trace ID to every log), and traces contain log references.
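A minimal sketch of the "add trace ID to every log" idea in Python, assuming an OpenTelemetry SDK is already configured elsewhere; the logger name and log format are illustrative.

```python
# Sketch: stamp every log record with the current OpenTelemetry trace ID so
# logs can be correlated with traces. Assumes an OTel SDK is configured.
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # trace_id is rendered as 32 hex chars; "-" when no span is active
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted")  # this log line now carries the trace ID
```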
How do I define SLIs (Service Level Indicators) and SLOs (Objectives)?
SLI = measurable metric of reliability (e.g., '% of requests returning 200 in <500ms'). SLO = target for the SLI (e.g., '99.9% of requests'). Start by asking: What does the user care about? (availability, latency, error rate, freshness?). Define SLI first (measure it), then set realistic SLO (usually 1-2% below current performance). Example: SLI = (successful_requests / total_requests), SLO = 99.5%. Use SLOs to drive alerting: if you're tracking 10 SLOs, alert only when you're burning the error budget, not on every anomaly.
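To make the error-budget arithmetic concrete, here is a small worked sketch; the request counts are invented for illustration.

```python
# Sketch: compute an availability SLI and error-budget burn for a 99.5% SLO
# over a 30-day window. The request counts are made-up numbers.
total_requests = 10_000_000
successful_requests = 9_970_000          # e.g., 2xx responses within 500 ms

sli = successful_requests / total_requests          # 0.997 -> 99.7%
slo = 0.995

error_budget = 1 - slo                              # 0.5% of requests may fail
errors_allowed = total_requests * error_budget      # 50,000 requests
errors_seen = total_requests - successful_requests  # 30,000 requests
budget_burned = errors_seen / errors_allowed        # 0.6 -> 60% of budget used

print(f"SLI={sli:.3%}, error budget burned={budget_burned:.0%}")
```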
How do I instrument a service with OpenTelemetry?
Install the SDK for your language, add auto-instrumentation libraries (they hook common frameworks automatically), create spans for business logic (start_span('payment_processing')), and set span attributes (user_id, amount, status). Use trace propagators (W3C Trace Context) to pass trace IDs through headers. Ship metrics via SDK; they're collected by agents (otel-collector) and sent to backends (Datadog, Grafana, etc.). Tricky part: instrumentation is free (auto) but custom spans require code. Start with auto, then add custom spans for critical business operations.
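A minimal sketch of a custom span in Python with the OpenTelemetry SDK; it exports to the console for simplicity, where a real deployment would point an OTLP exporter at an otel-collector. The function and attribute names are illustrative.

```python
# Sketch: manual OpenTelemetry instrumentation of a business operation.
# ConsoleSpanExporter is used for simplicity; swap in an OTLP exporter in practice.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace.get_tracer(__name__)

def process_payment(user_id: str, amount: float) -> str:
    # Custom span around a revenue-critical operation
    with tracer.start_as_current_span("payment_processing") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("payment.amount", amount)
        status = "approved"  # ... call the payment provider here ...
        span.set_attribute("payment.status", status)
        return status

process_payment("user-42", 19.99)
```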
How do I manage observability costs?
Observability data (especially traces) is expensive at scale. Strategies: (1) Sampling: capture 1% of traces at high traffic, 100% at low traffic. (2) Head-based sampling (decide at the entry point) vs. tail-based (decide after the trace completes, so you can keep every trace containing errors or slow spans). (3) Cardinality controls: limit distinct values per dimension (don't add user ID to every metric). (4) Retention policies: keep raw traces 3 days, aggregates 1 year. (5) Only instrument revenue-critical paths for detailed traces. Cost example: 1M requests/day unsampled is roughly 100GB of traces per month; 1% sampling brings that to about 1GB/month. Budget: start with 1-2% of infrastructure cost, optimize from there.
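The back-of-envelope math behind that cost example, as a small sketch; the average trace size (~3.3 KB) is an assumption chosen so the numbers line up with the rough figures above.

```python
# Sketch: estimate monthly trace storage under head-based sampling.
# avg_trace_size_bytes is an assumed figure, not a measured one.
requests_per_day = 1_000_000
avg_trace_size_bytes = 3_300
days = 30

def monthly_trace_gb(sample_rate: float) -> float:
    traces = requests_per_day * days * sample_rate
    return traces * avg_trace_size_bytes / 1e9

print(f"unsampled: {monthly_trace_gb(1.0):.0f} GB/month")   # ~99 GB
print(f"1% sample: {monthly_trace_gb(0.01):.1f} GB/month")  # ~1 GB
```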
What's the difference between monitoring and observability?
Monitoring = checking known failure modes (alert if CPU >80%, disk >90%, response time >1s). Observability = investigating unknown-unknowns (why is latency spiking? why are timeouts increasing?). Monitoring is reactive; observability is exploratory. You need both: monitoring keeps the lights on (alerts), observability helps you debug when they flicker. Observability lets you ask any question of the system: metrics, logs, and traces are queryable without predefined dashboards.
How do I propagate trace context across services?
Use W3C Trace Context standard headers (traceparent + tracestate). When Service A calls Service B via HTTP, inject traceparent header with trace ID and parent span ID. Service B extracts it and creates child spans under the same trace. Frameworks (Spring, Django, Node) do this automatically with auto-instrumentation. For async/message queues: encode trace ID in message headers (Kafka, RabbitMQ headers). Cost: ~50 bytes per message, negligible. Without propagation: traces break at service boundaries, defeating the entire purpose.
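A minimal sketch of the inject/extract pattern in Python with OpenTelemetry's propagation API, assuming the SDK is already configured; the service URL and handler names are illustrative.

```python
# Sketch: propagating W3C Trace Context across an HTTP hop with OpenTelemetry.
# The default propagator injects/extracts the 'traceparent' (and 'tracestate') headers.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Service A: inject trace context into outgoing request headers
def call_service_b():
    with tracer.start_as_current_span("call_service_b"):
        headers = {}
        inject(headers)  # adds 'traceparent' with the current trace/span IDs
        requests.get("http://service-b/orders", headers=headers)  # illustrative URL

# Service B: extract the context and continue the same trace
def handle_request(incoming_headers: dict):
    ctx = extract(incoming_headers)  # parent context parsed from 'traceparent'
    with tracer.start_as_current_span("handle_orders", context=ctx):
        pass  # child span joins Service A's trace
```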
