
Monitoring & Observability

Know what's happening in production: logs, metrics, traces, alerts

TIER 2 · Tech

Salary impact: +$30k-$65k
Time to learn: 6 months
Difficulty: Hard
Careers: 4
TL;DR

Monitoring & Observability helps you understand what's happening in production systems through logs, metrics, and traces. Career path: Practitioner (basic logs/alerts, CloudWatch/Datadog, $120-150k) → Specialist (distributed tracing, SLOs, incident response, $150-190k) → Architect (observability platform design, OpenTelemetry, multi-service correlation, $190-250k+) over 6-8 months. Salary premium: $30k-$65k above base backend (DevOps/SRE tier). Essential for production reliability and incident response. Competes with application performance monitoring (APM) specialists but requires broader systems thinking.

What is Monitoring & Observability

Monitoring and observability are how you understand what's happening in production systems. Monitoring tells you when something is wrong (alerts); observability tells you why it's wrong (debugging with logs, metrics, and traces). Essential for DevOps, SRE, and backend roles.

- Production reliability: you can't fix what you can't see

🔧 TOOLS & ECOSYSTEM
Datadog, Prometheus, Grafana, New Relic, Splunk, ELK Stack, Honeycomb, OpenTelemetry, PagerDuty, Sentry, Jaeger, CloudWatch

💰 Salary by region

Region    Junior    Mid       Senior
USA       $110k     $155k     $220k
UK        £65k      £100k     £150k
EU        €70k      €105k     €160k
Canada    C$115k    C$160k    C$235k

❓ FAQ

Monitoring vs Observability: are they the same thing?
Monitoring tells you WHEN something is wrong (alerts, thresholds). Observability tells you WHY it's wrong (logs, metrics, traces, debugging). Monitoring is metrics-driven (do we have a problem?); observability is data-driven (why do we have a problem?). Modern systems need both: build observability first, and the monitoring alerts follow.
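A toy sketch of the split, with hypothetical data and thresholds: monitoring answers a fixed, pre-decided question, while observability keeps enough raw context to answer questions you didn't plan for.

```python
# Hypothetical events and thresholds, for illustration only.
events = [  # raw, high-context events you kept "just in case"
    {"route": "/checkout", "status": 500, "region": "eu-west-1", "ms": 1200},
    {"route": "/checkout", "status": 200, "region": "us-east-1", "ms": 80},
    {"route": "/search",   "status": 200, "region": "eu-west-1", "ms": 45},
]

# Monitoring: a fixed question, answered with a pre-aggregated number.
error_rate = sum(e["status"] >= 500 for e in events) / len(events)
if error_rate > 0.01:
    print(f"ALERT: error rate {error_rate:.0%}")  # "we have a problem"

# Observability: an ad-hoc question you didn't plan for in advance.
slow_eu = [e for e in events if e["region"] == "eu-west-1" and e["ms"] > 1000]
print(slow_eu)  # "the problem is /checkout in eu-west-1"
```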
Should I start with Prometheus or Datadog?
Prometheus: free, open source, pull-based metrics; best for self-hosted Kubernetes clusters, but it carries operational overhead (storage, alerting setup). Datadog: SaaS, push-based, bundles logs + traces + APM, and is priced per host and custom metric. For learning, start with Prometheus (free). For production SaaS, Datadog or New Relic buys operational simplicity. Many teams use both: Prometheus for internal metrics, Datadog for external monitoring.
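For a feel of the Prometheus side, here is a minimal pull-based exporter using the official prometheus_client package (pip install prometheus-client); the metric names and port are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

def handle_request() -> None:
    with LATENCY.time():                       # records duration into the histogram
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()
```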
What are the 3 pillars of observability?
Logs (what happened): request logs, error messages, application events. Metrics (how much): CPU usage, request counts, latency, error rates. Traces (where time was spent): distributed tracing across microservices, showing a request's flow through services. Together they give you the full picture: metrics show the symptom, logs show the context, traces show the path.
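A toy illustration of one failing request emitting all three pillars (names and structure hypothetical; real systems use a metrics client, a structured logger, and a tracing SDK):

```python
import json
import time
import uuid

METRICS = {"checkout_errors_total": 0}          # metrics: how much

def handle_checkout() -> None:
    trace_id = uuid.uuid4().hex                 # traces: one ID ties it all together
    start = time.time()
    try:
        raise TimeoutError("payment gateway timed out")  # simulated failure
    except TimeoutError as exc:
        METRICS["checkout_errors_total"] += 1   # the metric shows the symptom
        print(json.dumps({                      # the log shows the context
            "trace_id": trace_id,
            "error": str(exc),
            "span": "POST /checkout -> payments-svc",  # the trace shows the path
            "duration_ms": round((time.time() - start) * 1000, 2),
        }))

handle_checkout()
```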
How do I handle high cardinality metrics without breaking my bill?
High cardinality = many unique label combinations (e.g., per-user metrics). The problem: a cardinality explosion can multiply your Datadog bill, because each unique series is billed. Solutions: (1) avoid unbounded labels (user IDs, request IDs), (2) use tag groups/exclusions, (3) pre-aggregate on the client side, (4) sample noisy metrics, (5) prefer bounded counters over per-entity gauges. Watch cardinality growth in the Datadog metrics explorer before it surprises you.
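A sketch of bounded vs unbounded labels with prometheus_client (metric names illustrative); every unique label combination becomes its own time series, which is what drives storage and cost:

```python
from prometheus_client import Counter

# BAD (hypothetical): user_id is unbounded, so this creates one series per user.
# requests = Counter("requests_total", "Requests", ["user_id"])

# BETTER: bound every label to a small, known set of values.
requests = Counter("requests_total", "Requests", ["endpoint", "status_class"])

def record(endpoint: str, status: int) -> None:
    requests.labels(
        endpoint=endpoint,                  # bounded: you control the route list
        status_class=f"{status // 100}xx",  # pre-aggregated: ~5 values, not ~60
    ).inc()

record("/checkout", 503)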
What makes a good SLO (Service Level Objective)?
SLO = a target level of service, usually uptime % (e.g., 99.9% = 'three nines'). Start with 99% (~7.2h downtime per 30-day month). Good SLOs: (1) based on user impact, not infrastructure, (2) achievable but aspirational, (3) business-aligned (not arbitrary), (4) tracked with error budgets. Don't just copy Google's SLOs: if you're targeting 99.99% (~4.3 min/month), you need on-call rotations and real budget to back it.
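The error-budget arithmetic behind those targets, assuming a 30-day month:

```python
def downtime_budget_hours(slo: float, days: int = 30) -> float:
    """Allowed downtime per period: the unused fraction of the SLO."""
    return (1 - slo) * days * 24

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_hours(slo):.2f} h/month")
# 99.00% -> 7.20 h/month
# 99.90% -> 0.72 h/month (~43 min)
# 99.99% -> 0.07 h/month (~4.3 min)
```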
How do I debug a distributed trace that's slow?
Use Jaeger or Honeycomb and find the slowest span: (1) check end-to-end latency on the trace timeline, (2) identify the service with the worst duration, (3) check that service's logs at that timestamp, (4) look for blocking calls (DB queries, network waits), (5) check whether sibling spans could run concurrently (parallelism opportunities). For correlation, put the trace ID in logs so you can jump between traces and logs, and tag traces with user ID / request ID early.
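A sketch of step (2) over hypothetical span records (the shape Jaeger or Honeycomb render on a timeline): the span with the most self-time, its duration minus time covered by direct children, is usually the real bottleneck.

```python
# Hypothetical trace: gateway -> orders -> postgres, times in ms.
spans = [
    {"id": "a", "parent": None, "svc": "gateway",  "start": 0,  "end": 950},
    {"id": "b", "parent": "a",  "svc": "orders",   "start": 10, "end": 930},
    {"id": "c", "parent": "b",  "svc": "postgres", "start": 40, "end": 890},
]

def self_time(span: dict) -> int:
    """Span duration minus time spent in its direct children."""
    children = [s for s in spans if s["parent"] == span["id"]]
    child_ms = sum(c["end"] - c["start"] for c in children)
    return (span["end"] - span["start"]) - child_ms

worst = max(spans, key=self_time)
print(worst["svc"], self_time(worst), "ms of self-time")  # postgres 850 ms
```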
What's the cost-benefit of OpenTelemetry vs proprietary agents?
OpenTelemetry: vendor-agnostic, no lock-in, more instrumentation work up front, future-proof. Proprietary agents (Datadog, New Relic): easier setup, built-in optimizations, vendor lock-in. Best practice: instrument with OpenTelemetry and export to your chosen backend (Datadog/Honeycomb/GCP). That buys you portability: switch backends without re-instrumenting.
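A minimal sketch with the OpenTelemetry Python SDK (pip install opentelemetry-sdk); service and attribute names are illustrative. Swapping the ConsoleSpanExporter for an OTLP exporter pointed at Datadog, Honeycomb, etc. needs no changes to the instrumentation itself:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK once at startup; the exporter is the only vendor-specific part.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("charge_card") as span:
    span.set_attribute("order.id", "o-123")  # tag early: eases log/trace pivots
    pass  # ... call the payment gateway here ...
```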

Not sure this skill is for you?

Take a 10-min Career Match and we'll suggest the right tracks.

Find my best-fit skills →

Find your ideal career path

Skill-based matching across 2,536 careers. Free, ~10 minutes.

Take Career Match (free) →