▶ Monitoring vs Observability: are they the same thing?
Monitoring tells you WHEN something is wrong (alerts, thresholds on known failure modes). Observability tells you WHY it's wrong (logs, metrics, traces you can query while debugging). Modern systems need both. Monitoring is metrics-driven: do we have a problem? Observability is question-driven: why do we have a problem, including questions you didn't anticipate? Build observability first; monitoring alerts follow from data you already collect. A minimal sketch of the split is below.
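A minimal sketch of that split, using a hypothetical checkout handler (the field names and the 5% threshold are illustrative, not from any particular stack): the monitoring half reduces everything to one threshold question, while the observability half records enough structured context to answer why.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

ERROR_RATE_THRESHOLD = 0.05  # illustrative: alert if >5% of requests fail

def handle_request(user_id: str, cart_total: float, ok: bool) -> None:
    # Observability: a structured event with enough context to ask "why?"
    log.info(json.dumps({
        "ts": time.time(),
        "event": "checkout",
        "user_id": user_id,       # hypothetical fields, for illustration only
        "cart_total": cart_total,
        "ok": ok,
    }))

def check_alert(errors: int, total: int) -> bool:
    # Monitoring: one number against one threshold -> "do we have a problem?"
    return total > 0 and errors / total > ERROR_RATE_THRESHOLD

handle_request(user_id="u42", cart_total=19.99, ok=False)
print("alert?", check_alert(errors=6, total=100))  # True: 6% > 5%
```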
▶ Should I start with Prometheus or Datadog?
Prometheus: free, open source, pull-based metrics, best for self-hosted Kubernetes clusters; requires operational overhead (storage, alerting setup). Datadog: SaaS, push-based, bundles logs + traces + APM, priced per host plus per custom metric. For learning: start with Prometheus (free). For a production SaaS team: Datadog or New Relic (operational simplicity). Many shops use both: Prometheus for internal metrics, Datadog for external monitoring.
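If you go the Prometheus route, instrumentation is a few lines with the official Python client (prometheus_client); the metric names and port below are arbitrary. Note the pull model: the app only exposes /metrics, and Prometheus scrapes it.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Arbitrary example metrics; Prometheus will scrape them from :8000/metrics.
REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # pull-based: expose /metrics, nothing is pushed
    while True:
        with LATENCY.time():
            time.sleep(random.uniform(0.01, 0.2))  # simulated work
        REQUESTS.labels(status="200" if random.random() > 0.05 else "500").inc()
```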
▶ What are the 3 pillars of observability?
Logs (what happened): request logs, error messages, application events. Metrics (how much): CPU usage, request count, latency, error rates. Traces (where the time went): distributed tracing across microservices, showing a request's flow through services. Together they give you the full picture: metrics show the symptom, logs show the context, traces show the path. A correlated example follows.
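A sketch of all three pillars emitted from one request path, using the OpenTelemetry Python SDK for the trace (the console exporter and the bare counter are stand-ins for real backends): the key move is stamping the trace ID into the log line so you can pivot between pillars.

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter just for demonstration; a real setup exports to a backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("shop")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shop")

request_count = 0  # stand-in for a real metrics client

def handle_checkout() -> None:
    global request_count
    with tracer.start_as_current_span("checkout") as span:   # trace: the path
        trace_id = format(span.get_span_context().trace_id, "032x")
        request_count += 1                                    # metric: how much
        log.info("checkout started trace_id=%s", trace_id)   # log: what happened

handle_checkout()
```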
▶ How do I handle high-cardinality metrics without blowing up my bill?
High cardinality = many unique label combinations (e.g., per-user metrics). The problem: each combination is a separate billable time series, so a cardinality explosion can multiply your Datadog bill. Solutions: (1) avoid unbounded labels (user IDs, request IDs), (2) use tag groups/exclusions, (3) pre-aggregate on the client side, (4) sample noisy metrics, (5) prefer counters over per-entity gauges. Watch cardinality growth in the Datadog metrics explorer before it surprises you.
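A before/after sketch of point (1) with prometheus_client (label names and values are hypothetical): the fix is designing labels around small fixed vocabularies instead of identifiers.

```python
from prometheus_client import Counter

# Anti-pattern: user_id is unbounded, so every new user mints a new time series.
# requests_by_user = Counter("requests_total", "Requests served", ["user_id"])

# Bounded alternative: collapse identifiers into small fixed vocabularies.
REQUESTS = Counter("requests_total", "Requests served", ["plan", "region"])

def record_request(plan: str, region: str) -> None:
    # plan comes from {"free", "pro", "enterprise"}; region from a fixed set,
    # so total series = |plans| x |regions|, not |users|.
    REQUESTS.labels(plan=plan, region=region).inc()

record_request("pro", "eu-west-1")
```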
▶ What makes a good SLO (Service Level Objective)?
SLO = target reliability (e.g., 99.9% availability = 'three nines'). Start with 99% (~7.3h downtime/month). Good SLOs: (1) based on user impact, not infrastructure, (2) achievable but aspirational, (3) business-aligned (not arbitrary), (4) tracked with error budgets. Don't just copy Google's SLOs. If you're targeting 99.99%, you need 24/7 on-call rotations and a budget to match. The math is sketched below.
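The error-budget arithmetic behind those targets, as a small worked example (using an average-length month of ~730.5 hours):

```python
HOURS_PER_MONTH = 730.5  # average month: 365.25 days / 12 * 24 h

def downtime_budget_hours(slo: float) -> float:
    """Allowed downtime per month for a given availability SLO."""
    return (1 - slo) * HOURS_PER_MONTH

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_hours(slo):.2f} h/month")
# 99.00% -> ~7.3 h/month
# 99.90% -> ~0.73 h/month (~44 min)
# 99.99% -> ~0.07 h/month (~4.4 min)
```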
▶ How do I debug a distributed trace that's slow?
Use Jaeger or Honeycomb. To find the slowest span: (1) look at end-to-end latency on the trace timeline, (2) identify the service with the worst duration, (3) check that service's logs at that timestamp, (4) look for blocking calls (DB queries, network waits), (5) check whether sequential spans could run concurrently (parallelism opportunities). For correlation, put the trace ID in your logs so you can jump between traces and logs. Tag traces with user ID / request ID early; see the sketch below.
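For that last tip, a sketch with the OpenTelemetry Python API (attribute keys are illustrative; provider setup as in the pillars sketch above): set IDs on the root span as early as possible so slow traces are filterable by user and joinable to logs.

```python
from opentelemetry import trace

tracer = trace.get_tracer("api")

def handle(request_id: str, user_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        # Tag the root span early: slow traces become filterable by user,
        # and the same IDs in your logs let you jump trace <-> logs.
        span.set_attribute("request.id", request_id)
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("db.query"):
            pass  # a blocking call shows up as a long child span in the timeline

handle(request_id="req-123", user_id="u42")
```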
▶ What's the cost-benefit of OpenTelemetry vs proprietary agents?
OpenTelemetry: vendor-agnostic, no lock-in, more instrumentation work, future-proof. Proprietary (Datadog agent, New Relic): easier setup, built-in optimizations, vendor lock-in. Best practice: instrument with OpenTelemetry, export to your chosen backend (Datadog/Honeycomb/GCP). Gives you portability: switch backends without re-instrumenting.
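The portability claim in code (OpenTelemetry Python SDK; the console exporter is a stand-in, and the OTLP exporter named in the comment ships in the separate opentelemetry-exporter-otlp package): only the exporter line changes per backend, while the instrumentation stays identical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Swap this one line to change backends; the instrumentation below never changes.
# e.g. OTLPSpanExporter(endpoint=...) from opentelemetry-exporter-otlp
exporter = ConsoleSpanExporter()

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("app")
with tracer.start_as_current_span("demo"):
    pass  # vendor-agnostic instrumentation: no Datadog/Honeycomb code here
```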