▶ Monitoring vs Observability: are they the same thing?
Monitoring tells you WHEN something is wrong (alerts, thresholds on known failure modes). Observability tells you WHY it's wrong (logs, metrics, traces you can query while debugging). Modern systems need both. Monitoring is metrics-driven: do we have a problem? Observability is question-driven: why do we have a problem, including questions you didn't anticipate? Build observability first; monitoring alerts follow from data you already collect. A minimal sketch of the split is below.
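A minimal sketch of that split, using a hypothetical checkout handler (the field names and the 5% threshold are illustrative, not from any particular stack): the monitoring half reduces everything to one threshold question, while the observability half records enough structured context to answer why.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

ERROR_RATE_THRESHOLD = 0.05  # illustrative: alert if >5% of requests fail

def handle_request(user_id: str, cart_total: float, ok: bool) -> None:
    # Observability: a structured event with enough context to ask "why?"
    log.info(json.dumps({
        "ts": time.time(),
        "event": "checkout",
        "user_id": user_id,       # hypothetical fields, for illustration only
        "cart_total": cart_total,
        "ok": ok,
    }))

def check_alert(errors: int, total: int) -> bool:
    # Monitoring: one number against one threshold -> "do we have a problem?"
    return total > 0 and errors / total > ERROR_RATE_THRESHOLD

handle_request(user_id="u42", cart_total=19.99, ok=False)
print("alert?", check_alert(errors=6, total=100))  # True: 6% > 5%
```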
▶ Should I start with Prometheus or Datadog?
Prometheus: free, open source, pull-based metrics, best for self-hosted Kubernetes clusters; requires operational overhead (storage, alerting setup). Datadog: SaaS, push-based, bundles logs + traces + APM, priced per host plus per custom metric. For learning: start with Prometheus (free). For a production SaaS team: Datadog or New Relic (operational simplicity). Many shops use both: Prometheus for internal metrics, Datadog for external monitoring.
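If you go the Prometheus route, instrumentation is a few lines with the official Python client (prometheus_client); the metric names and port below are arbitrary. Note the pull model: the app only exposes /metrics, and Prometheus scrapes it.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Arbitrary example metrics; Prometheus will scrape them from :8000/metrics.
REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # pull-based: expose /metrics, nothing is pushed
    while True:
        with LATENCY.time():
            time.sleep(random.uniform(0.01, 0.2))  # simulated work
        REQUESTS.labels(status="200" if random.random() > 0.05 else "500").inc()
```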
▶ What are the 3 pillars of observability?
Logs (what happened): request logs, error messages, application events. Metrics (how much): CPU usage, request count, latency, error rates. Traces (where the time went): distributed tracing across microservices, showing a request's flow through services. Together they give you the full picture: metrics show the symptom, logs show the context, traces show the path. A correlated example follows.
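A sketch of all three pillars emitted from one request path, using the OpenTelemetry Python SDK for the trace (the console exporter and the bare counter are stand-ins for real backends): the key move is stamping the trace ID into the log line so you can pivot between pillars.

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter just for demonstration; a real setup exports to a backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("shop")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shop")

request_count = 0  # stand-in for a real metrics client

def handle_checkout() -> None:
    global request_count
    with tracer.start_as_current_span("checkout") as span:   # trace: the path
        trace_id = format(span.get_span_context().trace_id, "032x")
        request_count += 1                                    # metric: how much
        log.info("checkout started trace_id=%s", trace_id)   # log: what happened

handle_checkout()
```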
▶ How do I handle high-cardinality metrics without blowing up my bill?
High cardinality = many unique label combinations (e.g., per-user metrics). The problem: each combination is a separate billable time series, so a cardinality explosion can multiply your Datadog bill. Solutions: (1) avoid unbounded labels (user IDs, request IDs), (2) use tag groups/exclusions, (3) pre-aggregate on the client side, (4) sample noisy metrics, (5) prefer counters over per-entity gauges. Watch cardinality growth in the Datadog metrics explorer before it surprises you.
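A before/after sketch of point (1) with prometheus_client (label names and values are hypothetical): the fix is designing labels around small fixed vocabularies instead of identifiers.

```python
from prometheus_client import Counter

# Anti-pattern: user_id is unbounded, so every new user mints a new time series.
# requests_by_user = Counter("requests_total", "Requests served", ["user_id"])

# Bounded alternative: collapse identifiers into small fixed vocabularies.
REQUESTS = Counter("requests_total", "Requests served", ["plan", "region"])

def record_request(plan: str, region: str) -> None:
    # plan comes from {"free", "pro", "enterprise"}; region from a fixed set,
    # so total series = |plans| x |regions|, not |users|.
    REQUESTS.labels(plan=plan, region=region).inc()

record_request("pro", "eu-west-1")
```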
▶ What makes a good SLO (Service Level Objective)?
SLO = target reliability (e.g., 99.9% availability = 'three nines'). Start with 99% (~7.3h downtime/month). Good SLOs: (1) based on user impact, not infrastructure, (2) achievable but aspirational, (3) business-aligned (not arbitrary), (4) tracked with error budgets. Don't just copy Google's SLOs. If you're targeting 99.99%, you need 24/7 on-call rotations and a budget to match. The math is sketched below.
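The error-budget arithmetic behind those targets, as a small worked example (using an average-length month of ~730.5 hours):

```python
HOURS_PER_MONTH = 730.5  # average month: 365.25 days / 12 * 24 h

def downtime_budget_hours(slo: float) -> float:
    """Allowed downtime per month for a given availability SLO."""
    return (1 - slo) * HOURS_PER_MONTH

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_hours(slo):.2f} h/month")
# 99.00% -> ~7.3 h/month
# 99.90% -> ~0.73 h/month (~44 min)
# 99.99% -> ~0.07 h/month (~4.4 min)
```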
▶ How do I debug a distributed trace that's slow?
Use Jaeger or Honeycomb. To find the slowest span: (1) look at end-to-end latency on the trace timeline, (2) identify the service with the worst duration, (3) check that service's logs at that timestamp, (4) look for blocking calls (DB queries, network waits), (5) check whether sequential spans could run concurrently (parallelism opportunities). For correlation, put the trace ID in your logs so you can jump between traces and logs. Tag traces with user ID / request ID early; see the sketch below.
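For that last tip, a sketch with the OpenTelemetry Python API (attribute keys are illustrative; provider setup as in the pillars sketch above): set IDs on the root span as early as possible so slow traces are filterable by user and joinable to logs.

```python
from opentelemetry import trace

tracer = trace.get_tracer("api")

def handle(request_id: str, user_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        # Tag the root span early: slow traces become filterable by user,
        # and the same IDs in your logs let you jump trace <-> logs.
        span.set_attribute("request.id", request_id)
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("db.query"):
            pass  # a blocking call shows up as a long child span in the timeline

handle(request_id="req-123", user_id="u42")
```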
▶ What's the cost-benefit of OpenTelemetry vs proprietary agents?
OpenTelemetry: vendor-agnostic, no lock-in, more instrumentation work, future-proof. Proprietary (Datadog agent, New Relic): easier setup, built-in optimizations, vendor lock-in. Best practice: instrument with OpenTelemetry, export to your chosen backend (Datadog/Honeycomb/GCP). Gives you portability: switch backends without re-instrumenting.
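The portability claim in code (OpenTelemetry Python SDK; the console exporter is a stand-in, and the OTLP exporter named in the comment ships in the separate opentelemetry-exporter-otlp package): only the exporter line changes per backend, while the instrumentation stays identical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Swap this one line to change backends; the instrumentation below never changes.
# e.g. OTLPSpanExporter(endpoint=...) from opentelemetry-exporter-otlp
exporter = ConsoleSpanExporter()

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("app")
with tracer.start_as_current_span("demo"):
    pass  # vendor-agnostic instrumentation: no Datadog/Honeycomb code here
```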