Question 1

Prometheus vs Datadog vs New Relic: why choose Prometheus?

Accepted Answer

Prometheus is open source and self-hosted, with no license fee and no per-metric charge. Datadog and New Relic are usage-priced SaaS products; the figure used here is roughly $0.10-1.50 per metric per month. At 1,000 metrics over 12 months, that comes to $1,200-18,000/year, while Prometheus costs $0/year in licensing (you still pay for compute and storage). The trade-off is that you manage the infrastructure yourself: HA, backups, storage. For startups and cost-conscious orgs: Prometheus + Grafana. For fully managed SaaS: Datadog or New Relic. Hybrid: Prometheus for internal services, Datadog for complex third-party integrations.
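The cost comparison above is plain multiplication, and a short sketch makes it easy to rerun with your own numbers. The function name is illustrative, and the flat per-metric price is the simplifying assumption used in the answer, not a vendor's actual billing formula.

```python
def annual_metric_cost(metric_count: int, price_per_metric_month: float) -> float:
    """Yearly bill under a flat per-metric monthly price (simplified model)."""
    return metric_count * price_per_metric_month * 12


# The range quoted in the answer: 1,000 metrics at $0.10 to $1.50 each per month.
low = annual_metric_cost(1_000, 0.10)
high = annual_metric_cost(1_000, 1.50)
print(f"${low:,.0f} to ${high:,.0f} per year")
```

To compare against self-hosting, put the yearly cost of the servers and storage you would run Prometheus on next to these figures.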

Question 2

What is PromQL and how difficult is it to learn?

Accepted Answer

PromQL is Prometheus's time-series query language. Learning curve: 1-2 weeks for the basics (`rate()`, `increase()`, `topk()`). Many users find it simpler than SQL for time-series questions and more approachable than Splunk SPL. Key concepts: selectors (label matching), range vectors (a window of samples such as `[5m]`), aggregation (`sum`/`avg`/`max`), and arithmetic between series. Most Prometheus users write ~30 standard queries repeatedly; advanced features (histograms, quantiles) take 4-8 weeks to master. Beginner PromQL queries: `rate(http_requests_total[5m])` (requests per second, averaged over 5 minutes) and `1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes` (fraction of memory in use; multiply by 100 for a percentage). Practice on the public demo at play.grafana.org.
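To build intuition for what `rate()` does with a counter, here is a simplified sketch in Python. It is not Prometheus's implementation: it handles counter resets but skips the extrapolation to the window edges that the real function performs, and the sample data is made up.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the given samples.

    Simplified relative to PromQL's rate(): counter resets are handled,
    but there is no extrapolation to the edges of the range window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# Hypothetical 60-second window scraped every 15s; the process restarts before t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

The reset handling is the reason `rate()` should only be applied to counters: on a gauge, every legitimate decrease would be misread as a restart.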

Question 3

Cardinality explosion: what is it and how do I avoid it?

Accepted Answer

Cardinality is the number of unique label combinations for a metric, and each combination is its own time series. If a metric `http_requests_total` has labels `method`, `status`, and `path`, and `path` is unbounded (it includes user IDs, timestamps, or random values), you will create millions of unique time series. Each one costs memory and query time. Example: `http_requests_total{method='GET',status='200',path='/users/123'}` × thousands of user IDs × every method and status = potentially millions of series. Fixes: set `sample_limit` in the scrape config (for example, 10,000 samples per scrape; a scrape that exceeds the limit is rejected), remove unbounded labels (drop `user_id`), or bucket high-cardinality values (`path='/users/*'` instead of individual paths). Monitor the active series count with `prometheus_tsdb_head_series`.
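The effect of bucketing an unbounded `path` label can be shown with a small sketch. The helper names are hypothetical, and the normalization rule (replace purely numeric path segments with `*`) is one possible choice; in practice this normalization usually happens in the application's instrumentation code before the metric is exposed.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def normalize_path(path: str) -> str:
    """Replace purely numeric path segments with '*' so all IDs share one label value."""
    return re.sub(r"/\d+(?=/|$)", "/*", path)


def series_count(label_sets: Iterable[Dict[str, str]]) -> int:
    """Number of distinct label combinations, i.e. distinct time series."""
    unique: Set[Tuple[Tuple[str, str], ...]] = set()
    for labels in label_sets:
        unique.add(tuple(sorted(labels.items())))
    return len(unique)


# One label set per user ID, as in the /users/123 example.
raw = [{"method": "GET", "status": "200", "path": f"/users/{uid}"} for uid in range(5000)]
bucketed = [{**labels, "path": normalize_path(labels["path"])} for labels in raw]

print(series_count(raw))       # one series per user ID
print(series_count(bucketed))  # a single series after bucketing
```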

Question 4

Recording rules vs alerting rules: what's the difference?

Accepted Answer

Recording rules precompute expensive queries and store the results as new time series. Example: instead of calculating `rate(http_requests_total[5m])` on every dashboard load, create a recording rule that evaluates it every 30 seconds and stores the result as `http_requests_per_second`. Use them for complex queries, high-cardinality aggregations, and multi-step math. Alerting rules evaluate an expression on the same schedule and fire an alert while it returns results. Example: alert if `http_error_rate > 0.05` (an error rate above 5%, assuming the metric is stored as a ratio). Recording rules are for optimization; alerting rules are for notification. Most setups use both: recording rules precompute metrics, and alerting rules trigger on them.
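The two rule types live side by side in the same rule file. The sketch below shows the shape of such a file as a Python dictionary purely for illustration; real rule files are YAML. The field names (`groups`, `interval`, `record`, `alert`, `expr`, `for`, `labels`) are the ones Prometheus uses, while the group name, alert name, `for` duration, and severity label are assumptions, and `http_error_rate` is taken to be a ratio between 0 and 1.

```python
import json

rule_file = {
    "groups": [
        {
            "name": "http_rules",  # hypothetical group name
            "interval": "30s",     # evaluate every 30 seconds, as in the example
            "rules": [
                {
                    # Recording rule: stores the expression result as a new series.
                    "record": "http_requests_per_second",
                    "expr": "rate(http_requests_total[5m])",
                },
                {
                    # Alerting rule: fires once the expression has held for 5 minutes.
                    "alert": "HighErrorRate",          # hypothetical alert name
                    "expr": "http_error_rate > 0.05",  # assumes a 0-1 ratio
                    "for": "5m",                       # assumed duration
                    "labels": {"severity": "page"},    # assumed label
                },
            ],
        }
    ]
}


def rule_kind(rule: dict) -> str:
    """A rule with a 'record' key is a recording rule; one with 'alert' is an alerting rule."""
    return "recording" if "record" in rule else "alerting"


for rule in rule_file["groups"][0]["rules"]:
    print(rule_kind(rule), "->", rule["expr"])
print(json.dumps(rule_file, indent=2))
```

Once written as YAML, a rule file can be validated with `promtool check rules <file>` before Prometheus loads it.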

Question 5

RED vs USE method: which monitoring approach should I use?

Accepted Answer

RED (Rate, Errors, Duration) is for services: monitor request rate, error rate, and request duration (p50/p95/p99). Example: for an HTTP service, track requests per second, the percentage of 5xx responses, and response time. USE (Utilization, Saturation, Errors) is for resources: monitor utilization of CPU, memory, and disk, saturation (such as queue depth), and errors. Example: for a database server, track CPU %, memory %, query queue depth, and failed queries. Most teams use both: RED for application-level SLOs, USE for infrastructure health. Prometheus handles both well, so combine them to cover services and the resources they run on.
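As a rough illustration of the RED side, the sketch below derives the three signals from a list of raw request records. The `Request` type and the nearest-rank percentile are assumptions made for this example; Prometheus itself estimates quantiles from histogram buckets rather than from individual requests.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Rate, Errors, Duration for the requests observed in one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_pct": 100 * errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }


# Made-up sample: 100 requests in a 60s window, every 20th one a 500.
sample = [Request(500 if i % 20 == 0 else 200, i / 100) for i in range(1, 101)]
print(red_summary(sample, window_s=60))
```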

Question 6

Mimir vs Thanos: which long-term storage solution do I need?

Accepted Answer

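Thanos is an open-source set of components (sidecar, store gateway, querier, compactor) that adds long-term retention in object storage (years of history, limited mainly by storage cost) and cross-cluster queries to existing Prometheus servers. Mimir is Grafana Labs' open-source, horizontally scalable metrics backend, optimized for scale and multi-tenancy; you can self-host it or consume it as a managed service through Grafana Cloud. Thanos: free, DIY ops, object storage (S3, MinIO, GCS, Azure), downsampling. Mimir: built-in multi-tenancy, query federation, object storage for blocks, and a managed option quoted here at ~$0.26-2.50/GB/month on Grafana Cloud. Choose Thanos for: self-hosting alongside existing Prometheus servers, long history, cost control. Choose Mimir for: managed simplicity, high availability, multi-team isolation.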
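Downsampling is what keeps queries over long history affordable: raw samples are collapsed into coarser windows that keep a few aggregates each. The sketch below illustrates the idea for 5-minute windows. It is a conceptual model only, not Thanos code, and the function name and sample data are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample(
    samples: Iterable[Tuple[float, float]], resolution_s: int = 300
) -> Dict[int, Dict[str, float]]:
    """Group (timestamp, value) samples into fixed windows and keep per-window aggregates."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        window_start = int(ts // resolution_s) * resolution_s
        buckets[window_start].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Ten minutes of a gauge scraped every 15 seconds: 40 raw samples become 2 windows.
raw_samples = [(ts, float(ts)) for ts in range(0, 600, 15)]
for start, agg in downsample(raw_samples).items():
    print(start, agg)
```

Keeping count, sum, min, and max per window means averages and extremes can still be answered from the downsampled data, at a fraction of the sample count.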

Question 7

How do I get started: install locally or use Grafana Cloud?

Accepted Answer

For learning: run Prometheus and Grafana locally with Docker in about 5 minutes (`docker run -p 9090:9090 prom/prometheus`, then open localhost:9090). For production on Kubernetes: use the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, and Alertmanager in one chart. For managed SaaS: Grafana Cloud (free tier: 10k metrics, 7-day retention). Kubernetes teams commonly rely on kube-prometheus-stack together with node_exporter and kube-state-metrics, both of which the chart deploys by default, for cluster-wide observability. Many community Grafana dashboards are shared at grafana.com/dashboards, so import templates rather than building from scratch.
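Once a local Prometheus is running, a quick way to confirm it works is to send an instant query to its HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. The base URL assumes the container's port 9090 is published on localhost, and the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result item carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def instant_query(base_url: str, promql: str, timeout_s: float = 5.0):
    """Run an instant query and return the parsed result."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout_s) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # Assumes Prometheus is reachable at localhost:9090.
    for labels, value in instant_query("http://localhost:9090", "up"):
        print(labels, value)
```

The `up` metric is generated by Prometheus for every scrape target, so it returns data even before any exporters are configured beyond the server's own self-scrape.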

Region	Junior	Mid	Senior
USA	$98k	$155k	$215k
UK	£55k	£92k	£140k
EU	€62k	€98k	€150k
CANADA	C$102k	C$160k	C$220k
