▶ What's the difference between RTO and RPO, and how do I set targets?
RTO (Recovery Time Objective) = max acceptable downtime (e.g., 1 hour). RPO (Recovery Point Objective) = max acceptable data loss, measured as time (e.g., 15 minutes). A 1-hour RTO with a 15-min RPO means: restore service within 60 min AND lose no more than the last 15 min of transactions. Set targets based on business impact: financial systems need tight targets (RTO < 15 min, RPO < 5 min); internal tools can tolerate 4-8 hours. Use SLOs as backpressure: if the targets are impossible given current infrastructure cost, negotiate with the business on either the targets or the budget.
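To make the targets concrete, here's a minimal sketch that checks a backup schedule and a measured restore time against RPO/RTO targets. The numbers and the 5-minute detection delay are illustrative assumptions, not recommendations:

```python
# Minimal sketch: does a backup schedule + measured restore time meet
# the stated RPO/RTO targets? All numbers are illustrative.

def meets_targets(backup_interval_min: float, restore_duration_min: float,
                  rpo_min: float, rto_min: float) -> dict:
    """Worst-case data loss equals the backup interval; worst-case
    downtime is detection plus restore (detection assumed 5 min here)."""
    detection_min = 5  # assumed monitoring/alerting delay
    return {
        "rpo_ok": backup_interval_min <= rpo_min,
        "rto_ok": detection_min + restore_duration_min <= rto_min,
    }

# Example: 15-min snapshots, 40-min restore vs. a 1-hour RTO / 15-min RPO
print(meets_targets(backup_interval_min=15, restore_duration_min=40,
                    rpo_min=15, rto_min=60))
# -> {'rpo_ok': True, 'rto_ok': True}
```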
▶ When should I use active-active vs active-passive multi-region DR?
Active-passive (cheaper, simpler): the primary region handles all traffic while a standby region sits idle. Failover takes 5-30 min (depending on DNS TTL and health-check intervals) and costs ~50% extra. Active-active (expensive, complex): traffic splits across regions, so failover needs no planned downtime, but stateful systems require distributed consensus, which is hard. Costs 2x-3x. Use active-passive for most apps; go active-active only if a sub-5-min RTO is critical and you already have cross-region replication plus global load balancing. Most orgs run active-passive until they outgrow it.
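For active-passive, failover is often DNS-based. A hedged sketch using Route 53 failover records via boto3; the zone ID, IPs, and health-check ID are placeholders, not real resources:

```python
# Sketch of DNS-based active-passive failover with Route 53 (boto3).
import boto3

route53 = boto3.client("route53")

def upsert_failover_record(role: str, ip: str, health_check_id: str | None):
    record = {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": role,           # distinguishes the two records
        "Failover": role.upper(),        # PRIMARY or SECONDARY
        "TTL": 60,                       # low TTL keeps failover fast
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:                  # primary fails over when unhealthy
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [{"Action": "UPSERT",
                                  "ResourceRecordSet": record}]},
    )

upsert_failover_record("primary", "203.0.113.10", "hc-primary-example")
upsert_failover_record("secondary", "203.0.113.20", None)
```

The low TTL matters: the 5-30 min failover window above is dominated by how long resolvers cache the primary's record.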
▶ How do I make backups ransomware-proof?
Immutable backups: once written, a backup cannot be modified or deleted (WORM = Write Once Read Many). AWS Backup Vault Lock makes recovery points in a vault immutable. S3 Object Lock enforces a retention period (commonly 7-30 days for backups) during which even a compromised admin account cannot delete objects. Air-gapped backups: an offline copy in another account (cross-account, cross-region) protected by IAM deny policies. Test restores monthly: ransomware often corrupts backups silently, and only a verified restore catches it. Use separate, least-privilege credentials for backup access, and never grant data-deletion privileges to application service accounts.
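A minimal sketch of WORM writes with S3 Object Lock via boto3. Bucket and key names are placeholders; note that Object Lock must be enabled when the bucket is created, it cannot be added later:

```python
# Immutable (WORM) backup writes with S3 Object Lock.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

s3.create_bucket(
    Bucket="example-backup-vault",
    ObjectLockEnabledForBucket=True,   # must be set at creation time
)

# COMPLIANCE mode: nobody, including the root account, can delete the
# object before the retention date passes.
s3.put_object(
    Bucket="example-backup-vault",
    Key="db/2024-06-01/full.dump",
    Body=b"...backup payload (placeholder)...",
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=30),
)
```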
▶ Should I test DR plans, and how often?
Yes, mandatory. Untested DR plans fail when they're needed (roughly 55% of orgs can't actually recover). Frequency: critical systems monthly (full failover), standard apps quarterly, low-priority annually. A full-failover test brings up the backup region, routes traffic to it, and verifies both data and functionality; expect $2k-$20k per test (infrastructure plus staff time). Chaos engineering (game-day exercises) surfaces hidden assumptions: network lag, partial failures, DNS propagation delay. Document findings and update RTO/RPO accordingly. Automated chaos tooling (e.g., Gremlin) catches regressions cheaply between game days.
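The "verify data + functionality" step is worth automating. A hedged sketch of a post-failover validation check; the endpoints, the replication-checkpoint API, and the 15-minute RPO threshold are assumptions about a hypothetical app:

```python
# Automated post-failover validation for a DR test (hypothetical endpoints).
import requests

def validate_failover(standby_url: str) -> bool:
    # 1. Service answers from the standby region
    health = requests.get(f"{standby_url}/healthz", timeout=5)
    if health.status_code != 200:
        return False
    # 2. Data made it across: check a cheap integrity signal, e.g. the
    #    replication lag the app exposes (assumed endpoint)
    checkpoint = requests.get(f"{standby_url}/replication/checkpoint",
                              timeout=5).json()
    return checkpoint.get("lag_seconds", float("inf")) < 900  # 15-min RPO

if validate_failover("https://standby.example.com"):
    print("failover validated: within RPO, service healthy")
else:
    print("failover FAILED validation: do not cut traffic over")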
▶ How do I calculate the cost of a warm vs hot standby?
Warm standby (data synced, app offline): 40-60% of primary cost (storage + replication, minimal compute). RTO: 15-45 min. Hot standby (fully running, accepting queries): 100-150% of primary cost. RTO: 0-5 min (near-instant). Cold standby (backups only, no standing infra): 10-20% of primary cost. RTO: 2-8 hours. For a $50k/month primary: warm = $20-30k/month, hot = $50-75k/month, cold = $5-10k/month. Break-even: compare annual standby spend against expected annual outage loss (frequency × cost per outage). At ~$100k per outage, a few outages per year already cover warm's ~$300k/year; hot's ~$750k/year pays off only when outages are frequent (on the order of 20/year) or far more expensive.
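The same arithmetic, worked with the $50k/month primary from above. Outage frequency and cost are illustrative, and the comparison deliberately simplifies: in practice you'd also weight each tier by how much of every outage it actually avoids (hot shortens outages far more than cold):

```python
# Worked break-even example for standby tiers (illustrative numbers).
PRIMARY_MONTHLY = 50_000

tiers = {            # fraction of primary cost (midpoints of the ranges)
    "cold": 0.15,
    "warm": 0.50,
    "hot":  1.25,
}

outages_per_year = 4
cost_per_outage = 100_000
expected_loss = outages_per_year * cost_per_outage   # $400k/year

for tier, frac in tiers.items():
    annual_standby = PRIMARY_MONTHLY * frac * 12
    verdict = "pays off" if annual_standby < expected_loss else "does not"
    print(f"{tier}: ${annual_standby:,.0f}/yr standby "
          f"vs ${expected_loss:,.0f}/yr expected loss -> {verdict}")
```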
▶ What does a modern DR plan include?
Runbook per critical system: detection → failover → validation → rollback procedures. Architecture diagrams (primary + standby, data flow). RTO/RPO per app + SLA commitments. Test schedule + past results. Roles assigned (who triggers failover, who runs restore, comms lead). Incident contact list. Backup inventory (location, retention, encryption). Terraform/CloudFormation for reprovisioning (IaC = faster recovery). Validation scripts (health checks, data integrity). Recovery metrics dashboard (restore duration trend). Annual plan review.
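One way to keep that inventory reviewable is to store runbook metadata as code next to the IaC. A sketch; the schema and field names are an assumption, not a standard:

```python
# Machine-readable runbook entry, versioned alongside the Terraform.
from dataclasses import dataclass, field

@dataclass
class Runbook:
    system: str
    rto_minutes: int
    rpo_minutes: int
    failover_owner: str          # who triggers failover
    restore_owner: str           # who runs the restore
    comms_lead: str
    steps: list[str] = field(default_factory=list)  # detection -> rollback
    last_tested: str = "never"

payments = Runbook(
    system="payments-api",
    rto_minutes=15, rpo_minutes=5,
    failover_owner="oncall-sre", restore_owner="dba-team",
    comms_lead="incident-commander",
    steps=["confirm alert", "promote standby DB", "repoint DNS",
           "run validation script", "roll back if checks fail"],
    last_tested="2024-05-10",
)
```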
▶ How do I handle backup + restore for databases at scale?
Snapshot-based (AWS/Azure): fast (minutes), supports point-in-time recovery, works for most DBs. Transaction-log shipping (e.g., WAL archiving with pgBackRest, MySQL binary logs): continuous, lower RPO, but slower restore. Logical replication (primary → standby): near-zero-downtime failover, read scaling on the standby, lag typically 0-500 ms. For PostgreSQL: use pgBackRest (WAL archiving + full backups). For MySQL: Percona XtraBackup (incremental backups) plus binary-log shipping. For DynamoDB/Firestore: enable multi-region replication (e.g., DynamoDB global tables). Test restores on a nonprod replica to validate compression and encryption settings.
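A hedged sketch of a snapshot-based test restore to a throwaway nonprod instance using boto3/RDS. The identifiers and instance class are placeholders; a real run should follow the wait with integrity checks (row counts, checksums) and then delete the instance:

```python
# Snapshot-based test restore to a nonprod RDS instance (placeholders).
import boto3

rds = boto3.client("rds")

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="restore-test-2024-06-01",    # throwaway name
    DBSnapshotIdentifier="prod-db-snapshot-2024-06-01",
    DBInstanceClass="db.t3.medium",                    # small class keeps the test cheap
)

# Block until the restored instance is reachable
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="restore-test-2024-06-01"
)
print("restore completed; run integrity checks, then delete the instance")
```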
▶ What's the role of chaos engineering in DR testing?
Chaos engineering validates assumptions: network latency, partial failures, resource exhaustion. Kill random pods (Kubernetes) → does failover trigger? Slow down the S3 API → does the restore hang? Inject packet loss → do replication-lag alarms fire? Tools: Gremlin (SaaS, simplest), Chaoskube (Kubernetes), Pumba (Docker). Start small: one game day per quarter on a non-critical system. Measure MTTR (Mean Time To Recover); teams commonly see 30-50% improvement after the first chaos run. Keep results in the runbook and share findings with the on-call team.
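A minimal pod-kill game day using the official Kubernetes Python client. The namespace and label selector are assumptions; as suggested above, point it only at a non-critical system:

```python
# Pod-kill chaos sketch: delete one random pod, then observe recovery.
import random
from kubernetes import client, config

config.load_kube_config()                 # uses your current kubecontext
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="staging", label_selector="app=checkout"
).items
victim = random.choice(pods)

# Delete the pod, then watch: does the Deployment reschedule it, and do
# failover/health alarms fire within the window your runbook expects?
v1.delete_namespaced_pod(victim.metadata.name, "staging")
print(f"killed {victim.metadata.name}; observe recovery and alerting")
```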