JobCannon

Disaster Recovery & Backups

Prepare for worst-case: RTO/RPO, backup strategies, failover

⬒ TIER 2 · Tech
Salary impact: +$15k-
Time to learn: 5 months
Difficulty: Medium
Careers: 12

TL;DR

Disaster Recovery (DR) is the capability to recover from catastrophic failures (data loss, outages, ransomware) within defined RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets. DR separates operational backups (point-in-time recovery) from strategic resilience (multi-region failover, chaos engineering). Career path: Practitioner (RTO/RPO targets, backup automation, $90-120k) → Architect (multi-region failover, active-active, $130-180k) → Expert (chaos testing, ransomware-proof strategies, cost optimization, $180-250k) over 4-6 months. Lives adjacent to cloud-platforms, monitoring-observability, cybersecurity.

What is Disaster Recovery & Backups

Disaster recovery (DR) means recovering from catastrophic failures such as data loss and outages. Two targets define it: RTO (Recovery Time Objective) and RPO (Recovery Point Objective). It is critical for production systems. L1: Backup scripts, point-in-time recovery

🔧 TOOLS & ECOSYSTEM
AWS Backup, AWS DRS, Azure Site Recovery, Velero, Veeam, Rubrik, Cohesity, Restic, BorgBackup, pgBackRest, mysqldump, S3 Cross-Region Replication, Terraform

📋 Before you start

💰 Salary by region

Region   Junior   Mid      Senior
USA      $110k    $155k    $210k
UK       £65k     £90k     £125k
EU       €70k     €95k     €140k
Canada   C$120k   C$165k   C$225k

❓ FAQ

What's the difference between RTO and RPO, and how do I set targets?
RTO (Recovery Time Objective) = max acceptable downtime (e.g., 1 hour). RPO (Recovery Point Objective) = max acceptable data loss, measured in time (e.g., 15 minutes). A 1-hour RTO with a 15-min RPO means: restore service within 60 min AND lose no more than 15 min of transactions. Set targets based on business impact: financial systems need tight targets (RTO < 15 min, RPO < 5 min); internal tools can tolerate 4-8 hours. Treat the targets like SLOs that engineering has to meet: if they are impossible at the current infrastructure budget, renegotiate them with the business.
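
A minimal sketch of checking one system against its targets, using the 1-hour RTO / 15-min RPO example above. The function name and the sample measurements (30-min backup interval, 40-min restore drill) are hypothetical, and the check assumes worst-case data loss equals one full backup interval.

```python
from datetime import timedelta


def meets_targets(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for one system.

    Assumption: worst-case data loss is one full backup interval,
    because a failure can hit just before the next backup runs.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = restore_duration <= rto
    return rpo_ok, rto_ok


# Targets from the example in the text.
rpo = timedelta(minutes=15)
rto = timedelta(hours=1)

# Hypothetical measurements: backups every 30 min, last restore drill took 40 min.
result = meets_targets(timedelta(minutes=30), timedelta(minutes=40), rpo, rto)
print(result)  # (False, True): restore is fast enough, backups are too sparse
```

Here the RTO is met but the RPO is not, so the fix is more frequent backups (or log shipping), not a faster restore.
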
When should I use active-active vs active-passive multi-region DR?
Active-passive (cheaper, simpler): the primary region handles all traffic; the standby region stays idle. Failover takes 5-30 min (depends on DNS TTL + health checks). Costs ~50% extra. Active-active (expensive, complex): traffic splits across regions, so losing one region needs no separate failover step. Requires keeping data consistent across regions (distributed consensus, hard for stateful systems). Costs 2x-3x. Use active-passive for most apps; choose active-active only if an RTO of 5 min or less is critical and you have cross-region replication + global load balancing. Most orgs use active-passive until they outgrow it.
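
To see where active-passive failover time goes, here is a rough sketch that adds up detection, standby promotion, and DNS cache expiry. The sequential model, the function name, and all input values are assumptions for illustration; real failovers can overlap these phases or add others.

```python
def estimate_failover_seconds(dns_ttl, check_interval, failure_threshold, promote_seconds):
    """Rough sequential model of an active-passive failover.

    detection: health checks must fail `failure_threshold` times in a row
    promotion: time to bring the standby region into service
    dns_ttl:   clients may keep resolving to the old region until the TTL expires
    """
    detection = check_interval * failure_threshold
    return detection + promote_seconds + dns_ttl


# Hypothetical settings: 60 s TTL, 30 s checks, 3 failures to trip, 5 min promotion.
seconds = estimate_failover_seconds(
    dns_ttl=60, check_interval=30, failure_threshold=3, promote_seconds=300
)
print(f"{seconds / 60:.1f} min")  # 7.5 min
```

With these inputs, promotion dominates; lowering the DNS TTL alone would save at most one minute.
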
How do I make backups ransomware-proof?
Immutable backups: once written, backups cannot be modified or deleted (WORM = Write Once Read Many). AWS Backup supports immutability for backup vaults (Vault Lock). S3 Object Lock enforces a retention period (e.g., 7-30 days); in compliance mode an attacker cannot delete locked object versions before it expires. Air-gapped backups: an isolated copy in another account (cross-account, cross-region) protected by IAM deny policies. Test restores monthly: ransomware often corrupts backups silently; verification catches it. Use separate credentials for backup access (least privilege). Never grant data-deletion privileges to app service accounts.
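
A sketch of the kind of IAM deny policy mentioned above, built as a Python dict and printed as JSON. The statement ID and the choice of actions are illustrative assumptions, not a complete or vetted policy, and `"Resource": "*"` should be narrowed to the actual backup buckets and vaults in practice.

```python
import json


def backup_deny_policy():
    """Build an identity-based IAM policy that denies deleting backup data.

    Assumption: the action list is a small illustrative subset.
    An explicit Deny overrides any Allow the principal also has.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBackupDeletion",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": [
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "backup:DeleteRecoveryPoint",
                    "backup:DeleteBackupVault",
                ],
                "Resource": "*",  # scope this down in a real policy
            }
        ],
    }


print(json.dumps(backup_deny_policy(), indent=2))
```

Attaching a policy like this to app service accounts is one way to apply the "never grant data-deletion privileges" rule.
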
Should I test DR plans, and how often?
Yes, mandatory. Untested DR plans fail when needed (55% of orgs can't actually recover). Frequency: critical systems monthly (full failover), standard apps quarterly, low-priority annually. Full-failover test: bring up backup region, route traffic, verify data + functionality. Costs $2k-$20k per test (infrastructure + teams). Chaos engineering (game-day exercises) catches assumptions: network lag, partial failures, DNS delay. Document findings; update RTO/RPO. Automated chaos (e.g., Gremlin) catches regressions cheaply.
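
The test frequencies above can be tracked with a small overdue check. This is a sketch: the tier names and the day counts (30, 90, 365 as approximations of monthly, quarterly, annually) are assumptions, as are the sample dates.

```python
from datetime import date, timedelta

# Approximate day counts for the monthly / quarterly / annual cadence in the text.
TEST_INTERVAL_DAYS = {"critical": 30, "standard": 90, "low": 365}


def is_test_overdue(tier, last_test, today):
    """Return True if the last DR test is older than the tier's interval."""
    return today - last_test > timedelta(days=TEST_INTERVAL_DAYS[tier])


# Hypothetical dates: last test on 1 Jan 2024, checked on 15 Feb 2024.
overdue = is_test_overdue("critical", date(2024, 1, 1), date(2024, 2, 15))
print(overdue)  # True: 45 days since the last test, limit is 30
```
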
How do I calculate the cost of a warm vs hot standby?
Warm standby (data synced, app offline): 40-60% of primary cost (storage + replication, no compute). RTO: 15-45 min. Hot standby (fully running, accepting queries): 100-150% of primary cost. RTO: 0-5 min (near-instant). Cold standby (backups only, no standing infra): 10-20% of primary cost. RTO: 2-8 hours. For a $50k/month primary: warm=$20-30k, hot=$50-75k, cold=$5-10k per month. Rule of thumb for break-even: with fewer than 5 outages/year costing more than $100k each, warm pays off; with more than 20 outages/year, invest in hot.
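
The percentage ranges above turn into monthly figures with a few lines of arithmetic. A sketch, using only the fractions and the $50k/month example from the answer; the names are hypothetical.

```python
# Standby cost as a fraction of primary cost, taken from the ranges in the text.
STANDBY_COST_FRACTION = {
    "cold": (0.10, 0.20),
    "warm": (0.40, 0.60),
    "hot": (1.00, 1.50),
}


def standby_cost_range(primary_monthly_cost, mode):
    """Return the (low, high) monthly cost of a standby in the given mode."""
    low, high = STANDBY_COST_FRACTION[mode]
    return primary_monthly_cost * low, primary_monthly_cost * high


for mode in ("cold", "warm", "hot"):
    low, high = standby_cost_range(50_000, mode)
    print(f"{mode}: ${low:,.0f}-${high:,.0f} per month")
```

For the $50k/month primary this reproduces the figures quoted above: cold $5-10k, warm $20-30k, hot $50-75k.
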
What does a modern DR plan include?
Runbook per critical system: detection → failover → validation → rollback procedures. Architecture diagrams (primary + standby, data flow). RTO/RPO per app + SLA commitments. Test schedule + past results. Roles assigned (who triggers failover, who runs restore, comms lead). Incident contact list. Backup inventory (location, retention, encryption). Terraform/CloudFormation for reprovisioning (IaC = faster recovery). Validation scripts (health checks, data integrity). Recovery metrics dashboard (restore duration trend). Annual plan review.
How do I handle backup + restore for databases at scale?
Snapshot-based (AWS/Azure): fast (minutes), point-in-time recovery, works for most DBs. Shipping transaction logs (PostgreSQL WAL archiving with pgBackRest, MySQL binary logs): continuous, lower RPO, slower restore because the logs must be replayed. Note that mysqldump produces logical dumps and does not ship logs. Logical replication (primary → standby): near-zero-downtime failover, read-scaling on standby, lag 0-500ms. For PostgreSQL: use pgBackRest (WAL archiving + full backups). For MySQL: Percona XtraBackup (incremental backups) + binary log shipping. For DynamoDB/Firestore: use the managed multi-region replication these services offer (e.g., DynamoDB global tables). Test restores on a nonprod replica to validate compression + encryption.
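
Point-in-time recovery from full backups plus shipped logs follows one idea: restore the latest full backup taken before the target time, then replay logs up to the target. The sketch below shows only that selection step. It is a conceptual illustration with hypothetical names and timestamps, not how pgBackRest or any other tool implements it.

```python
from datetime import datetime


def pick_base_backup(full_backups, target):
    """Return the latest full backup taken at or before the recovery target."""
    candidates = [b for b in full_backups if b <= target]
    if not candidates:
        raise ValueError("no full backup exists before the recovery target")
    return max(candidates)


# Hypothetical nightly full backups and a recovery target of 2 March, 14:30.
backups = [datetime(2024, 3, 1), datetime(2024, 3, 2), datetime(2024, 3, 3)]
target = datetime(2024, 3, 2, 14, 30)

base = pick_base_backup(backups, target)
replay_window = target - base  # span of transaction logs that must be replayed
print(base, replay_window)  # 2024-03-02 00:00:00 14:30:00
```

The longer the replay window, the slower the restore, which is why log shipping trades a lower RPO for a longer recovery time.
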
What's the role of chaos engineering in DR testing?
Chaos validates assumptions: network latency, partial failures, resource exhaustion. Kill random pods (Kubernetes) → does failover trigger? Slow down S3 API → does restore hang? Inject packet loss → do replication lag alarms fire? Tools: Gremlin (SaaS, simplest), Chaoskube (Kubernetes), Pumba (Docker). Start: 1 game-day/quarter on non-critical system. Measure: MTTR (Mean Time To Recover) improves 30-50% after first chaos run. Keep results in runbook. Share findings with on-call team.
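
Measuring the MTTR change between game-days is simple arithmetic. A sketch with made-up recovery times (in minutes); the function names and numbers are hypothetical and only illustrate the calculation.

```python
from statistics import mean


def mttr_minutes(recovery_minutes):
    """Mean Time To Recover: average of the observed recovery durations."""
    return mean(recovery_minutes)


def improvement_percent(before, after):
    """Relative MTTR reduction, as a percentage of the earlier value."""
    return (before - after) / before * 100


# Hypothetical recovery times from incidents before and after a first game-day.
before = mttr_minutes([50, 70, 60])
after = mttr_minutes([35, 45, 40])
print(f"MTTR {before:.0f} -> {after:.0f} min, {improvement_percent(before, after):.0f}% better")
# MTTR 60 -> 40 min, 33% better
```
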
