▶ What's the difference between RTO and RPO, and how do I set targets?
RTO (Recovery Time Objective) = max acceptable downtime (e.g., 1 hour). RPO (Recovery Point Objective) = max acceptable data loss, measured as time (e.g., 15 minutes). A 1-hour RTO with a 15-min RPO means: restore service within 60 min AND lose no more than the last 15 min of transactions. Set targets based on business impact: financial systems need tight targets (RTO < 15 min, RPO < 5 min); internal tools can tolerate 4-8 hours. Use SLOs as backpressure: if the targets are impossible given current infrastructure cost, negotiate with the business on either the targets or the budget.
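To make the targets concrete, here's a minimal sketch that checks a backup schedule and a measured restore time against RPO/RTO targets. The numbers and the 5-minute detection delay are illustrative assumptions, not recommendations:

```python
# Minimal sketch: does a backup schedule + measured restore time meet
# the stated RPO/RTO targets? All numbers are illustrative.

def meets_targets(backup_interval_min: float, restore_duration_min: float,
                  rpo_min: float, rto_min: float) -> dict:
    """Worst-case data loss equals the backup interval; worst-case
    downtime is detection plus restore (detection assumed 5 min here)."""
    detection_min = 5  # assumed monitoring/alerting delay
    return {
        "rpo_ok": backup_interval_min <= rpo_min,
        "rto_ok": detection_min + restore_duration_min <= rto_min,
    }

# Example: 15-min snapshots, 40-min restore vs. a 1-hour RTO / 15-min RPO
print(meets_targets(backup_interval_min=15, restore_duration_min=40,
                    rpo_min=15, rto_min=60))
# -> {'rpo_ok': True, 'rto_ok': True}
```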
▶ When should I use active-active vs active-passive multi-region DR?
Active-passive (cheaper, simpler): the primary region handles all traffic while a standby region sits idle. Failover takes 5-30 min (depending on DNS TTL and health-check intervals) and costs ~50% extra. Active-active (expensive, complex): traffic splits across regions, so failover needs no planned downtime, but stateful systems require distributed consensus, which is hard. Costs 2x-3x. Use active-passive for most apps; go active-active only if a sub-5-min RTO is critical and you already have cross-region replication plus global load balancing. Most orgs run active-passive until they outgrow it.
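For active-passive, failover is often DNS-based. A hedged sketch using Route 53 failover records via boto3; the zone ID, IPs, and health-check ID are placeholders, not real resources:

```python
# Sketch of DNS-based active-passive failover with Route 53 (boto3).
import boto3

route53 = boto3.client("route53")

def upsert_failover_record(role: str, ip: str, health_check_id: str | None):
    record = {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": role,           # distinguishes the two records
        "Failover": role.upper(),        # PRIMARY or SECONDARY
        "TTL": 60,                       # low TTL keeps failover fast
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:                  # primary fails over when unhealthy
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [{"Action": "UPSERT",
                                  "ResourceRecordSet": record}]},
    )

upsert_failover_record("primary", "203.0.113.10", "hc-primary-example")
upsert_failover_record("secondary", "203.0.113.20", None)
```

The low TTL matters: the 5-30 min failover window above is dominated by how long resolvers cache the primary's record.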
▶ How do I make backups ransomware-proof?
Immutable backups: once written, a backup cannot be modified or deleted (WORM = Write Once Read Many). AWS Backup Vault Lock makes recovery points in a vault immutable. S3 Object Lock enforces a retention period (commonly 7-30 days for backups) during which even a compromised admin account cannot delete objects. Air-gapped backups: an offline copy in another account (cross-account, cross-region) protected by IAM deny policies. Test restores monthly: ransomware often corrupts backups silently, and only a verified restore catches it. Use separate, least-privilege credentials for backup access, and never grant data-deletion privileges to application service accounts.
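A minimal sketch of WORM writes with S3 Object Lock via boto3. Bucket and key names are placeholders; note that Object Lock must be enabled when the bucket is created, it cannot be added later:

```python
# Immutable (WORM) backup writes with S3 Object Lock.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

s3.create_bucket(
    Bucket="example-backup-vault",
    ObjectLockEnabledForBucket=True,   # must be set at creation time
)

# COMPLIANCE mode: nobody, including the root account, can delete the
# object before the retention date passes.
s3.put_object(
    Bucket="example-backup-vault",
    Key="db/2024-06-01/full.dump",
    Body=b"...backup payload (placeholder)...",
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=30),
)
```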
▶ Should I test DR plans, and how often?
Yes, mandatory. Untested DR plans fail when they're needed (roughly 55% of orgs can't actually recover). Frequency: critical systems monthly (full failover), standard apps quarterly, low-priority annually. A full-failover test brings up the backup region, routes traffic to it, and verifies both data and functionality; expect $2k-$20k per test (infrastructure plus staff time). Chaos engineering (game-day exercises) surfaces hidden assumptions: network lag, partial failures, DNS propagation delay. Document findings and update RTO/RPO accordingly. Automated chaos tooling (e.g., Gremlin) catches regressions cheaply between game days.
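The "verify data + functionality" step is worth automating. A hedged sketch of a post-failover validation check; the endpoints, the replication-checkpoint API, and the 15-minute RPO threshold are assumptions about a hypothetical app:

```python
# Automated post-failover validation for a DR test (hypothetical endpoints).
import requests

def validate_failover(standby_url: str) -> bool:
    # 1. Service answers from the standby region
    health = requests.get(f"{standby_url}/healthz", timeout=5)
    if health.status_code != 200:
        return False
    # 2. Data made it across: check a cheap integrity signal, e.g. the
    #    replication lag the app exposes (assumed endpoint)
    checkpoint = requests.get(f"{standby_url}/replication/checkpoint",
                              timeout=5).json()
    return checkpoint.get("lag_seconds", float("inf")) < 900  # 15-min RPO

if validate_failover("https://standby.example.com"):
    print("failover validated: within RPO, service healthy")
else:
    print("failover FAILED validation: do not cut traffic over")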
▶ How do I calculate the cost of a warm vs hot standby?
Warm standby (data synced, app offline): 40-60% of primary cost (storage + replication, minimal compute). RTO: 15-45 min. Hot standby (fully running, accepting queries): 100-150% of primary cost. RTO: 0-5 min (near-instant). Cold standby (backups only, no standing infra): 10-20% of primary cost. RTO: 2-8 hours. For a $50k/month primary: warm = $20-30k/month, hot = $50-75k/month, cold = $5-10k/month. Break-even: compare annual standby spend against expected annual outage loss (frequency × cost per outage). At ~$100k per outage, a few outages per year already cover warm's ~$300k/year; hot's ~$750k/year pays off only when outages are frequent (on the order of 20/year) or far more expensive.
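The same arithmetic, worked with the $50k/month primary from above. Outage frequency and cost are illustrative, and the comparison deliberately simplifies: in practice you'd also weight each tier by how much of every outage it actually avoids (hot shortens outages far more than cold):

```python
# Worked break-even example for standby tiers (illustrative numbers).
PRIMARY_MONTHLY = 50_000

tiers = {            # fraction of primary cost (midpoints of the ranges)
    "cold": 0.15,
    "warm": 0.50,
    "hot":  1.25,
}

outages_per_year = 4
cost_per_outage = 100_000
expected_loss = outages_per_year * cost_per_outage   # $400k/year

for tier, frac in tiers.items():
    annual_standby = PRIMARY_MONTHLY * frac * 12
    verdict = "pays off" if annual_standby < expected_loss else "does not"
    print(f"{tier}: ${annual_standby:,.0f}/yr standby "
          f"vs ${expected_loss:,.0f}/yr expected loss -> {verdict}")
```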
▶ What does a modern DR plan include?
Runbook per critical system: detection → failover → validation → rollback procedures. Architecture diagrams (primary + standby, data flow). RTO/RPO per app + SLA commitments. Test schedule + past results. Roles assigned (who triggers failover, who runs restore, comms lead). Incident contact list. Backup inventory (location, retention, encryption). Terraform/CloudFormation for reprovisioning (IaC = faster recovery). Validation scripts (health checks, data integrity). Recovery metrics dashboard (restore duration trend). Annual plan review.
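One way to keep that inventory reviewable is to store runbook metadata as code next to the IaC. A sketch; the schema and field names are an assumption, not a standard:

```python
# Machine-readable runbook entry, versioned alongside the Terraform.
from dataclasses import dataclass, field

@dataclass
class Runbook:
    system: str
    rto_minutes: int
    rpo_minutes: int
    failover_owner: str          # who triggers failover
    restore_owner: str           # who runs the restore
    comms_lead: str
    steps: list[str] = field(default_factory=list)  # detection -> rollback
    last_tested: str = "never"

payments = Runbook(
    system="payments-api",
    rto_minutes=15, rpo_minutes=5,
    failover_owner="oncall-sre", restore_owner="dba-team",
    comms_lead="incident-commander",
    steps=["confirm alert", "promote standby DB", "repoint DNS",
           "run validation script", "roll back if checks fail"],
    last_tested="2024-05-10",
)
```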
▶ How do I handle backup + restore for databases at scale?
Snapshot-based (AWS/Azure): fast (minutes), supports point-in-time recovery, works for most DBs. Transaction-log shipping (e.g., WAL archiving with pgBackRest, MySQL binary logs): continuous, lower RPO, but slower restore. Logical replication (primary → standby): near-zero-downtime failover, read scaling on the standby, lag typically 0-500 ms. For PostgreSQL: use pgBackRest (WAL archiving + full backups). For MySQL: Percona XtraBackup (incremental backups) plus binary-log shipping. For DynamoDB/Firestore: enable multi-region replication (e.g., DynamoDB global tables). Test restores on a nonprod replica to validate compression and encryption settings.
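A hedged sketch of a snapshot-based test restore to a throwaway nonprod instance using boto3/RDS. The identifiers and instance class are placeholders; a real run should follow the wait with integrity checks (row counts, checksums) and then delete the instance:

```python
# Snapshot-based test restore to a nonprod RDS instance (placeholders).
import boto3

rds = boto3.client("rds")

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="restore-test-2024-06-01",    # throwaway name
    DBSnapshotIdentifier="prod-db-snapshot-2024-06-01",
    DBInstanceClass="db.t3.medium",                    # small class keeps the test cheap
)

# Block until the restored instance is reachable
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="restore-test-2024-06-01"
)
print("restore completed; run integrity checks, then delete the instance")
```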
▶ What's the role of chaos engineering in DR testing?
Chaos engineering validates assumptions: network latency, partial failures, resource exhaustion. Kill random pods (Kubernetes) → does failover trigger? Slow down the S3 API → does the restore hang? Inject packet loss → do replication-lag alarms fire? Tools: Gremlin (SaaS, simplest), Chaoskube (Kubernetes), Pumba (Docker). Start small: one game day per quarter on a non-critical system. Measure MTTR (Mean Time To Recover); teams commonly see 30-50% improvement after the first chaos run. Keep results in the runbook and share findings with the on-call team.
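A minimal pod-kill game day using the official Kubernetes Python client. The namespace and label selector are assumptions; as suggested above, point it only at a non-critical system:

```python
# Pod-kill chaos sketch: delete one random pod, then observe recovery.
import random
from kubernetes import client, config

config.load_kube_config()                 # uses your current kubecontext
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="staging", label_selector="app=checkout"
).items
victim = random.choice(pods)

# Delete the pod, then watch: does the Deployment reschedule it, and do
# failover/health alarms fire within the window your runbook expects?
v1.delete_namespaced_pod(victim.metadata.name, "staging")
print(f"killed {victim.metadata.name}; observe recovery and alerting")
```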