▶ What's the difference between MTTR, MTTD, and MTBF? Why do they matter?
MTTD (mean time to detect) = how fast your monitoring spots the issue, MTTR (mean time to repair) = how fast your team fixes it, MTBF (mean time between failures) = how often outages happen. For a SaaS losing $10M per hour of downtime, cutting MTTD from 30 minutes to 5 minutes saves roughly $4M per incident. Incident management focuses on MTTD (better alerts) and MTTR (faster triage and resolution); SRE/reliability engineering focuses on MTBF (designing systems that fail less often). Track all three alongside your SLOs.
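A back-of-the-envelope sketch of that arithmetic in Python; the $10M/hour figure and the fixed 30-minute repair time are illustrative assumptions, not benchmarks:

```python
REVENUE_PER_HOUR = 10_000_000  # hypothetical cost of downtime for this SaaS

def downtime_cost(mttd_min: float, mttr_min: float) -> float:
    """Cost of one incident: time to detect plus time to repair."""
    outage_hours = (mttd_min + mttr_min) / 60
    return outage_hours * REVENUE_PER_HOUR

slow = downtime_cost(mttd_min=30, mttr_min=30)
fast = downtime_cost(mttd_min=5, mttr_min=30)
print(f"Savings from faster detection: ${slow - fast:,.0f}")  # ~$4.2M
```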
▶ What makes a blameless postmortem actually work? Aren't you just excusing bad behavior?
Blameless means focusing on systems, not individuals. Questions to ask: Why didn't monitoring catch this sooner? Why didn't the runbook cover this? Why did we have a single point of failure? This finds root causes and prevents repeats. Contrast 'Alice pushed bad code' (blaming) with 'Deploys lack pre-prod testing' (systemic). Blameless doesn't mean consequence-free: if someone is repeatedly reckless, that's a management conversation. But the vast majority of outages are process or architecture failures, not individual incompetence. Teams that run blameless postmortems surface and fix more systemic issues and retain engineers better.
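One way to make the systemic framing structural is a postmortem template where causes and action items belong to systems and teams, never to a named person. This is a hypothetical sketch; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str  # e.g. "Add pre-prod smoke tests to the deploy pipeline"
    owner_team: str   # owned by a team, not pinned on an individual
    due_date: str

@dataclass
class Postmortem:
    incident_id: str
    summary: str
    timeline: list[str] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)  # systems, not people
    action_items: list[ActionItem] = field(default_factory=list)
```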
▶ How do incident severity levels actually work? SEV-1 vs SEV-2 vs SEV-3?
Standard scale: SEV-1 = complete service down or major feature broken for >10% of users (all hands on deck, declare incident), SEV-2 = significant degradation (single region down, 1-10% impact, scheduled escalation), SEV-3 = minor or cosmetic (single user or non-critical path, monitor and log). Define thresholds upfront (% of users + duration) so triage is consistent, not emotional. Page oncall for SEV-1/2 only. Keep SEV-1s rare (< 1/month is healthy). If you're declaring 10+ SEV-1s per week, your monitoring is noisy or your architecture is fragile.
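A sketch of codifying those thresholds so triage stays mechanical; the cutoffs mirror this answer and the function name is just an example:

```python
def assign_severity(pct_users_affected: float, core_down: bool) -> str:
    """Map impact to a severity level using pre-agreed thresholds."""
    if core_down or pct_users_affected > 10:
        return "SEV-1"  # all hands, declare the incident
    if pct_users_affected >= 1:
        return "SEV-2"  # significant degradation, scheduled escalation
    return "SEV-3"      # minor or cosmetic, monitor and log

assert assign_severity(15, core_down=False) == "SEV-1"
assert assign_severity(3, core_down=False) == "SEV-2"
assert assign_severity(0.1, core_down=False) == "SEV-3"
```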
▶ How do I design on-call rotations without burning people out? 8 hours? 24 hours? Async?
Healthy pattern: 1-week rotations for 1 person (not pairs unless you're >100 engineers). Async on-call means the responder has 15 minutes to engage; sync means they're paged immediately. For a 50-person team: 1 week sync (you can get paged at 3am), then 3-4 weeks async (15-minute response SLA). Pair with 'follow the sun' across time zones. Provide escalation paths (if the on-call doesn't respond in 5 minutes, page the next person). Compensate: 1-2 extra days off after an on-call week, or a $100-300/week on-call stipend. The burnout signal is people dreading their week, which eventually shows up as turnover. Track pages per week per on-call; more than ~2 pages/week means too much toil.
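A sketch of that toil check: count pages per on-call week and flag rotations over the ~2 pages/week threshold. The page-log format here is made up for illustration, not any paging tool's export:

```python
from collections import Counter

# (engineer on call, ISO week number) for each page received
page_log = [("alice", 12), ("alice", 12), ("alice", 12), ("bob", 13)]

pages_per_week = Counter(week for _, week in page_log)
for week, count in sorted(pages_per_week.items()):
    status = "too much toil" if count > 2 else "ok"
    print(f"week {week}: {count} pages ({status})")
```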
▶ What's an incident commander and what do they actually do during a SEV-1?
The incident commander (IC) is the single voice during a crisis: takes reports, prioritizes actions, delegates work, and communicates to stakeholders. Not the person fixing the bug (that's the engineers), but the conductor. The IC's job: 1) declare the incident start, 2) set severity, 3) form a war room (chat channel, optional call), 4) keep a timeline (started at 14:23 UTC, mitigation at 14:31, resolved at 14:47), 5) delegate (database team diagnoses, frontend team rolls back, comms person updates the status page), 6) give status updates every 10 minutes. IC training is short (a few hours). Rotate the IC role: it teaches leadership and systems thinking.
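A minimal sketch of the timeline-keeping duty: append UTC-stamped entries as events happen so the postmortem timeline writes itself. The helper and the example messages are illustrative:

```python
from datetime import datetime, timezone

timeline: list[str] = []

def log_event(event: str) -> None:
    """Record an incident event with a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M UTC")
    timeline.append(f"{stamp}  {event}")

log_event("SEV-1 declared; IC assigned")
log_event("Mitigation started: rolling back the last deploy")
log_event("Resolved; status page updated")
print("\n".join(timeline))
```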
▶ War rooms vs Slack threads for incidents: which is better? Do I really need a call?
Slack threads are async, searchable, and no one has to context-switch to a Zoom; they work for SEV-2/3. War rooms (video call + shared doc) give real-time coordination and instant escalation, with no waiting on async replies; use them for SEV-1. If diagnosis is complex (a distributed-systems mystery), a 10-person call wastes time. Better: 1 engineer debugging + 1 IC monitoring, then pair on the fix once the root cause is found. Incident.io, FireHydrant, or a Slack workflow can auto-create the channel, war room, and status page update when you declare an incident.
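A sketch of that auto-create flow using Slack's Web API via slack_sdk; the channel naming and kickoff message are assumptions, and in practice incident.io or FireHydrant bundle this for you:

```python
from slack_sdk import WebClient

def declare_incident(client: WebClient, incident_id: str, severity: str) -> str:
    """Create a dedicated incident channel and post the kickoff message."""
    resp = client.conversations_create(name=f"inc-{incident_id}")
    channel_id = resp["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=f"{severity} declared. IC please claim; war-room link to follow.",
    )
    return channel_id

# declare_incident(WebClient(token="xoxb-..."), "2031", "SEV-1")
```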
▶ What's the role of AI auto-remediation in 2026 incident response? Is it a game-changer?
Auto-remediation (e.g., auto-scaling in response to high CPU, auto-rolling back a bad deploy) can cut MTTR for known failure modes by 80-90%, but putting AI agents in the critical path carries its own risks. A better approach: AI-assisted diagnosis (alert context + logs → 'likely cause: memory leak in auth service') plus runbook suggestions, with humans approving and executing the fix. Some teams let AI auto-execute well-tested playbooks (e.g., 'restart pod', 'failover to replica'), but always with a human kill-switch. The 2026 trend is more observability platforms shipping AI-native features (Datadog, New Relic, Grafana). If your alert volume is >100/week, AI triage is worth the cost (~$500/mo); if it's <20/week, simpler tooling is fine.
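A sketch of that 'suggest, then gate' pattern: only pre-approved, well-tested playbooks may auto-execute, and a global kill-switch forces everything back to human review. The playbook names and confidence threshold are illustrative:

```python
APPROVED_PLAYBOOKS = {"restart_pod", "failover_to_replica"}
AUTO_REMEDIATION_ENABLED = True  # the human kill-switch

def handle_suggestion(playbook: str, confidence: float, execute) -> str:
    """Gate an AI-suggested remediation behind an allowlist and kill-switch."""
    if not AUTO_REMEDIATION_ENABLED:
        return f"kill-switch on: {playbook} queued for human review"
    if playbook in APPROVED_PLAYBOOKS and confidence >= 0.9:
        execute(playbook)
        return f"auto-executed {playbook}"
    return f"{playbook} needs human approval (confidence={confidence:.2f})"

print(handle_suggestion("restart_pod", 0.95, execute=print))   # auto-executed
print(handle_suggestion("drop_database", 0.99, execute=print))  # needs approval
```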