Short version: the incident metrics that matter most are MTTD (how fast you detect), MTTA (how fast someone acknowledges), MTTR (how fast you recover), and MTBF (how long between failures). Detection sits at the front of the chain — every minute you don't know about an outage is added directly to MTTR.
The four metrics
- MTTD: Mean Time To Detect. The average time from when an incident starts to when your monitoring notices it. This is the metric monitoring directly controls.
- MTTA: Mean Time To Acknowledge. The average time from an alert firing to a human acknowledging it. On-call schedules and escalation policies drive this down.
- MTTR: Mean Time To Recovery (or Repair/Resolve). The average time from when an incident starts to when service is restored, so it includes the time spent detecting and acknowledging. Some teams start the clock at detection instead; pick one definition and apply it consistently. The headline reliability metric.
- MTBF: Mean Time Between Failures. The average uptime between incidents. It measures how often things break, not how fast you fix them.
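To make the definitions concrete, here is a minimal sketch that computes all four metrics from incident timestamps. The `Incident` fields, the `incident_metrics` function, and the sample data are illustrative assumptions, not part of any particular tool. By default it measures MTTR from the start of the failure so that detection lag is included; pass `from_detection=True` if your team starts the clock at detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the failure actually began
    detected: datetime      # when monitoring noticed it
    acknowledged: datetime  # when a human acknowledged the alert
    recovered: datetime     # when service was restored


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def incident_metrics(incidents: list[Incident], from_detection: bool = False) -> dict[str, float]:
    """Return MTTD, MTTA, MTTR and MTBF in minutes (hypothetical helper)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = mean(_minutes(i.detected - i.started) for i in ordered)
    mtta = mean(_minutes(i.acknowledged - i.detected) for i in ordered)
    # Assumption: MTTR runs from failure start unless from_detection is set.
    mttr = mean(
        _minutes(i.recovered - (i.detected if from_detection else i.started))
        for i in ordered
    )
    # MTBF: uptime between one recovery and the start of the next failure.
    gaps = [_minutes(nxt.started - prev.recovered) for prev, nxt in zip(ordered, ordered[1:])]
    mtbf = mean(gaps) if gaps else float("nan")
    return {"mttd": mttd, "mtta": mtta, "mttr": mttr, "mtbf": mtbf}


# Made-up sample data: two incidents one day apart.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 7), datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 1),
             datetime(2024, 1, 2, 10, 4), datetime(2024, 1, 2, 10, 20)),
]
metrics = incident_metrics(incidents)
print(metrics)
```

With this sample data, the first incident's 5 minutes of detection lag counts toward its 30-minute recovery time, which is the point of measuring from the start of the failure.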
They form a timeline
A single incident runs: failure starts → detected (MTTD) → acknowledged (MTTA) → recovered (MTTR) → … → next failure (MTBF). The important insight is that the stages add up: time to detect, plus time to acknowledge, plus time to fix equals the total time to recovery. If detection can take up to 5 minutes because you poll every 5 minutes, that lag is baked into your total time-to-recovery no matter how fast your team is afterward.
Detection is the cheapest minute to win
Most teams pour effort into the response side (runbooks, on-call rotations, faster rollbacks), and those matter. But the single easiest place to shave minutes off MTTR is detection. With 5-minute polling, a failure can go unnoticed for up to 5 minutes, and about 2.5 minutes on average if failures start at random points between checks. Going to 1-second checks removes nearly all of that detection lag from every incident, before your team does anything.
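The arithmetic behind that claim can be sketched in a few lines. This assumes a failure is equally likely to start at any moment between two checks and that a single failed check counts as detection; `detection_lag_seconds` is a hypothetical helper for illustration.

```python
def detection_lag_seconds(poll_interval_s: float) -> tuple[float, float]:
    """Return (average, worst_case) lag between a failure and the next check.

    Assumes failures start uniformly at random within the polling interval
    and that one failed check is enough to detect the outage.
    """
    return poll_interval_s / 2, poll_interval_s


slow_avg, slow_worst = detection_lag_seconds(300)  # 5-minute polling
fast_avg, fast_worst = detection_lag_seconds(1)    # 1-second checks

saved_worst = slow_worst - fast_worst  # seconds saved in the worst case
saved_avg = slow_avg - fast_avg        # seconds saved on average
print(f"worst case saved: {saved_worst:.0f}s, average saved: {saved_avg:.1f}s")
```

Under these assumptions the worst-case saving is just under 5 minutes and the average saving is roughly 2.5 minutes per incident. Requiring several consecutive failed checks before alerting would add to both figures.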
That's also why false positives are dangerous: if fast polling pages people for blips that aren't real outages, MTTA quietly rises because responders stop trusting alerts. PingInsight confirms outages with multi-location quorum to keep the false-positive rate under 0.5%, so fast detection doesn't become alert fatigue.
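As a rough illustration of the quorum idea, the sketch below only confirms an outage when more than half of the checking locations report a failure. This is not PingInsight's actual algorithm; the `confirm_outage` function, the location names, and the 50% threshold are all assumptions chosen for the example.

```python
def confirm_outage(results: dict[str, bool], quorum: float = 0.5) -> bool:
    """Confirm an outage only if more than `quorum` of locations saw a failure.

    `results` maps a location name to True (check succeeded) or False
    (check failed). With no results there is nothing to confirm.
    """
    if not results:
        return False
    failures = sum(1 for ok in results.values() if not ok)
    return failures / len(results) > quorum


# A blip seen from one location out of three does not page anyone.
blip = {"us-east": False, "eu-west": True, "ap-south": True}
# A failure seen from two of three locations is treated as a real outage.
outage = {"us-east": False, "eu-west": False, "ap-south": True}

print(confirm_outage(blip), confirm_outage(outage))
```

The design trade-off is that a quorum filters out single-location network blips at the cost of waiting for results from several locations, which is cheap when checks run every second.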
How to improve each one
- Lower MTTD: monitor more often and from multiple locations; alert on the right SLIs.
- Lower MTTA: clear on-call schedules, escalation steps, and a single channel for alerts.
- Lower MTTR: runbooks, practiced rollbacks, and incident management with a timeline so everyone shares context.
- Raise MTBF: fix root causes via postmortems instead of just restarting.
Want to see what slow detection costs in dollars? Try the downtime cost calculator.