Downtime
Downtime is any period during which a service is unavailable or not responding correctly to users.
Plain-English definitions of the terms behind uptime, SLAs, and incident response.
An error budget is the amount of unreliability an SLO allows — the gap between your target and 100%.
A false positive is an alert for an outage that isn't real — often a transient network blip from one location.
Heartbeat monitoring expects a scheduled job to check in, and alerts you when the expected ping doesn't arrive.
MTBF (mean time between failures) is the average amount of uptime between incidents, a measure of how often things break.
MTTD (mean time to detect) is the average time from when an incident begins to when your monitoring notices it.
MTTR (mean time to recovery) is the average time from detecting an incident to restoring service, and is often treated as the headline reliability metric.
On-call is a rotation that designates who is responsible for responding to incidents at any given time.
An SLA (service level agreement) is a contractual promise about service reliability, usually with penalties if the target is missed.
An SLI (service level indicator) is a measurement of some aspect of service quality, such as the percentage of successful requests.
An SLO (service level objective) is the internal target a team holds a service to, such as 99.9% availability over 30 days.
A status page is a public page that shows a service's current operational status and incident history.
Synthetic monitoring simulates user actions — like a login or checkout flow — to test multi-step journeys.
“Three nines” means 99.9% uptime, which allows about 8 hours 45 minutes of downtime per year.
Uptime is the percentage of time a service is available and responding correctly over a given period.
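The uptime, error budget, and "three nines" definitions above all come down to the same arithmetic. The following is a minimal Python sketch of it, assuming availability is measured purely as time up versus time down; the function names error_budget and uptime_percent are illustrative and not part of any particular tool.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime an availability target allows over a window.

    slo_target is a fraction, e.g. 0.999 for 99.9% ("three nines").
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    # The budget is the gap between the target and 100%, scaled to the window.
    return window * (1.0 - slo_target)


def uptime_percent(downtime: timedelta, window: timedelta) -> float:
    """Uptime as a percentage of the window, given total downtime in it."""
    return 100.0 * (1.0 - downtime / window)
```

Under these assumptions, error_budget(0.999, timedelta(days=30)) comes to roughly 43 minutes, and the same target over a 365-day year comes to a little under 8 hours 46 minutes.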
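MTBF, MTTD, and MTTR as defined above can all be computed from three timestamps per incident. This is a rough sketch that follows this glossary's definitions (MTTR is measured from detection, not from the start of the incident); the Incident record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the service actually began failing
    detected: datetime  # when monitoring noticed
    resolved: datetime  # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Average time from incident start to detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Average time from detection to restored service."""
    return _mean([i.resolved - i.detected for i in incidents])


def mtbf(incidents: list[Incident], window_start: datetime,
         window_end: datetime) -> timedelta:
    """Total uptime in the window divided by the number of incidents."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return (window_end - window_start - downtime) / len(incidents)
```

All three functions assume at least one incident in the list and that every incident falls entirely inside the window passed to mtbf.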
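The heartbeat monitoring entry above reduces to one comparison: has the expected ping arrived within its schedule? A minimal sketch, assuming the monitor stores the time of the last ping; the grace parameter is an added assumption to tolerate jobs that run slightly late, and the function name is illustrative.

```python
from datetime import datetime, timedelta


def heartbeat_missed(last_ping: datetime, now: datetime,
                     interval: timedelta,
                     grace: timedelta = timedelta(0)) -> bool:
    """True when the expected check-in is overdue and an alert should fire."""
    return now - last_ping > interval + grace
```

For an hourly job with a five-minute grace period, a last ping at 12:00 would not be treated as missed until after 13:05.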