On-call scheduling: a practical guide

PingInsight Team · 4 min read

Short version: good on-call keeps the right person reachable without burning out the team. The essentials are a fair rotation, timezone-aware schedules, an escalation policy for when the first responder doesn't answer, and maintenance windows so planned work doesn't page anyone. Detection quality underpins all of it — people only trust on-call when alerts are real.

Build a fair rotation

Decide on a rotation cadence (weekly is common) and who's in it. Rotate fairly, account for time zones so nobody is paged at 3 a.m. on every shift, and allow overrides for holidays and sick days. The goal is coverage without resentment: an exhausted on-call engineer is a slow one.
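As a rough illustration of the cadence and override ideas above, here is a minimal Python sketch of a round-robin rotation. The function name on_call_for, the team names, and the dates are all hypothetical, and a real scheduler would also need to handle time zones and shift handoff times.

```python
from datetime import date, timedelta

def on_call_for(day, engineers, start, overrides=None, shift_days=7):
    """Return who is on call on `day` in a simple round-robin rotation."""
    overrides = overrides or {}
    if day in overrides:
        # An explicit override (holiday, sick day) always wins.
        return overrides[day]
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days
    return engineers[shift_index % len(engineers)]

# Hypothetical team and dates, for illustration only.
engineers = ["alice", "bikram", "chen"]
start = date(2024, 1, 1)                 # first day of the first shift
overrides = {date(2024, 1, 10): "chen"}  # chen covers for bikram that day

# Who holds the shift at the start of each of the first four weeks.
schedule = [
    on_call_for(start + timedelta(days=7 * week), engineers, start, overrides)
    for week in range(4)
]
```

Because the shift is derived from the date rather than stored, everyone gets the same number of shifts over time, and an override changes a single day without disturbing the rest of the rotation.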

Layer in escalation

A schedule says who's primary. An escalation policy says what happens when the primary doesn't acknowledge in time: notify them again, then page a secondary, then a manager. Ordered escalation steps with sensible intervals mean a missed alert doesn't become a missed outage. PingInsight supports escalations and on-call rotations on Business plans.
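The ordered steps described above can be modeled as a list of delays and targets. The sketch below is an assumption-laden example, not PingInsight's implementation: the Step class, the notified_so_far function, and the 5, 10, and 20 minute intervals are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    after_seconds: int  # seconds since the alert fired without an acknowledgement
    target: str         # who to notify at this step

# Hypothetical policy; the intervals and roles are illustrative only.
POLICY = [
    Step(0, "primary"),
    Step(300, "primary"),     # notify the primary again
    Step(600, "secondary"),
    Step(1200, "manager"),
]

def notified_so_far(policy, elapsed_seconds, acknowledged_at=None):
    """List the targets notified up to `elapsed_seconds` after the alert.

    Escalation stops at the moment of acknowledgement, so steps scheduled
    at or after `acknowledged_at` never fire.
    """
    fired = []
    for step in sorted(policy, key=lambda s: s.after_seconds):
        if step.after_seconds > elapsed_seconds:
            break
        if acknowledged_at is not None and step.after_seconds >= acknowledged_at:
            break
        fired.append(step.target)
    return fired
```

With this policy, an alert left unacknowledged for 700 seconds has reached the secondary, while one acknowledged at 400 seconds never leaves the primary.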

Suppress noise with maintenance windows

Planned deploys and migrations will trip your monitors. Schedule maintenance windows so alerts are suppressed during known work — and so status-page subscribers are notified in advance rather than surprised. This keeps the signal-to-noise ratio high, which is the whole game.
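Suppression during a window comes down to a time-range check before paging. The following sketch assumes windows are stored as (start, end) pairs of timezone-aware UTC datetimes; the names WINDOWS, in_maintenance, and should_page are hypothetical.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical window: a planned migration from 01:00 to 03:00 UTC.
WINDOWS = [
    (datetime(2024, 3, 2, 1, 0, tzinfo=UTC), datetime(2024, 3, 2, 3, 0, tzinfo=UTC)),
]

def in_maintenance(moment, windows):
    """True if `moment` falls inside any window (start inclusive, end exclusive)."""
    return any(start <= moment < end for start, end in windows)

def should_page(alert_time, windows):
    """Page only when the alert fires outside every maintenance window."""
    return not in_maintenance(alert_time, windows)
```

Keeping every timestamp in UTC avoids ambiguity around daylight saving changes. Converting to local time is only needed when showing the window to people.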

The foundation: alerts people trust

On-call only works if responders believe their pages. Two failure modes destroy that trust:

  • Too slow: minute-level polling means responders learn about outages from customers first, which makes on-call feel pointless.
  • Too noisy: naive fast polling fires on every blip, and people start ignoring alerts.

The fix is detection that is both fast and confirmed. 1-second checks catch incidents early, while multi-location quorum requires agreement from multiple regions, or several consecutive failures, before declaring a target DOWN, which keeps false positives under 0.5%. Fast alerts that are almost always real are the ones people actually answer.
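To make the confirmation idea concrete, here is one possible way to combine the two signals. This is a sketch under stated assumptions, not PingInsight's actual algorithm: it treats a region as failing only after a set number of consecutive failed checks, and declares DOWN only when a quorum of regions is failing. The function names, thresholds, and region labels are hypothetical.

```python
def region_failing(results, consecutive=3):
    """True if the region's most recent `consecutive` checks all failed.

    `results` is a list of booleans, oldest first, where True means the check passed.
    """
    recent = results[-consecutive:]
    return len(recent) == consecutive and not any(recent)

def is_down(results_by_region, quorum=2, consecutive=3):
    """Declare DOWN only when at least `quorum` regions are failing."""
    failing = sum(
        region_failing(results, consecutive)
        for results in results_by_region.values()
    )
    return failing >= quorum

# A single failed check in one region: a blip, not an outage.
blip = {
    "us-east": [True, True, False],
    "eu-west": [True, True, True],
    "ap-south": [True, True, True],
}

# Two regions with three consecutive failures each: a confirmed outage.
outage = {
    "us-east": [True, False, False, False],
    "eu-west": [False, False, False],
    "ap-south": [True, True, True],
}
```

Here is_down(blip) is False and is_down(outage) is True. With 1-second checks, waiting for three consecutive failures adds only a few seconds of delay, a small price for filtering out one-off network blips.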

A quick checklist

  • [ ] Defined rotation with fair, timezone-aware shifts
  • [ ] Override mechanism for time off
  • [ ] Escalation policy with multiple steps
  • [ ] Maintenance windows that suppress planned-work alerts
  • [ ] A single channel where alerts land
  • [ ] Confirmed, low-false-positive detection feeding it all

Learn more about incident management, or read about the incident metrics that on-call directly improves.
