On-Call

Rotation, escalation, incident-response protocol, and the tools an on-call engineer needs on hand.

On-call is the contract we sign with our users: if the platform breaks, someone is responsible for putting it back together, and that someone is paged promptly. This page describes how the rotation is structured, how incidents are handled, and the tools an on-call engineer should have open before their shift starts.

Rotation

Primary and secondary on-call shifts run in one-week blocks. The calendar is maintained in our paging system; pages route to the current primary and escalate automatically to the secondary if the primary does not acknowledge them.
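
The rotation and escalation rules above can be sketched in a few lines of Python. This is an illustration only, not the paging system's actual logic: the roster, the rotation start date, the choice of next week's primary as this week's secondary, and the 5-minute escalation timeout are all assumptions made for the example.

```python
from datetime import date


def on_call_pair(roster, rotation_start, today):
    """Return (primary, secondary) for the one-week block containing `today`.

    Assumption: the secondary is whoever is primary the following week.
    """
    if today < rotation_start:
        raise ValueError("today is before the rotation start")
    week_index = (today - rotation_start).days // 7
    primary = roster[week_index % len(roster)]
    secondary = roster[(week_index + 1) % len(roster)]
    return primary, secondary


def page_target(primary, secondary, acknowledged, minutes_since_page,
                ack_timeout_minutes=5):
    """Return who should currently hold the page.

    Assumption: an unacknowledged page escalates after `ack_timeout_minutes`.
    """
    if acknowledged or minutes_since_page < ack_timeout_minutes:
        return primary
    return secondary


# Hypothetical roster and dates, for illustration only.
roster = ["ana", "ben", "chi"]
primary, secondary = on_call_pair(roster, date(2024, 1, 1), date(2024, 1, 10))
print(primary, secondary)
print(page_target(primary, secondary, acknowledged=False, minutes_since_page=6))
```

In practice the paging system owns this logic; the sketch is only useful for reasoning about who gets paged when you verify the escalation chain.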

Before your shift

  1. Skim the last two weeks of incidents. Patterns recur, and knowing what happened the last time an alert fired is usually the fastest lead.
  2. Confirm paging works. Send yourself a test page; verify the escalation chain.
  3. Verify dashboard access. Observability dashboards, feature-flag console, deploy dashboard, Cloud Run console, database console.
  4. Review recent deploys. A page five minutes after a deploy is almost certainly about the deploy.
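
The deploy check in step 4 can be made mechanical. The sketch below lists deploys that finished shortly before an alert, newest first. The record shape (`service`, `finished_at`) and the 30-minute window are hypothetical; adapt them to whatever the deploy dashboard actually exposes.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, alert_time, window=timedelta(minutes=30)):
    """Return deploys that finished within `window` before the alert, newest first."""
    recent = [
        d for d in deploys
        if timedelta(0) <= alert_time - d["finished_at"] <= window
    ]
    return sorted(recent, key=lambda d: d["finished_at"], reverse=True)


# Hypothetical deploy records, for illustration only.
deploys = [
    {"service": "api", "finished_at": datetime(2024, 1, 1, 12, 0)},
    {"service": "worker", "finished_at": datetime(2024, 1, 1, 11, 0)},
    {"service": "web", "finished_at": datetime(2024, 1, 1, 12, 10)},
]
alert_time = datetime(2024, 1, 1, 12, 5)

for deploy in suspect_deploys(deploys, alert_time):
    print(deploy["service"], deploy["finished_at"].isoformat())
```

A deploy that finished after the alert fired is excluded, since it cannot have caused the page.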

When you are paged

  1. Acknowledge within 5 minutes. Even if you are not ready to act, acknowledging stops escalation.
  2. Open the incident channel. The paging system creates one automatically; post your initial assessment there.
  3. Localise, don't rebuild. Use the Troubleshooting page to find the diagnostic tree that matches the symptom. Do not write new code during an incident unless no existing mitigation, such as a rollback or a feature flag, resolves it.
  4. Apply the relevant runbook. If none exists, write one during the postmortem.
  5. Escalate when stuck. 30 minutes without progress is the soft threshold. Call the secondary; call the service owner; call the service leader.
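
The escalation rule in step 5 can be sketched as a small decision function. The 30-minute threshold comes from the step itself; treating the three contacts as a strict sequence (secondary, then service owner, then service leader) is an assumption of this sketch, as are the role labels.

```python
SOFT_THRESHOLD_MINUTES = 30

# Assumed ordering; the protocol lists these contacts but does not rank them.
ESCALATION_ORDER = ["secondary", "service owner", "service leader"]


def next_escalation(minutes_without_progress, already_contacted=()):
    """Return the next role to call, or None if no call is due yet."""
    if minutes_without_progress < SOFT_THRESHOLD_MINUTES:
        return None
    for role in ESCALATION_ORDER:
        if role not in already_contacted:
            return role
    return None  # everyone on the list has already been contacted


print(next_escalation(10))
print(next_escalation(30))
print(next_escalation(45, already_contacted={"secondary"}))
```

The threshold is soft: nothing stops you from calling earlier if the incident is clearly beyond what one person can handle.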

Communication

The incident channel is the record. Post:

  • What you saw (the symptom).
  • What you checked (the diagnostic path).
  • What you did (the mitigation).
  • Who else is involved.

One line every few minutes is better than radio silence. Other engineers read the channel to decide whether to jump in; absence of updates reads as "this is handled" when it may not be.
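
As a sketch of what a consistent update might look like, the four items above map onto a small record that renders to one short channel message. The class name, field names, and output wording are hypothetical; nothing here is tied to a particular chat or paging tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentUpdate:
    symptom: str          # what you saw
    checked: List[str]    # the diagnostic path so far
    mitigation: str       # what you did; empty if nothing yet
    involved: List[str]   # who else is involved

    def render(self) -> str:
        """Format the update as four short lines for the incident channel."""
        lines = [
            f"Saw: {self.symptom}",
            f"Checked: {'; '.join(self.checked) or 'nothing yet'}",
            f"Did: {self.mitigation or 'no mitigation yet'}",
            f"Involved: {', '.join(self.involved) or 'just me'}",
        ]
        return "\n".join(lines)


# Hypothetical incident details, for illustration only.
update = IncidentUpdate(
    symptom="5xx spike on checkout",
    checked=["deploy dashboard", "error-rate dashboard"],
    mitigation="",
    involved=["@sam"],
)
print(update.render())
```

Rendering an explicit "no mitigation yet" keeps a quiet channel from being read as a handled incident.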

After the incident

  • Close the page. Confirm the alert is cleared.
  • Open a postmortem ticket. Use the blameless postmortem template; name the specific reliability assumption that was invalidated.
  • File action items. One concrete, closable ticket per action. "Be more careful" is not an action item.
  • Update the runbook. If the runbook missed a step, fix it while the experience is fresh.

Tools every on-call engineer should have ready

  • Observability dashboards, pinned per service.
  • Deploy dashboard with rollback on hand.
  • Feature-flag console with write access.
  • Cloud Run console with per-service revision access.
  • Database console (read-only by default; write access only on demand, with an audit trail).
  • The team's runbook index.

Related pages

  • Reliability: the SLO and error-budget model that shapes what gets paged.
  • Troubleshooting: diagnostic trees for common symptoms.
  • Runbooks: step-by-step recovery procedures.
