Runbooks
Step-by-step recovery procedures for known failure modes.
A runbook is a script. It is written so a tired, stressed engineer can follow it at 3am and restore service without having to reason from first principles. Each runbook in this section targets a specific, recognisable failure symptom and walks through detection, diagnosis, mitigation, and recovery.
Runbook authoring
New runbooks are welcome: every incident we resolve should leave the team with one. Start from this template:
# Runbook: <symptom>
**Owner:** <team>
**Last tested:** YYYY-MM-DD
**Pager rule:** <alert name>
## Goal
Restore <X> when <Y>.
## Detection
How to confirm this is the failure you are hitting.
## Diagnosis
Fast checks to localise the fault.
## Mitigation
Immediate actions to restore user-facing health.
## Recovery
Steps to return to a fully healthy state.
## Rollback
How to undo each state-changing step.
## Escalation
When and whom to escalate to.
## Postmortem
Link to the incident doc once one exists.
Available runbooks
The catalogue is populated as real incidents drive new runbooks. Writing a runbook "just in case" is usually wasted effort; writing one in the follow-up from an actual incident captures the specific, sharp-edged lessons a generic version would miss.
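As the catalogue grows, a small check can help keep new runbooks aligned with the template above. The following is a minimal sketch, not an existing tool: the function name `lint_runbook` and the 180-day staleness threshold for "Last tested" are assumptions made for illustration, and only the field and section names come from the template.

```python
import re
from datetime import date

# Section headings and header fields taken from the runbook template.
REQUIRED_SECTIONS = [
    "Goal", "Detection", "Diagnosis", "Mitigation",
    "Recovery", "Rollback", "Escalation", "Postmortem",
]
REQUIRED_FIELDS = ["Owner", "Last tested", "Pager rule"]


def lint_runbook(text: str, today: date, max_age_days: int = 180) -> list[str]:
    """Return a list of problems found in a runbook; an empty list means it passes.

    max_age_days is a hypothetical staleness threshold, not a team policy.
    """
    problems = []

    # The title line must name a symptom.
    if not re.search(r"^# Runbook: \S", text, re.MULTILINE):
        problems.append("missing '# Runbook: <symptom>' title")

    # Each header field must be present with a non-empty value on the same line.
    for field in REQUIRED_FIELDS:
        if not re.search(rf"^\*\*{re.escape(field)}:\*\* *\S", text, re.MULTILINE):
            problems.append(f"missing field: {field}")

    # Every template section must appear as a second-level heading.
    headings = re.findall(r"^## (.+?)\s*$", text, re.MULTILINE)
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")

    # "Last tested" must be a real date and must not be too old.
    match = re.search(r"^\*\*Last tested:\*\* *(\S+)", text, re.MULTILINE)
    if match:
        try:
            tested = date.fromisoformat(match.group(1))
        except ValueError:
            problems.append("Last tested is not a valid YYYY-MM-DD date")
        else:
            if (today - tested).days > max_age_days:
                problems.append(f"last tested more than {max_age_days} days ago")

    return problems
```

A check like this could run in review or CI by reading each runbook file and calling `lint_runbook(text, today=date.today())`. It only confirms that the structure is present; whether the steps actually work still has to be established by testing the runbook.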
See On-Call for rotation logistics and Troubleshooting for exploratory diagnostic trees.