Runbooks

A runbook is a script: a fixed sequence of steps written so that a tired, stressed engineer can follow it at 3 a.m. and restore service without reasoning from first principles. Each runbook in this section targets a specific, recognisable failure symptom and walks through detection, diagnosis, mitigation, and recovery.

Runbook authoring

New runbooks are welcome; every incident we resolve should leave the team with one. Start from this template:

# Runbook: <symptom>

**Owner:** <team>
**Last tested:** YYYY-MM-DD
**Pager rule:** <alert name>

## Goal
Restore <X> when <Y>.

## Detection
How to confirm this is the failure you are hitting.

## Diagnosis
Fast checks to localise the fault.

## Mitigation
Immediate actions to restore user-facing health.

## Recovery
Steps to return to a fully healthy state.

## Rollback
How to undo each state-changing step.

## Escalation
When and whom to escalate to.

## Postmortem
Link to the incident doc once one exists.
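
A draft can be checked against the template mechanically before review. The following is a minimal sketch, not an existing tool: it assumes runbooks are stored as Markdown text that uses the template's `## ` headings and `**Last tested:**` field, and the helper names `missing_sections` and `last_tested` are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional

# Section headings copied from the template above.
REQUIRED_SECTIONS = [
    "Goal", "Detection", "Diagnosis", "Mitigation",
    "Recovery", "Rollback", "Escalation", "Postmortem",
]

# Matches a level-2 Markdown heading such as "## Goal".
_H2 = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)

# Matches a filled-in "**Last tested:** 2024-01-15" line.
_LAST_TESTED = re.compile(
    r"^\*\*Last tested:\*\*[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE
)


def missing_sections(text: str) -> List[str]:
    """Return template sections with no matching '## <name>' heading."""
    found = {m.group(1) for m in _H2.finditer(text)}
    return [name for name in REQUIRED_SECTIONS if name not in found]


def last_tested(text: str) -> Optional[date]:
    """Return the 'Last tested' date, or None if absent or still a placeholder."""
    m = _LAST_TESTED.search(text)
    if m is None:
        return None
    try:
        return date.fromisoformat(m.group(1))
    except ValueError:
        # Digits in the right shape but not a real calendar date.
        return None


if __name__ == "__main__":
    # Hypothetical, deliberately incomplete draft.
    draft = (
        "# Runbook: example symptom\n\n"
        "**Last tested:** 2024-01-15\n\n"
        "## Goal\nRestore X when Y.\n\n"
        "## Detection\nConfirm the alert.\n"
    )
    print("missing:", missing_sections(draft))
    print("last tested:", last_tested(draft))
```

A check like this only confirms that the headings exist; whether the steps under them actually work still has to be established by testing the runbook, which is what the "Last tested" field records.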

Available runbooks

The catalogue is populated as real incidents drive new runbooks. Writing a runbook "just in case" is usually wasted effort; writing one as follow-up to an actual incident captures the specific, sharp-edged lessons a generic version would miss.

See On-Call for rotation logistics and Troubleshooting for exploratory diagnostic trees.
