Operations
Runbooks, on-call procedures, and troubleshooting guides for engineers responsible for keeping Wordloop up.
The Operations section is written for the person staring at a red graph at 3am, or the one who will be one day. It differs from Guides: a guide walks you through a happy-path operation you want to perform; a runbook walks you through a degraded state you have to respond to.
When to use this section
Troubleshooting
Common failure symptoms and how to localise them, organised as service-by-service diagnostic trees.
On-Call
Rotation, escalation, incident-response protocol, and the tools an on-call engineer needs on hand.
Runbooks
Step-by-step recovery procedures for known failure modes.
Writing for 3am
Operational documentation has a harsh audience: a stressed engineer under time pressure. That sets a high bar, and every runbook is expected to meet it:
- State the goal at the top. Every runbook begins with "This runbook restores X when Y."
- Number the steps. Imperative sentences. Exact commands, exact flags, exact expected output.
- Include rollback. Every step that changes state must explain how to undo it.
- Link to observability. Every step that checks state must link to the dashboard that proves it.
- Close with escalation. If the runbook fails, who or what is next?
See Engineering Principles / Reliability for why we hold this bar.
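The checklist above is mechanical enough that a draft runbook could be checked against it before review. The following is a minimal sketch, assuming a runbook is modelled as structured data. The `Step` and `Runbook` classes, their field names, the `lint` function, and the example dashboard URL are all hypothetical and are not part of Wordloop's tooling.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One numbered runbook step (hypothetical model, for illustration only)."""
    text: str                    # imperative instruction with the exact command
    changes_state: bool = False  # True if the step mutates the system
    checks_state: bool = False   # True if the step inspects the system
    rollback: str = ""           # how to undo the step
    dashboard_url: str = ""      # dashboard that proves the checked state


@dataclass
class Runbook:
    goal: str
    steps: List[Step]
    escalation: str = ""         # who or what is next if the runbook fails


# The goal sentence is expected to follow "This runbook restores X when Y."
GOAL_PATTERN = re.compile(r"^This runbook restores .+ when .+\.$")


def lint(runbook: Runbook) -> List[str]:
    """Return a list of checklist violations; an empty list means none found."""
    problems = []
    if not GOAL_PATTERN.match(runbook.goal):
        problems.append('goal: must read "This runbook restores X when Y."')
    if not runbook.steps:
        problems.append("steps: runbook has no steps")
    for number, step in enumerate(runbook.steps, start=1):
        if step.changes_state and not step.rollback.strip():
            problems.append(f"step {number}: changes state but has no rollback")
        if step.checks_state and not step.dashboard_url.strip():
            problems.append(f"step {number}: checks state but links no dashboard")
    if not runbook.escalation.strip():
        problems.append("escalation: runbook does not say who or what is next")
    return problems


# Example draft with two deliberate gaps: step 2 has no rollback,
# and the escalation section is empty.
draft = Runbook(
    goal="This runbook restores queue throughput when workers stop consuming.",
    steps=[
        Step(
            "Check the consumer lag panel.",
            checks_state=True,
            dashboard_url="https://dashboards.example.com/queue-lag",
        ),
        Step("Restart the worker deployment.", changes_state=True),
    ],
)

for problem in lint(draft):
    print(problem)
```

A check like this only catches structural gaps. It cannot tell whether a command is correct or whether the expected output is exact, so it would complement review by someone who has run the procedure rather than replace it.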