Troubleshooting
Common failure symptoms and how to localise them — service-by-service diagnostic trees.
This page is for the "something feels off" moment, before you know which runbook to follow. It is a set of diagnostic trees — start from the symptom you can see, follow the branch that narrows the cause, then consult the matching runbook or escalate.
Symptom: the frontend is blank after sign-in
- Check the browser console. Look for 401/403 from wordloop-core → Clerk token issue. Look for 5xx → backend issue.
- Check the Core service health. Hit /healthz on Core. If it responds, the backend is up; the problem is in auth or in the specific call the app makes first.
- Check JWT verification logs on Core for the incoming request. A mismatch between the Clerk environment and the Core configuration will produce "token signature does not verify" here.
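The first branch of this tree can be sketched as a small classifier. This is a minimal illustration, not part of the Wordloop tooling: the function name triage_status and the labels it returns are hypothetical, and only the status-code mapping comes from the steps above.

```python
def triage_status(status: int) -> str:
    """Map an HTTP status seen in the browser console to a branch of the tree.

    The labels are illustrative only; they are not emitted by any service.
    """
    if status in (401, 403):
        # Auth branch: suspect the Clerk token or Core's JWT verification config.
        return "auth"
    if 500 <= status <= 599:
        # Backend branch: check Core service health next.
        return "backend"
    # Anything else is outside this tree; keep looking at the failing call.
    return "other"


if __name__ == "__main__":
    for code in (401, 403, 502, 404):
        print(code, triage_status(code))
```

A 404 or other 4xx falls outside both branches on purpose: the tree only distinguishes auth failures from backend failures, so anything else needs a closer look at the specific request.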
Symptom: transcription lag is spiking
- Check the ML service trace. Filter for transcribe.turn spans with latency > SLO. If the model call itself is slow, the model provider or network is the cause.
- Check the model-client adapter logs. Rate-limit responses from the provider surface here.
- Check the audio queue depth. If the queue is deep, consumers are not keeping up — scale the ML workers or investigate a backpressure signal.
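If you have exported spans from the tracing backend, the filter in the first step can be sketched as below. The Span shape, its field names, and the 800 ms SLO used in the example are assumptions made for illustration; only the span name transcribe.turn comes from this page. Use the real SLO and the export format of your tracing tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Span:
    # Hypothetical, simplified span record; real exports carry more fields.
    name: str
    duration_ms: float


def slow_turns(spans: Iterable[Span], slo_ms: float) -> List[Span]:
    """Return transcribe.turn spans whose duration exceeds the SLO, slowest first."""
    over = [s for s in spans if s.name == "transcribe.turn" and s.duration_ms > slo_ms]
    return sorted(over, key=lambda s: s.duration_ms, reverse=True)


if __name__ == "__main__":
    sample = [
        Span("transcribe.turn", 420.0),
        Span("transcribe.turn", 1900.0),
        Span("audio.enqueue", 2500.0),  # different span name, ignored
        Span("transcribe.turn", 950.0),
    ]
    # 800 ms is a placeholder SLO, not the real target.
    for span in slow_turns(sample, slo_ms=800.0):
        print(span.name, span.duration_ms)
```

Sorting slowest first puts the worst offenders at the top, which is usually where a provider or network problem is easiest to spot.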
Symptom: WebSocket connections drop repeatedly
- Check the gateway logs for timeout errors — that usually indicates a platform-layer idle timeout below our expected session length.
- Check the client reconnect pattern. A flood of reconnects from one client suggests a client-side bug; a broader pattern suggests a server-side issue.
- Check for BACKPRESSURE_SHED error frames. If clients are being shed, the server is overloaded; check the SLO dashboard.
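The reconnect-pattern check can be roughed out from a list of client IDs, one entry per reconnect event. This is a sketch under stated assumptions: the function name, the input format, and the 0.8 dominance threshold are all hypothetical, and the right threshold depends on normal traffic.

```python
from collections import Counter
from typing import Iterable


def reconnect_pattern(client_ids: Iterable[str], dominance: float = 0.8) -> str:
    """Classify reconnect events as coming from one client or from many.

    client_ids holds one entry per reconnect event. The dominance threshold
    is an assumed cut-off, not a value taken from the gateway configuration.
    """
    events = list(client_ids)
    if not events:
        return "none"
    counts = Counter(events)
    _, top_count = counts.most_common(1)[0]
    share = top_count / len(events)
    # One client producing most reconnects points at a client-side bug;
    # reconnects spread across clients point at the server side.
    return "single-client" if share >= dominance else "broad"


if __name__ == "__main__":
    print(reconnect_pattern(["c1"] * 9 + ["c2"]))
    print(reconnect_pattern(["c1", "c2", "c3", "c4"]))
```

With very few events the classification is unreliable, so treat the result as a hint about which branch to check first rather than a verdict.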
Symptom: deploys are failing in CI
- Check the CI logs for the failing step. Most failures are one of: tests broke, image build broke, vulnerability scan flagged a dependency.
- If tests broke, run them locally (./dev test <service>). A flaky test should be fixed, not retried.
- If the image build broke, the cause is often a Dockerfile layer change or a base-image update. The CI log shows the failing layer.
- If the vulnerability scan flagged a dependency, the dependency audit is doing its job. Upgrade the dependency or add a justified waiver.
When to move to a runbook
If you have localised the symptom to a known failure mode (database slow, cache cold, model provider degraded, Pub/Sub backed up), move to the corresponding runbook for the recovery procedure.
When to escalate
- Symptom is user-visible and you cannot localise it within 10 minutes.
- Symptom involves suspected security or privacy breach — escalate immediately (Security, Privacy).
- Symptom is a novel failure mode not covered by any runbook. Document it in the postmortem for future detection.
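The three criteria above can be read as a single decision rule, sketched here. The function name, its parameters, and the returned labels are hypothetical; the 10-minute limit and the "immediately" rule for suspected breaches come from the list above.

```python
def escalation_decision(
    user_visible: bool,
    localised: bool,
    minutes_spent: float,
    suspected_breach: bool,
    covered_by_runbook: bool,
) -> str:
    """Return 'immediately', 'escalate', or 'continue' for the current symptom."""
    if suspected_breach:
        # Security or privacy suspicion overrides every other consideration.
        return "immediately"
    if user_visible and not localised and minutes_spent >= 10:
        return "escalate"
    if not covered_by_runbook:
        # Novel failure mode: escalate, and record it in the postmortem.
        return "escalate"
    return "continue"


if __name__ == "__main__":
    print(escalation_decision(True, False, 12, False, True))
    print(escalation_decision(False, True, 3, False, True))
```

The breach check comes first deliberately, so that no combination of the other inputs can delay a security or privacy escalation.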
See On-Call for the escalation tree.