Troubleshooting

Common failure symptoms and how to localise them — service-by-service diagnostic trees.

This page is for the "something feels off" moment, before you know which runbook to follow. It is a set of diagnostic trees — start from the symptom you can see, follow the branch that narrows the cause, then consult the matching runbook or escalate.

Symptom: the frontend is blank after sign-in

  1. Check the browser console. A 401 or 403 from wordloop-core points to a Clerk token issue; a 5xx points to a backend issue.
  2. Check the Core service health. Hit /healthz on Core. If it responds, the backend is up; the problem is in auth or in the specific call the app makes first.
  3. Check JWT verification logs on Core for the incoming request. A mismatch between the Clerk environment and the Core configuration will produce "token signature does not verify" here.
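
The first and third steps can be sketched in a few lines of Python. This is a minimal illustration, not Core's actual verification code: the function names are hypothetical, and it assumes an environment mismatch shows up as an `iss` claim that differs from the issuer Core is configured for. The token is decoded without verifying its signature, so use it only to inspect a token while diagnosing, never to authorise a request.

```python
import base64
import json


def classify_status(status: int) -> str:
    """Map an HTTP status seen in the browser console to the branch to follow."""
    if status in (401, 403):
        return "auth"      # points to a Clerk token issue
    if 500 <= status <= 599:
        return "backend"   # points to a backend issue
    return "other"


def unverified_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (diagnostics only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT: expected three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def issuer_matches(token: str, expected_issuer: str) -> bool:
    """True if the token's `iss` claim equals the issuer we expect (assumed config value)."""
    return unverified_claims(token).get("iss") == expected_issuer
```

If `issuer_matches` returns False for a token copied from the failing request, the frontend and Core are probably pointed at different Clerk environments, which would be consistent with the "token signature does not verify" log line.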

Symptom: transcription lag is spiking

  1. Check the ML service trace. Filter for transcribe.turn spans with latency > SLO. If the model call itself is slow, the model provider or network is the cause.
  2. Check the model-client adapter logs. Rate-limit responses from the provider surface here.
  3. Check the audio queue depth. If the queue is deep, consumers are not keeping up — scale the ML workers or investigate a backpressure signal.
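
As a rough sketch of steps 1 and 3, the snippet below filters exported spans for `transcribe.turn` calls over the SLO and checks whether a series of queue-depth samples is trending upward. The `Span` shape, the function names, and the "never decreases and ends higher" rule are assumptions made for illustration; adapt them to whatever your tracing and metrics tools actually export.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str           # span name as exported by the tracer (assumed field)
    duration_ms: float  # span latency in milliseconds (assumed field)


def slow_turns(spans: list[Span], slo_ms: float) -> list[Span]:
    """Return the transcribe.turn spans whose latency exceeds the SLO."""
    return [s for s in spans if s.name == "transcribe.turn" and s.duration_ms > slo_ms]


def queue_is_growing(depth_samples: list[int]) -> bool:
    """True if depth never decreases across samples and ends higher than it started."""
    if len(depth_samples) < 2:
        return False
    pairs = zip(depth_samples, depth_samples[1:])
    non_decreasing = all(later >= earlier for earlier, later in pairs)
    return non_decreasing and depth_samples[-1] > depth_samples[0]
```

Slow spans with a flat queue suggest the model call itself is the bottleneck; a growing queue suggests consumers are not keeping up.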

Symptom: WebSocket connections drop repeatedly

  1. Check the gateway logs for timeout errors — that usually indicates a platform-layer idle timeout below our expected session length.
  2. Check the client reconnect pattern. A flood of reconnects from one client suggests a client-side bug; a broader pattern suggests a server-side issue.
  3. Check for BACKPRESSURE_SHED error frames. If clients are being shed, the server is overloaded — check the SLO dashboard.
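
Step 2 comes down to asking whether one client accounts for most of the reconnects. A minimal sketch, assuming you can extract one client ID per reconnect event from the gateway logs; the function name and the 0.8 dominance threshold are illustrative choices, not values from this system.

```python
from collections import Counter


def reconnect_pattern(client_ids: list[str], dominance: float = 0.8) -> str:
    """Classify reconnect events by how concentrated they are in a single client.

    client_ids holds one entry per reconnect event.
    """
    if not client_ids:
        return "none"
    counts = Counter(client_ids)
    _, top_count = counts.most_common(1)[0]
    if top_count / len(client_ids) >= dominance:
        return "client-side"   # one client produces most of the reconnects
    return "server-side"       # reconnects are spread across many clients
```

With only a handful of events the result is not meaningful, so collect a reasonable window of log lines before drawing a conclusion.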

Symptom: deploys are failing in CI

  1. Check the CI logs for the failing step. Most failures are one of: tests broke, image build broke, vulnerability scan flagged a dependency.
  2. If tests broke, run them locally (./dev test <service>) — a flaky test should be fixed, not retried.
  3. If the image build broke, the cause is often a Dockerfile layer change or a base-image update. The CI log shows the failing layer.
  4. If the vulnerability scan flagged a dependency, the dependency audit is doing its job. Upgrade the dependency or add a justified waiver.
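
For step 2, one way to tell a flaky test from a genuinely broken one is to run the same command several times locally and count the failures. The helper below is a hypothetical sketch, not part of the `./dev` tooling; it assumes the test command exits non-zero on failure. The goal is to confirm flakiness so the test can be fixed, not to retry until it passes.

```python
import subprocess


def count_failures(cmd: list[str], runs: int = 5) -> int:
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures


def verdict(failures: int, runs: int) -> str:
    """Summarise repeated runs of the same test command."""
    if failures == 0:
        return "passes locally"        # look at CI-only differences instead
    if failures == runs:
        return "consistently failing"  # a real regression
    return "flaky"                     # fix the test rather than retrying it
```

For example, `verdict(count_failures(["./dev", "test", service]), 5)`, where `service` is the name of the service whose tests failed in CI.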

When to move to a runbook

If you have localised the symptom to a known failure mode (database slow, cache cold, model provider degraded, Pub/Sub backed up), move to the corresponding runbook for the recovery procedure.

When to escalate

  • Symptom is user-visible and you cannot localise it within 10 minutes.
  • Symptom involves suspected security or privacy breach — escalate immediately (Security, Privacy).
  • Symptom is a novel failure mode not covered by any runbook. Document it in the postmortem for future detection.

See On-Call for the escalation tree.
