Stateful containers for the ML service

We run wordloop-ml as long-lived orchestrated containers rather than serverless functions to keep models warm for real-time transcription.

0003 — Stateful containers for wordloop-ml

Status: Accepted
Date: 2026-04-19
Deciders: ml platform
Supersedes:
Superseded by:

Context

The ML service is responsible for real-time transcription of live Meeting audio, MeetingSynthesis generation from finalised Transcriptions, and embedding generation for retrieval. The transcription path is latency-critical: from the moment a person speaks to the moment the caption renders, the user-perceived budget is under one second.

Serverless function platforms (Lambda, Cloud Run with scale-to-zero, Vercel Edge) are excellent for bursty, stateless workloads with tolerant latency budgets. They are a poor fit for a workload with these properties:

  1. Large model weights must be loaded into memory (several hundred MB to several GB).
  2. Streaming audio frames requires connection-level state.
  3. Cold starts measured in seconds are unacceptable, because they translate directly into user-visible silence during a live meeting.

A cold start of five to ten seconds on the first segment of a Meeting destroys the real-time experience. Warm-up pings mitigate but do not eliminate this, and the cost of keeping a serverless function permanently warm approaches the cost of a dedicated container.

Decision

Run wordloop-ml as long-lived FastAPI workers inside orchestrated containers. Models are loaded at container start and remain resident across requests. The container is the unit of scaling — we scale horizontally by adding more containers, not by spinning up more cold functions.

Consequences

Models stay warm. The first segment of a Meeting transcribes with the same latency as the hundredth. No cold-start penalty on the user-visible path.

Streaming state is preserved. An audio stream's position, rolling buffer, and partial transcription state live in the container that handles the stream. No cross-invocation state-reconstruction step.
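As a rough sketch of what "state lives in the container" means, the per-stream record below keeps the position, a bounded rolling buffer, and the partial transcription in process memory. The names `StreamState`, `streams`, `on_frame`, and the window size of 50 frames are assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Per-stream state held in the memory of the container serving the stream."""
    max_frames: int = 50        # hypothetical rolling-window size
    position: int = 0           # total frames received so far
    partial_text: str = ""      # latest unfinalised transcription
    buffer: deque = field(init=False)

    def __post_init__(self) -> None:
        # A bounded deque discards the oldest frame once the window is full.
        self.buffer = deque(maxlen=self.max_frames)

    def push(self, frame: bytes) -> None:
        self.buffer.append(frame)
        self.position += 1


# One entry per open stream, keyed by a hypothetical stream id.
streams: dict[str, StreamState] = {}


def on_frame(stream_id: str, frame: bytes) -> StreamState:
    state = streams.get(stream_id)
    if state is None:
        state = streams[stream_id] = StreamState()
    state.push(frame)
    return state
```

Because the dictionary lives in the worker process, the next frame of the same stream finds its state without any reconstruction step, provided the connection stays pinned to that container.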

Operational posture matches a normal service. The ML service has rolling deploys, health checks, graceful shutdown, and horizontal scaling — the same operational shape as wordloop-core. On-call engineers use the same mental model.

We pay for idle capacity. A serverless model would scale to zero at night; our containers do not. At current traffic this is cheaper than the alternative (warm-keeping costs in a serverless model exceed the dedicated container cost), but the crossover point will change with usage patterns.

Alternatives considered

  • Lambda / Cloud Functions with scale-to-zero. Rejected for cold-start latency on the transcription hot path.
  • Cloud Run with always-on minimum instances. Considered, and a reasonable alternative. We chose explicit container orchestration because it also handles the streaming-state requirement cleanly; Cloud Run's per-request model is awkward for long-lived streaming connections such as WebSockets. Revisit if Cloud Run's streaming support matures.
  • Dedicated GPU nodes. Not yet required — our current model mix runs adequately on CPU. If we adopt models that demand GPU inference, the decision to run stateful containers still holds; we add GPU node pools.
  • Batch transcription only (no real-time path). Rejected as a product decision — live transcription is a core Wordloop feature.

Debt annotation

Principal: Moderate. Operating a stateful service means we handle graceful shutdown, connection draining, and rolling-deploy choreography ourselves. This is well-trodden ground and our Go core already does the same.
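Connection draining can be sketched with `asyncio` as follows. This is an illustrative outline under stated assumptions, not the service's shutdown code: the `Drainer` class and its method names are hypothetical, and in a real deployment `drain` would be triggered by the orchestrator's termination signal.

```python
import asyncio


class Drainer:
    """Tracks open streams so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self.accepting = True
        self._active: set[asyncio.Task] = set()

    def start(self, coro) -> asyncio.Task:
        """Register a stream handler; refuse new work once draining began."""
        if not self.accepting:
            coro.close()  # avoid a "never awaited" warning
            raise RuntimeError("draining: not accepting new streams")
        task = asyncio.create_task(coro)
        self._active.add(task)
        task.add_done_callback(self._active.discard)
        return task

    async def drain(self, timeout: float) -> int:
        """Stop accepting work, wait up to `timeout` seconds, cancel the rest.

        Returns the number of streams that had to be cancelled.
        """
        self.accepting = False
        if not self._active:
            return 0
        _, pending = await asyncio.wait(self._active, timeout=timeout)
        for task in pending:
            task.cancel()
        if pending:
            # Wait for cancellation to complete before the process exits.
            await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)
```

The timeout is the design choice worth noting: it bounds how long a rolling deploy waits for a long meeting before the old container is allowed to exit.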

Interest: Steady. Container images must be rebuilt when model weights or the Python runtime update; that is a normal CI cost.

Multiplier: Model size. If model weights grow past what fits comfortably in a container's memory budget (low single-digit GB), we may need to split inference into a dedicated model-serving layer (Triton, Ray Serve) fronted by thin FastAPI workers. The service boundary stays the same; the implementation changes.
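One way to picture "the service boundary stays the same; the implementation changes" is an interface with two interchangeable backends. The sketch below is hypothetical throughout: `Transcriber`, `InProcessTranscriber`, `RemoteTranscriber`, and the `send` transport callable are illustrative names, and no real Triton or Ray Serve client API is shown.

```python
from typing import Callable, Protocol


class Transcriber(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class InProcessTranscriber:
    """Today's shape: weights are resident in the worker's own memory."""

    def __init__(self, model: Callable[[bytes], str]) -> None:
        self._model = model

    def transcribe(self, audio: bytes) -> str:
        return self._model(audio)


class RemoteTranscriber:
    """Possible future shape: a thin worker forwards to a model-serving layer.

    `send` is a hypothetical transport callable, not a real client API.
    """

    def __init__(self, send: Callable[[bytes], str]) -> None:
        self._send = send

    def transcribe(self, audio: bytes) -> str:
        return self._send(audio)


def caption(backend: Transcriber, audio: bytes) -> str:
    # Callers depend only on the Transcriber interface, not on where
    # inference actually runs.
    return backend.transcribe(audio).strip()
```

Swapping the backend passed to `caption` changes where inference happens without touching the calling code.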

Verification

  • Time-to-first-caption for the first segment of a new Meeting is under one second at p95 (observed in production latency dashboards).
  • No cold-start warm-up hack exists in the deploy pipeline (no scheduled pings, no keep-warm loop).
  • Model weights are loaded exactly once per container process, at boot.
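The p95 check in the first item could be expressed as a small function over latency samples. This sketch assumes a nearest-rank percentile and made-up input; the production dashboards may compute percentiles differently, and `p95` and `meets_budget` are hypothetical names.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_budget(samples: list[float], budget_seconds: float = 1.0) -> bool:
    # The target is strictly under the budget at p95.
    return p95(samples) < budget_seconds
```

A tail of slow first captions larger than five percent of Meetings fails the check even when the median looks healthy.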
