ML Systems
Python, FastAPI, hexagonal ML architecture, model serving, and the disciplines that keep an AI runtime reliable.
TL;DR
wordloop-ml is a Python FastAPI service structured hexagonally, with model clients, storage, and transport as adapters around a domain of transcription, recap, and embedding. We treat model calls as external dependencies with the same rigour as any other integration — timeouts, retries, budgets, idempotency — and we evaluate model outputs as a first-class part of the test suite.
Why this matters
Machine-learning code in most organisations is a separate zoo from the rest of the backend — different language norms, different testing discipline, different release discipline. We explicitly reject this separation. An ML service is a service; it must meet the same bars for reliability, observability, and maintainability as any other. What makes ML different is what we test (model behaviour, not just code behaviour), not whether we test.
Our principles
1. FastAPI for HTTP, Uvicorn for serving, uv for everything else
FastAPI for routing and validation, Uvicorn for serving, uv for dependency management. The Python ecosystem has a hundred alternatives for each of these; we pick one combination and apply it everywhere.
2. Hexagonal from the outset
wordloop-ml has explicit domain/, ports/, adapters/, and application/ packages, enforced by import-linter rules in CI. Model clients, storage, and the FastAPI router are all adapters. The domain — transcripts, recaps, embeddings — has no model-library imports. See Hexagonal Architecture.
3. Model calls are treated as external integrations
Every call to a model is wrapped in a ModelClient port, implemented by an adapter that handles timeouts, retries with jitter, circuit breaking, and rate-limit respect. The domain never knows which provider is behind the port. Swapping providers is an adapter change — nothing more.
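As a minimal sketch of the port-and-adapter split described above, the following shows a ModelClient port plus a wrapper that adds a timeout and retries with full jitter. The method name complete, the TransientModelError type, and the call_provider callable are assumptions for illustration, not the actual wordloop-ml interface. Circuit breaking and rate-limit handling are left out to keep the example short.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class ModelClient(Protocol):
    """Port: the only surface the domain sees (method name is an assumption)."""

    def complete(self, prompt: str) -> str: ...


class TransientModelError(Exception):
    """Hypothetical error for failures worth retrying, such as timeouts."""


@dataclass
class RetryingModelClient:
    """Adapter wrapper: retries a provider call with exponential backoff and jitter."""

    call_provider: Callable[[str, float], str]  # (prompt, timeout_s) -> text
    timeout_s: float = 10.0
    max_attempts: int = 3
    base_delay_s: float = 0.5
    sleep: Callable[[float], None] = time.sleep  # injectable so tests do not wait

    def complete(self, prompt: str) -> str:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.call_provider(prompt, self.timeout_s)
            except TransientModelError:
                if attempt == self.max_attempts:
                    raise  # budget exhausted: surface the failure to the caller
                # Full jitter: wait a random time up to the exponential cap.
                cap = self.base_delay_s * 2 ** (attempt - 1)
                self.sleep(random.uniform(0, cap))
        raise AssertionError("unreachable")
```

Because the domain depends only on the ModelClient protocol, a fake client that returns canned text can stand in for any provider in unit tests.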
4. Evals are part of the test suite
We maintain an eval set for every significant model-driven behaviour — recap quality, transcription accuracy, embedding consistency — and run it in CI on any change that could affect output. Evals produce numeric scores; thresholds are committed; regressions block merge the same way a failing unit test does. "The model got a little worse" is not an acceptable landing state.
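One way to picture the merge gate is a function that compares mean eval scores against committed thresholds and reports every regression. This is a sketch under stated assumptions: the threshold values, behaviour names, and the failing_evals helper are hypothetical, and a real eval run would produce the scores by calling the model.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical thresholds; in practice these would be committed with the evals.
THRESHOLDS = {"recap_quality": 0.80, "transcription_accuracy": 0.95}


@dataclass(frozen=True)
class EvalResult:
    behaviour: str
    scores: list[float]  # one numeric score per eval case

    @property
    def mean_score(self) -> float:
        return mean(self.scores)


def failing_evals(
    results: list[EvalResult],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Return one message per behaviour whose mean score is below its threshold."""
    failures = []
    for result in results:
        threshold = thresholds[result.behaviour]
        if result.mean_score < threshold:
            failures.append(
                f"{result.behaviour}: {result.mean_score:.3f} < {threshold:.3f}"
            )
    return failures
```

A CI test can then assert that failing_evals(...) returns an empty list, so a score regression fails the build like any other test.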
5. Prompts are code, not configuration
Prompts live in version control, are reviewed, and are tested. They are not in a runtime config that someone can edit by accident. Prompt changes go through the same PR review as code changes, and they are covered by evals.
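A prompt kept as code might look like the sketch below: a module-level constant with an explicit version that evals and traces can refer to. The Prompt class, the RECAP_PROMPT name, and the template text are illustrative assumptions, not the prompts wordloop-ml actually ships.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Prompt:
    name: str
    version: str  # bumped in the same PR that changes the template
    template: Template

    def render(self, **values: str) -> str:
        # substitute() raises KeyError when a placeholder is missing, so a
        # malformed call fails loudly instead of sending a broken prompt.
        return self.template.substitute(**values)


# Hypothetical prompt; any edit to it shows up in review as a normal code diff.
RECAP_PROMPT = Prompt(
    name="recap",
    version="3",
    template=Template(
        "Summarise the transcript below in $max_sentences sentences.\n\n$transcript"
    ),
)
```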
6. Observability spans both sides of the model call
Every model call emits a trace span with input hash, prompt version, model ID, latency, token counts, and cost. An expensive prompt is visible before it is invoiced; a slow prompt is visible before it blocks a user. The model is not a black box inside our system — it is an instrumented dependency.
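The span fields listed above can be sketched with the standard library alone. The traced_call helper, the attribute names, and the flat per-1,000-token price are assumptions for illustration; a real adapter would hand these attributes to its tracing library and would also record failed calls.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelResponse:
    text: str
    input_tokens: int
    output_tokens: int


def traced_call(
    call: Callable[[str], ModelResponse],
    prompt_text: str,
    *,
    prompt_version: str,
    model_id: str,
    usd_per_1k_tokens: float,  # assumed flat price, for illustration only
    emit: Callable[[dict], None],
) -> ModelResponse:
    """Run one model call and emit its span attributes via `emit`."""
    started = time.perf_counter()
    response = call(prompt_text)
    latency_ms = (time.perf_counter() - started) * 1000
    total_tokens = response.input_tokens + response.output_tokens
    emit({
        # Hash rather than raw input, so the span carries no user content.
        "model.input_hash": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "model.prompt_version": prompt_version,
        "model.id": model_id,
        "model.latency_ms": latency_ms,
        "model.input_tokens": response.input_tokens,
        "model.output_tokens": response.output_tokens,
        "model.cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
    })
    return response
```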
7. Caching and determinism are explicit
When a model call can be cached — same input, same prompt version, same model — we cache it. Determinism parameters (temperature, seed) are set explicitly per use case; "whatever the default is" is not a choice. Caching is a first-order cost-engineering lever (Cost Engineering).
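A cache key that encodes "same input, same prompt version, same model" could be built as in this sketch. Folding temperature and seed into the key is an added assumption, so that a change in determinism settings never reuses a stale entry. The function names and the dict-backed cache are hypothetical.

```python
import hashlib
import json
from typing import Callable


def cache_key(
    input_text: str,
    prompt_version: str,
    model_id: str,
    temperature: float,
    seed: int,
) -> str:
    """Derive a stable key from everything that can change the model output."""
    payload = json.dumps(
        {
            "input": input_text,
            "prompt_version": prompt_version,
            "model_id": model_id,
            "temperature": temperature,
            "seed": seed,
        },
        sort_keys=True,  # key order must not affect the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(
    cache: dict[str, str],
    call: Callable[[str], str],
    input_text: str,
    *,
    prompt_version: str,
    model_id: str,
    temperature: float = 0.0,
    seed: int = 0,
) -> str:
    """Return the cached output if present; otherwise call the model once."""
    key = cache_key(input_text, prompt_version, model_id, temperature, seed)
    if key not in cache:
        cache[key] = call(input_text)
    return cache[key]
```

Bumping the prompt version or switching models changes the key, so old entries are simply never hit again rather than needing explicit invalidation.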
8. Stateful containers, not stateless
Unlike our Go services, the ML service runs in stateful containers — models are loaded into memory on startup and kept warm for the life of the container. This is an intentional trade-off documented in an ADR; cold-starting a large model per request is not viable at our scale.
How we apply this
- ML Service Handbook: the architectural walkthrough for wordloop-ml.
- AI Engineering: the broader disciplines for building AI features.
- Observability: how we trace and measure model calls.
- Hexagonal Architecture: the structural pattern wordloop-ml follows most aggressively.
Anti-patterns we reject
- Model library imports in the domain. If the domain imports openai or torch, the domain is no longer the domain.
- Prompts in a runtime config. Untracked, unreviewed, unversioned prompts will drift and break evals silently.
- "It is an ML service, testing is different." It is not. The tests just include evals.
- Uncached expensive calls. Every call with a stable input that we pay for twice is a bug.
- Model outputs trusted blindly. We validate shape, length, and content of model outputs at the adapter boundary. An unchecked model output flowing into the domain is an injection vector waiting to happen.
- Synchronous long model calls on the request path. Anything that takes more than a few hundred milliseconds is queued to a worker, and the request returns a job handle.
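To illustrate the point about not trusting model outputs, here is a sketch of shape, length, and content checks at the adapter boundary. The JSON layout, the Recap fields, the MAX_RECAP_CHARS limit, and the InvalidModelOutput error are all assumptions for illustration.

```python
import json
from dataclasses import dataclass

MAX_RECAP_CHARS = 2000  # hypothetical limit


class InvalidModelOutput(ValueError):
    """Raised when raw model output fails validation at the adapter boundary."""


@dataclass(frozen=True)
class Recap:
    summary: str
    action_items: tuple[str, ...]


def parse_recap(raw: str) -> Recap:
    """Turn raw model text into a domain value, or reject it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise InvalidModelOutput("output is not a JSON object")

    summary = data.get("summary")
    items = data.get("action_items", [])
    if not isinstance(summary, str) or not summary.strip():
        raise InvalidModelOutput("summary is missing or empty")
    if len(summary) > MAX_RECAP_CHARS:
        raise InvalidModelOutput("summary exceeds the length limit")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise InvalidModelOutput("action_items must be a list of strings")

    return Recap(summary=summary.strip(), action_items=tuple(items))
```

Only a validated Recap crosses into the domain; the raw string never does.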
Further reading
- Designing Machine Learning Systems, Chip Huyen — the systems view of production ML.
- Evaluating and Reinforcing LLM Behaviors, Shreya Shankar et al. — the canonical treatment of eval design.
- FastAPI documentation — read the dependency-injection and pydantic chapters closely.
- The Twelve-Factor App — the ML service still respects all twelve, especially config, logs, and dependencies.