Python, FastAPI, hexagonal ML architecture, model serving, and the disciplines that keep an AI runtime reliable.

ML Systems

TL;DR

wordloop-ml is a Python FastAPI service structured hexagonally, with model clients, storage, and transport as adapters around a domain of transcription, recap, and embedding. We treat model calls as external dependencies with the same rigour as any other integration — timeouts, retries, budgets, idempotency — and we evaluate model outputs as a first-class part of the test suite.

Why this matters

Machine-learning code in most organisations is a separate zoo from the rest of the backend — different language norms, different testing discipline, different release discipline. We explicitly reject this separation. An ML service is a service; it must meet the same bars for reliability, observability, and maintainability as any other. What makes ML different is what we test (model behaviour, not just code behaviour), not whether we test.

Our principles

1. FastAPI for HTTP, Uvicorn for serving, uv for everything else

FastAPI for routing and validation, Uvicorn for serving, uv for dependency management. The Python ecosystem has a hundred alternatives for each of these; we pick one combination and apply it everywhere.

2. Hexagonal from the outset

wordloop-ml has explicit domain/, ports/, adapters/, and application/ packages, enforced by import-linter rules in CI. Model clients, storage, and the FastAPI router are all adapters. The domain — transcripts, recaps, embeddings — has no model-library imports. See Hexagonal Architecture.
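As a rough illustration of the dependency direction described above, the sketch below defines a port as a `typing.Protocol` and a domain-facing function that depends only on it. The `complete` method, the `Recap` type, `build_recap`, and `FakeModelClient` are hypothetical names chosen for the example, not the actual wordloop-ml code.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelClient(Protocol):
    """Port: the only view of a model that inner layers are allowed to see."""

    def complete(self, prompt: str) -> str: ...


@dataclass(frozen=True)
class Recap:
    """Domain type (hypothetical): plain data, no model-library imports."""

    transcript_id: str
    text: str


def build_recap(transcript_id: str, transcript: str, model: ModelClient) -> Recap:
    # Depends on the port, never on a provider SDK.
    text = model.complete(f"Summarise the following transcript:\n{transcript}")
    return Recap(transcript_id=transcript_id, text=text.strip())


class FakeModelClient:
    """Test adapter: satisfies the port structurally, with no network involved."""

    def complete(self, prompt: str) -> str:
        return "  A short recap.  "
```

Because the port is structural, a fake adapter can stand in for a real provider in unit tests without any inheritance or patching.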

3. Model calls are treated as external integrations

Every call to a model is wrapped in a ModelClient port, implemented by an adapter that handles timeouts, retries with jitter, circuit breaking, and rate-limit respect. The domain never knows which provider is behind the port. Swapping providers is an adapter change — nothing more.
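One piece of such an adapter, retries with jitter, might look roughly like the sketch below. It assumes a hypothetical `TransientModelError` that the adapter raises for retryable failures; the helper name, defaults, and the injectable `sleep` are illustrative. Timeouts, circuit breaking, and rate-limit handling are deliberately left out.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientModelError(Exception):
    """Hypothetical error for retryable failures (timeouts, rate limits, 5xx)."""


def call_with_retries(
    call: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on TransientModelError using exponential backoff with full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientModelError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: sleep a random duration up to the capped exponential bound.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0.0, bound))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the retry policy testable without real waiting.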

4. Evals are part of the test suite

We maintain an eval set for every significant model-driven behaviour — recap quality, transcription accuracy, embedding consistency — and run it in CI on any change that could affect output. Evals produce numeric scores; thresholds are committed; regressions block merge the same way a failing unit test does. "The model got a little worse" is not an acceptable landing state.
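A committed threshold can be enforced with an ordinary test function. The sketch below is a toy version under stated assumptions: the keyword-recall scorer, the `RECAP_MIN_SCORE` value, and the two eval cases are invented for illustration, and a real eval set would use a more meaningful metric.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical committed threshold; it would live in version control next to the eval set.
RECAP_MIN_SCORE = 0.8

EvalCase = Tuple[str, Sequence[str]]  # (transcript, keywords the recap should mention)

EVAL_CASES: Sequence[EvalCase] = [
    ("Alice will ship the report on Friday.", ["alice", "friday"]),
    ("Budget approved for Q3 hiring.", ["budget", "q3"]),
]


def keyword_recall(output: str, expected_keywords: Sequence[str]) -> float:
    """Toy scorer: fraction of expected keywords present in the output."""
    if not expected_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def run_eval(generate: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Mean score of `generate` over all cases."""
    return mean(keyword_recall(generate(inp), kws) for inp, kws in cases)


def test_recap_quality_meets_threshold() -> None:
    # `echo` stands in for the real recap pipeline in this sketch.
    echo = lambda transcript: transcript
    assert run_eval(echo, EVAL_CASES) >= RECAP_MIN_SCORE
```

A test runner such as pytest would collect `test_recap_quality_meets_threshold` like any other test, so a score below the threshold fails the build.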

5. Prompts are code, not configuration

Prompts live in version control, are reviewed, and are tested. They are not in a runtime config that someone can edit by accident. Prompt changes go through the same PR review as code changes, and they are covered by evals.
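One way to keep a prompt in code is a small immutable object with an explicit version. This is a minimal sketch, assuming a hypothetical `Prompt` dataclass and a `RECAP_PROMPT` constant; the template text and the fingerprint length are illustrative.

```python
import hashlib
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Prompt:
    """A prompt as a reviewed, versioned code artifact (hypothetical type)."""

    name: str
    version: str
    template: Template

    def render(self, **values: str) -> str:
        # substitute() raises KeyError on a missing placeholder instead of
        # silently emitting a broken prompt.
        return self.template.substitute(**values)

    @property
    def fingerprint(self) -> str:
        """Short content hash of the template text."""
        digest = hashlib.sha256(self.template.template.encode("utf-8")).hexdigest()
        return digest[:12]


RECAP_PROMPT = Prompt(
    name="recap",
    version="v1",
    template=Template(
        "Summarise the transcript below in $max_sentences sentences.\n\n$transcript"
    ),
)
```

Any edit to the template text changes the fingerprint, which makes an unversioned prompt change easy to spot in review or in traces.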

6. Observability spans both sides of the model call

Every model call emits a trace span with input hash, prompt version, model ID, latency, token counts, and cost. An expensive prompt is visible before it is invoiced; a slow prompt is visible before it blocks a user. The model is not a black box inside our system — it is an instrumented dependency.
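The shape of such a span can be sketched with the standard library alone. The example below assumes a hypothetical `model_call_span` context manager and an `export` callback standing in for a real tracing backend; the attribute names are illustrative, not an established convention.

```python
import hashlib
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator


@dataclass
class ModelCallSpan:
    """Hypothetical span record; a real service would use its tracing library."""

    attributes: Dict[str, Any] = field(default_factory=dict)


@contextmanager
def model_call_span(
    *,
    model_id: str,
    prompt_version: str,
    model_input: str,
    export: Callable[[ModelCallSpan], None],
) -> Iterator[ModelCallSpan]:
    span = ModelCallSpan(
        {
            "model.id": model_id,
            "prompt.version": prompt_version,
            # Record a hash of the input rather than the raw text.
            "input.sha256": hashlib.sha256(model_input.encode("utf-8")).hexdigest(),
        }
    )
    start = time.perf_counter()
    try:
        yield span  # the caller adds token counts and cost once they are known
    finally:
        # Latency is recorded and the span exported even if the call raised.
        span.attributes["latency_ms"] = (time.perf_counter() - start) * 1000.0
        export(span)
```

Inside the `with` block, the adapter would set attributes such as token counts and cost from the provider response before the span is exported.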

7. Caching and determinism are explicit

When a model call can be cached — same input, same prompt version, same model — we cache it. Determinism parameters (temperature, seed) are set explicitly per use case; "whatever the default is" is not a choice. Caching is a first-order cost-engineering lever (Cost Engineering).
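The cache key follows directly from that rule: it should cover everything that can change the output. The sketch below assumes a hypothetical `cache_key` helper and `CachingModelClient` wrapper, and uses an in-memory dict as a stand-in for whatever shared store a deployment would use.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(
    *, model_id: str, prompt_version: str, temperature: float, seed: int, model_input: str
) -> str:
    """Derive a stable key from every parameter that can change the output."""
    payload = json.dumps(
        {
            "model_id": model_id,
            "prompt_version": prompt_version,
            "temperature": temperature,
            "seed": seed,
            "input": model_input,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachingModelClient:
    """Wraps a model call with an in-memory cache (stand-in for a shared store)."""

    def __init__(
        self,
        call: Callable[[str], str],
        *,
        model_id: str,
        prompt_version: str,
        temperature: float = 0.0,  # set explicitly per use case, never left implicit
        seed: int = 0,
    ) -> None:
        self._call = call
        self._params = {
            "model_id": model_id,
            "prompt_version": prompt_version,
            "temperature": temperature,
            "seed": seed,
        }
        self._cache: Dict[str, str] = {}

    def complete(self, model_input: str) -> str:
        key = cache_key(model_input=model_input, **self._params)
        if key not in self._cache:
            self._cache[key] = self._call(model_input)  # pay for the call only once
        return self._cache[key]
```

Bumping the prompt version or switching models yields a different key, so stale outputs are never served for a changed prompt.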

8. Stateful containers, not stateless

Unlike our Go services, the ML service runs in stateful containers — models are loaded into memory on startup and kept warm for the life of the container. This is an intentional trade-off documented in an ADR; cold-starting a large model per request is not viable at our scale.
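Loading once at startup could be sketched as an async context manager of the kind FastAPI accepts through its `lifespan` parameter. The `MODELS` registry and `load_embedding_model` are hypothetical, and the example avoids importing FastAPI so that it runs on the standard library alone.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Dict

# Hypothetical process-wide registry of warm models.
MODELS: Dict[str, Any] = {}


def load_embedding_model() -> Any:
    """Hypothetical loader; stands in for reading model weights into memory."""
    return object()


@asynccontextmanager
async def lifespan(app: Any) -> AsyncIterator[None]:
    # Runs once when the container starts, before any request is served.
    MODELS["embedding"] = load_embedding_model()
    try:
        yield  # the application serves requests while suspended here
    finally:
        # Runs once at shutdown; releases the models.
        MODELS.clear()


async def demo() -> bool:
    """Exercise the lifespan without a web framework."""
    async with lifespan(app=None):
        return "embedding" in MODELS


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With FastAPI installed, the wiring would be `app = FastAPI(lifespan=lifespan)`; request handlers then read from the registry instead of loading a model per request.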

How we apply this

ML Service Handbook — the architectural walkthrough for wordloop-ml.
AI Engineering — the broader disciplines for building AI features.
Observability — how we trace and measure model calls.
Hexagonal Architecture — the structural pattern wordloop-ml follows most aggressively.

Anti-patterns we reject

Model library imports in the domain. If the domain imports openai or torch, the domain is no longer the domain.
Prompts in a runtime config. Untracked, unreviewed, unversioned prompts will drift and break evals silently.
"It is an ML service, testing is different." It is not. The tests just include evals.
Uncached expensive calls. Every call with a stable input that we pay for twice is a bug.
Model outputs trusted blindly. We validate shape, length, and content of model outputs at the adapter boundary. An unchecked model output flowing into the domain is an injection vector waiting to happen.
Synchronous long model calls on the request path. Any call that takes more than a few hundred milliseconds is queued to a worker, and the request returns a job handle.
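For the output-validation point in the list above, a boundary check might look like the following sketch. The `RecapOutput` type, the `InvalidModelOutput` error, the `MAX_RECAP_CHARS` limit, and the specific content rule (rejecting control characters) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical upper bound on recap length, in characters.
MAX_RECAP_CHARS = 2000


class InvalidModelOutput(ValueError):
    """Raised by the adapter when a model response fails validation."""


@dataclass(frozen=True)
class RecapOutput:
    """Validated value that is safe to hand to the domain."""

    text: str


def parse_recap_output(raw: object) -> RecapOutput:
    # Shape: the adapter expects a plain string.
    if not isinstance(raw, str):
        raise InvalidModelOutput("expected a string")
    text = raw.strip()
    # Length: neither empty nor unbounded.
    if not text:
        raise InvalidModelOutput("empty output")
    if len(text) > MAX_RECAP_CHARS:
        raise InvalidModelOutput("output exceeds length limit")
    # Content: reject control characters other than newline and tab.
    if any(ord(ch) < 32 and ch not in "\n\t" for ch in text):
        raise InvalidModelOutput("output contains control characters")
    return RecapOutput(text=text)
```

Only a `RecapOutput` crosses into the domain, so code past the adapter never handles an unchecked model response.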
