Wordloop

Performance

Latency budgets, tail latency, backpressure, and load shedding.

TL;DR

Performance is not "fast enough" — it is a budget, spent deliberately across every hop of a user interaction and enforced in CI. We optimise for tail latency, we design backpressure into real-time flows, and we measure the things users feel, not the things developers find convenient.

Why this matters

Users notice latency before they notice almost anything else. A transcription that renders in 800ms feels instant; at 3000ms it feels broken. The difference is not a factor of four in effort — it is a difference of whether the team thought about latency as a design constraint or as a post-hoc tuning problem. Performance handled as an afterthought is invariably more expensive than performance designed in from the start.

Our principles

1. Latency is a budget, allocated top-down

Every user-facing operation starts with a latency budget at the edge — say, 500ms — and that budget is allocated to downstream hops. If the recap fetch has 300ms and the transcript join has 150ms, the handler has 50ms of its own work. When a hop overruns its allocation, somebody else's budget gets squeezed. The budgeting view makes trade-offs explicit.
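The allocation above can be written down as data, so that an over-allocation fails loudly instead of silently squeezing another hop. This is a minimal sketch: the `LatencyBudget` class and the hop names are hypothetical, and the numbers are the illustrative ones from the paragraph above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LatencyBudget:
    """Top-down latency budget for one user-facing operation (all values in ms)."""

    total_ms: int
    allocations: dict[str, int] = field(default_factory=dict)

    def remaining_ms(self) -> int:
        return self.total_ms - sum(self.allocations.values())

    def allocate(self, hop: str, ms: int) -> None:
        # Refuse any allocation that would push the sum past the edge budget.
        if ms > self.remaining_ms():
            raise ValueError(
                f"{hop} wants {ms}ms but only {self.remaining_ms()}ms is unallocated"
            )
        self.allocations[hop] = ms

    def overruns(self, observed_ms: dict[str, int]) -> dict[str, int]:
        """Return how far each hop exceeded its allocation; hops within budget are omitted."""
        return {
            hop: observed_ms[hop] - allowed
            for hop, allowed in self.allocations.items()
            if observed_ms.get(hop, 0) > allowed
        }


budget = LatencyBudget(total_ms=500)
budget.allocate("recap_fetch", 300)
budget.allocate("transcript_join", 150)
handler_ms = budget.remaining_ms()  # 50: what is left for the handler's own work
```

Feeding observed latencies into `overruns` shows which hop is squeezing the others, which is the trade-off the budgeting view is meant to expose.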

2. Measure tail latency, not average

p50 is a marketing number. p95 and p99 are what users experience. We measure and alert on the tail; we design for the tail. A system with a great median and a terrible p99 will have an awful reputation, no matter what the dashboard says.
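A small sketch of why the median hides the tail. The sample data is invented, the percentile uses the nearest-rank definition (one of several in common use), and `chance_of_hitting_tail` assumes requests are independent, which real traffic only approximates.

```python
import math
import statistics


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 slow ones.
latencies_ms = [100.0] * 97 + [2000.0] * 3

mean_ms = statistics.fmean(latencies_ms)  # 157.0
p50_ms = percentile(latencies_ms, 50)     # 100.0
p99_ms = percentile(latencies_ms, 99)     # 2000.0


def chance_of_hitting_tail(requests: int, quantile: float = 0.99) -> float:
    """Probability that at least one of `requests` independent calls is slower than the quantile."""
    return 1 - quantile ** requests
```

Under the independence assumption, an interaction that issues 20 requests has roughly an 18% chance of including at least one request slower than p99, so the tail is a common experience rather than a rare one.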

3. Pre-compute, cache, and denormalise deliberately

When a read is hot, we pre-compute. When a computation is stable, we cache. When a join is expensive, we denormalise. Each of these trades complexity for latency; each of them earns its keep with data, not with intuition. Speculative caching is how cache-invalidation bugs become the biggest source of data incidents.
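One way to make a cache earn its keep with data is to have it count its own hits and misses and to state its staleness bound up front. The sketch below is illustrative only: `MeasuredTTLCache` and its parameters are hypothetical names, and a TTL is just one possible invalidation rule.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class MeasuredTTLCache(Generic[V]):
    """Read-through cache with a documented staleness bound (ttl_s) and hit/miss counters."""

    def __init__(
        self,
        loader: Callable[[Hashable], V],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            self.hits += 1
            return entry[1]
        # Missing or expired: pay the full cost and remember when we did.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Explicit invalidation for writers that know the value changed."""
        self._entries.pop(key, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A hit rate that stays low in production is evidence that the cache adds complexity without buying latency, which is the data-over-intuition test described above.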

4. Backpressure is designed in, not hoped for

Every producer has a bounded queue and a defined behaviour when the queue fills: shed, coalesce, or block (see Real-Time). "It works fine in load tests" is not a backpressure strategy.
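The three behaviours can be sketched with the standard library. The class and function names below are hypothetical, and which policy suits a given flow depends on the flow; the only fixed point is that every structure has a maximum size.

```python
import queue
from collections import OrderedDict


class SheddingQueue:
    """Bounded queue that rejects new items when full instead of growing."""

    def __init__(self, maxsize: int) -> None:
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, item) -> bool:
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1  # count drops so shedding shows up in metrics
            return False

    def take(self):
        return self._q.get_nowait()


class CoalescingBuffer:
    """Bounded buffer keyed by stream: a newer update replaces the pending one for the same key."""

    def __init__(self, max_keys: int) -> None:
        self._pending = OrderedDict()
        self._max_keys = max_keys

    def offer(self, key, value) -> bool:
        if key in self._pending:
            self._pending[key] = value  # coalesce: same queue position, latest value only
            return True
        if len(self._pending) >= self._max_keys:
            return False  # still bounded: a brand-new key is shed when full
        self._pending[key] = value
        return True

    def take(self):
        return self._pending.popitem(last=False)  # oldest key first


def blocking_put(q: queue.Queue, item, timeout_s: float) -> bool:
    """Block the producer until there is room, up to timeout_s; False means the wait timed out."""
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False
```

Coalescing fits state updates where only the latest value matters; blocking pushes the slowdown back to the producer; shedding keeps latency flat at the cost of dropped work.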

5. Load shedding protects the system from itself

When the system is saturated, the right behaviour is not to try harder — it is to serve fewer requests well. We shed on clearly-defined criteria: low-priority traffic first, new sessions before active ones, non-interactive before interactive. Shedding is a designed degradation mode, not an accident.
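A sketch of admission control along those criteria. The scoring rule and the utilisation thresholds are assumptions chosen for illustration, as are the `Request` fields; a real system would derive them from its own capacity data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    interactive: bool      # a user is waiting on the response
    session_active: bool   # belongs to a session that is already running
    high_priority: bool    # traffic class marked as important


def keep_score(req: Request) -> int:
    """Count of protective traits (0 to 3); a higher score is shed later."""
    return int(req.interactive) + int(req.session_active) + int(req.high_priority)


def min_score_to_admit(utilisation: float) -> int:
    """Raise the bar as the system saturates (threshold values are assumed)."""
    if utilisation < 0.80:
        return 0
    if utilisation < 0.90:
        return 1
    if utilisation < 0.95:
        return 2
    return 3


def admit(req: Request, in_flight: int, capacity: int) -> bool:
    """Decide at the edge, before any work is done for the request."""
    return keep_score(req) >= min_score_to_admit(in_flight / capacity)
```

Deciding before any work is done is the point: a rejected request costs almost nothing, whereas a request accepted and then timed out has already spent capacity that an admitted request needed.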

6. Hot paths have no allocations to spare

For the hottest inner loops — real-time audio processing, per-turn ingestion — we write allocation-aware code. Every allocation is a GC pause in waiting, and at high rate the pauses become the latency. Most code does not need this discipline; the hot paths demand it.
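The shape of allocation-aware code is the same in most languages: allocate buffers once, outside the loop, and fill them in place. The sketch below shows that shape in Python for consistency with the other examples on this page; the frame size is an assumption, and the function is hypothetical rather than a description of any real pipeline.

```python
import io
from array import array

FRAME_SAMPLES = 160              # assumed: 10 ms of 16 kHz mono audio
FRAME_BYTES = FRAME_SAMPLES * 2  # 16-bit samples


def peak_levels(stream) -> list:
    """Return the peak absolute sample of each full frame, reusing one buffer for every read."""
    frame = bytearray(FRAME_BYTES)         # allocated once, outside the loop
    samples = memoryview(frame).cast("h")  # int16 view over the same memory, no copy
    peaks = []
    # readinto fills the existing buffer instead of returning a new bytes object per frame.
    while stream.readinto(frame) == FRAME_BYTES:
        peaks.append(max(max(samples), -min(samples)))
    return peaks


# Usage with two synthetic frames (hypothetical data).
quiet = array("h", [0] * FRAME_SAMPLES)
loud = array("h", [0] * FRAME_SAMPLES)
loud[5] = -1200
peaks = peak_levels(io.BytesIO(quiet.tobytes() + loud.tobytes()))  # [0, 1200]
```

The loop stops at the first short read, so a trailing partial frame is ignored. The per-frame work touches only memory that already exists, apart from the one integer appended to the result.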

7. Profile before you optimise

Every non-trivial optimisation starts with a profile. The "obvious" bottleneck is almost always wrong, and effort spent tuning a cold path is effort wasted. We profile in production-representative conditions; profiles from developer laptops lie.
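As a starting point, the standard-library profiler is enough to test a hunch before acting on it. This is a minimal sketch: `profile_call` and `slow_join` are hypothetical names, and the stand-in workload would be replaced by the real code path, run under production-representative load.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report of the `top` entries by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return result, out.getvalue()


def slow_join(n: int) -> int:
    """Stand-in workload (hypothetical); replace with the real handler."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_join, 10_000)
```

Sorting by cumulative time surfaces the call trees that dominate the run, which is usually a better first view than per-function self time.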

8. Budgets are enforced in CI

Bundle sizes, Lighthouse scores, and worst-case handler latencies are measured in CI against committed thresholds. A PR that regresses a budget requires an explicit, reviewed waiver. Performance regressions that slip in once slip in a hundred times; automation is cheaper than vigilance.
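A budget gate can be a very small script. In the sketch below the metric names, limits, and the `check_budgets` function are all hypothetical; the measurements are assumed to come from earlier CI steps, and the waiver set from a reviewed file in the repository.

```python
from typing import Dict, List, Tuple

# Committed thresholds (hypothetical names and numbers). "max" means the
# measurement must not exceed the limit; "min" means it must not fall below it.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "bundle_kb": ("max", 250),
    "lighthouse_performance": ("min", 90),
    "handler_worst_case_ms": ("max", 50),
}


def check_budgets(measured, thresholds=THRESHOLDS, waivers=frozenset()) -> List[str]:
    """Return one human-readable line per violated budget; an empty list means the gate passes."""
    violations = []
    for name, (kind, limit) in thresholds.items():
        if name in waivers:
            continue  # explicit, reviewed waiver: skip this budget
        value = measured.get(name)
        if value is None:
            # A missing measurement fails the gate rather than passing silently.
            violations.append(f"{name}: no measurement reported")
        elif kind == "max" and value > limit:
            violations.append(f"{name}: {value} is over the budget of {limit}")
        elif kind == "min" and value < limit:
            violations.append(f"{name}: {value} is under the floor of {limit}")
    return violations
```

A CI step would print the returned lines and exit non-zero when the list is not empty. Treating a missing measurement as a failure prevents a broken benchmark from quietly disabling the gate.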

How we apply this

  • Observability — the measurement surface for latency work.
  • Reliability — the SLO discipline that makes performance budgets enforceable.
  • Frontend — the client-side performance budgets.
  • Real-Time — the streaming-specific patterns we apply.

Anti-patterns we reject

  • Optimising on hunch. No profile, no optimisation.
  • "It is fast on my laptop." Dev latency is not production latency. Measure in the environment that matters.
  • Average-as-metric. Means and medians hide the tail. Alert on p95 and p99.
  • Unbounded queues. A queue without a max is a latency bomb.
  • Cache invalidation left to the reader. If the cache can serve stale data under a defined circumstance, that circumstance is documented. Otherwise it is a bug.
  • "We will fix performance later." If you ship slow, users will remember slow.

Further reading

  • Systems Performance, Brendan Gregg: the canonical reference; start with the material on the USE and RED methods.
  • High Performance Browser Networking, Ilya Grigorik: the frontend-and-network half of the story.
  • Latency Numbers Every Programmer Should Know, Jeff Dean: calibrate your intuition.
  • "How NOT to Measure Latency", Gil Tene: the talk on coordinated omission and why naive latency measurements lie.
