Performance

TL;DR

Performance is not "fast enough" — it is a budget, spent deliberately across every hop of a user interaction and enforced in CI. We optimise for tail latency, we design backpressure into real-time flows, and we measure the things users feel, not the things developers find convenient.

Why this matters

Users notice latency before they notice almost anything else. A transcription that renders in 800ms feels instant; at 3000ms it feels broken. The difference is not a factor of four in effort — it is a difference of whether the team thought about latency as a design constraint or as a post-hoc tuning problem. Performance handled as an afterthought is invariably more expensive than performance designed in from the start.

Our principles

1. Latency is a budget, allocated top-down

Every user-facing operation starts with a latency budget at the edge, say 500ms, and that budget is allocated to downstream hops. If the recap fetch gets 300ms and the transcript join gets 150ms, the handler is left with 50ms for its own work. When a hop overruns its allocation, somebody else's budget gets squeezed. The budgeting view makes these trade-offs explicit.
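A minimal Python sketch of the budgeting arithmetic, using the figures from the example above. The `LatencyBudget` class and the hop names `recap_fetch` and `transcript_join` are hypothetical; this illustrates the bookkeeping, not a prescribed API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Hypothetical edge budget split across named downstream hops (all values in ms)."""

    total_ms: float
    hops: dict[str, float]

    def own_work_ms(self) -> float:
        # Whatever is not allocated to downstream hops is left for the handler itself.
        return self.total_ms - sum(self.hops.values())

    def overruns(self, observed_ms: dict[str, float]) -> dict[str, float]:
        # Each hop that spent more than its allocation, and by how much it
        # squeezed everyone else's share of the edge budget.
        return {
            hop: observed_ms[hop] - allocated
            for hop, allocated in self.hops.items()
            if observed_ms.get(hop, 0.0) > allocated
        }


edge = LatencyBudget(total_ms=500, hops={"recap_fetch": 300, "transcript_join": 150})
print(edge.own_work_ms())  # 50
print(edge.overruns({"recap_fetch": 340, "transcript_join": 120}))  # {'recap_fetch': 40}
```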

2. Measure tail latency, not average

p50 is a marketing number. p95 and p99 are what users experience. We measure and alert on the tail; we design for the tail. A system with a great median and a terrible p99 will have an awful reputation, no matter what the dashboard says.
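To see why the median misleads, here is a small Python sketch using the nearest-rank percentile definition. That is one of several conventions; monitoring systems often estimate percentiles from histograms instead. The latency samples are made up for illustration.

```python
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 <= p <= 100:
        raise ValueError("p must be in [0, 100]")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# 98 fast requests and 2 slow ones: the mean and median look healthy, the tail does not.
latencies = [40.0] * 98 + [2500.0, 3000.0]
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.1f} p50={percentile(latencies, 50)} p99={percentile(latencies, 99)}")
# mean=94.2 p50=40.0 p99=2500.0
```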

3. Pre-compute, cache, and denormalise deliberately

When a read is hot, we pre-compute. When a computation is stable, we cache. When a join is expensive, we denormalise. Each of these trades complexity for latency; each of them earns its keep with data, not with intuition. Speculative caching, added before the data shows a need for it, is how cache-invalidation bugs become the biggest source of data incidents.
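One way to make a cache's staleness explicit and its value measurable is sketched below in Python, assuming a read-through cache with a fixed time-to-live. `TTLCache`, `load_recap`, and the 30-second bound are hypothetical. The staleness rule lives in one place, and the hit/miss counters supply the data the cache has to justify itself with.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class TTLCache(Generic[K, V]):
    """Hypothetical read-through cache with one documented staleness rule:
    a value is served for at most max_age_s seconds after it was loaded.
    Hit/miss counters let the cache justify itself with data."""

    def __init__(self, loader: Callable[[K], V], max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._max_age_s = max_age_s
        self._clock = clock
        self._entries: Dict[K, Tuple[float, V]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._max_age_s:
            self.hits += 1
            return entry[1]
        # Miss or expired: reload rather than serve data older than the documented bound.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: K) -> None:
        # Writers call this on update; if one forgets, the TTL still bounds how stale a read can be.
        self._entries.pop(key, None)


# Usage with a fake clock so expiry is deterministic.
fake_now = [0.0]
loads: list = []


def load_recap(meeting_id: str) -> str:  # hypothetical expensive read
    loads.append(meeting_id)
    return f"recap for {meeting_id}"


cache = TTLCache(load_recap, max_age_s=30.0, clock=lambda: fake_now[0])
cache.get("m1")
cache.get("m1")        # served from cache
fake_now[0] = 31.0
cache.get("m1")        # older than 30 s: reloaded
print(cache.hits, cache.misses)  # 1 2
```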

4. Backpressure is designed in, not hoped for

Every producer has a bounded queue and a defined behaviour when the queue fills: shed, coalesce, or block (see Real-Time). "It works fine in load tests" is not a backpressure strategy.
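A minimal Python sketch of a bounded producer queue where the full-queue behaviour is chosen up front from the three options named above. `BoundedQueue`, `OnFull`, and the speaker keys are hypothetical, and coalescing here assumes that only the latest value per key matters, for example a newer partial result superseding an older one.

```python
import threading
from collections import deque
from enum import Enum
from typing import Deque, Hashable, Optional, Tuple


class OnFull(Enum):
    SHED = "shed"          # reject the new item and tell the producer
    COALESCE = "coalesce"  # fold the new item into a pending item with the same key
    BLOCK = "block"        # make the producer wait, up to a timeout, for room


class BoundedQueue:
    """Hypothetical bounded queue whose behaviour when full is an explicit choice."""

    def __init__(self, maxsize: int, on_full: OnFull) -> None:
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self._items: Deque[Tuple[Hashable, object]] = deque()
        self._maxsize = maxsize
        self._on_full = on_full
        self._cond = threading.Condition()
        self.shed_count = 0  # export as a metric so shedding is visible

    def put(self, key: Hashable, item: object, timeout: Optional[float] = None) -> bool:
        """Return True if the item was enqueued or coalesced, False if it was shed."""
        with self._cond:
            if self._on_full is OnFull.COALESCE:
                for i, (pending_key, _) in enumerate(self._items):
                    if pending_key == key:
                        self._items[i] = (key, item)  # latest value wins, length unchanged
                        return True
            if len(self._items) >= self._maxsize:
                # Only BLOCK waits; SHED, and COALESCE with nothing to merge into, drop the item.
                room = (self._on_full is OnFull.BLOCK and
                        self._cond.wait_for(lambda: len(self._items) < self._maxsize, timeout))
                if not room:
                    self.shed_count += 1
                    return False
            self._items.append((key, item))
            self._cond.notify_all()  # wake any consumer waiting in get()
            return True

    def get(self, timeout: Optional[float] = None) -> Optional[Tuple[Hashable, object]]:
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            entry = self._items.popleft()
            self._cond.notify_all()  # wake any producer blocked on a full queue
            return entry


q = BoundedQueue(maxsize=2, on_full=OnFull.COALESCE)
q.put("speaker-1", "partial: hel")
q.put("speaker-2", "partial: hi")
q.put("speaker-1", "partial: hello")        # merged into the pending speaker-1 entry
print(q.put("speaker-3", "partial: hey"))   # False: full and nothing to merge with, so shed
print(q.get())                              # ('speaker-1', 'partial: hello')
```

The shed counter exists so that dropped work shows up on a dashboard instead of disappearing silently.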

5. Load shedding protects the system from itself

When the system is saturated, the right behaviour is not to try harder — it is to serve fewer requests well. We shed on clearly-defined criteria: low-priority traffic first, new sessions before active ones, non-interactive before interactive. Shedding is a designed degradation mode, not an accident.
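The shedding order above can be expressed as a ranking plus a utilisation threshold per rank. This Python sketch assumes a single utilisation signal (load divided by capacity); the `Request` fields and the threshold values are hypothetical and would need to come from the system's own capacity data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    low_priority: bool
    new_session: bool
    interactive: bool


def shed_rank(req: Request) -> int:
    """Higher rank is shed earlier, following the stated order: low-priority
    traffic first, then new sessions, then non-interactive work."""
    if req.low_priority:
        return 3
    if req.new_session:
        return 2
    if not req.interactive:
        return 1
    return 0  # active, interactive, normal priority: the last thing to shed


# Hypothetical utilisation levels (load / capacity) at which each rank starts being shed.
SHED_AT = {3: 0.80, 2: 0.90, 1: 0.95, 0: 1.00}


def admit(req: Request, utilisation: float) -> bool:
    return utilisation < SHED_AT[shed_rank(req)]


low = Request(low_priority=True, new_session=False, interactive=True)
new = Request(low_priority=False, new_session=True, interactive=True)
background = Request(low_priority=False, new_session=False, interactive=False)
active = Request(low_priority=False, new_session=False, interactive=True)

for u in (0.5, 0.85, 0.92, 0.97):
    print(u, [admit(r, u) for r in (low, new, background, active)])
# 0.5 [True, True, True, True]
# 0.85 [False, True, True, True]
# 0.92 [False, False, True, True]
# 0.97 [False, False, False, True]
```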

6. Hot paths have no allocations to spare

For the hottest inner loops — real-time audio processing, per-turn ingestion — we write allocation-aware code. Every allocation is a GC pause in waiting, and at high rate the pauses become the latency. Most code does not need this discipline; the hot paths demand it.
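A Python sketch of one allocation-aware pattern for per-frame audio work: read every frame into a single preallocated buffer instead of allocating new bytes per read. The frame size and function name are hypothetical, and the sketch assumes 16-bit native-endian PCM and a stream whose `readinto` fills whole frames until the end (true of `io.BytesIO` and buffered files, not necessarily of sockets). CPython still creates integer objects inside the per-sample loop, so this shows buffer reuse rather than a zero-allocation loop.

```python
import io
from array import array
from typing import BinaryIO, List

FRAME_SAMPLES = 160                 # hypothetical: 10 ms of 16 kHz mono audio
FRAME_BYTES = FRAME_SAMPLES * 2     # 16-bit samples; must stay even for the int16 view


def frame_peaks(stream: BinaryIO, frame_bytes: int = FRAME_BYTES) -> List[int]:
    """Peak absolute amplitude of each frame, reusing one buffer for the whole stream."""
    buf = bytearray(frame_bytes)            # allocated once, refilled for every frame
    samples = memoryview(buf).cast("h")     # int16 view over the same memory, created once
    peaks: List[int] = []
    while True:
        n = stream.readinto(buf)            # fill in place; stream.read() would allocate per frame
        if not n:
            break
        peak = 0
        for i in range(n // 2):             # a short final frame fills only part of the buffer
            s = samples[i]
            if s < 0:
                s = -s
            if s > peak:
                peak = s
        peaks.append(peak)
    return peaks


pcm = array("h", [10] * 159 + [-500] + [20] * 159 + [1200] + [3, -4]).tobytes()
print(frame_peaks(io.BytesIO(pcm)))  # [500, 1200, 4]
```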

7. Profile before you optimise

Every non-trivial optimisation starts with a profile. The "obvious" bottleneck is almost always wrong, and effort spent tuning a cold path is effort wasted. We profile in production-representative conditions; profiles from developer laptops lie.
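In Python, the standard library's `cProfile` and `pstats` modules are enough for a first profile. The `profile_top` helper and the ingestion functions below are hypothetical stand-ins. Because `cProfile` instruments every call it adds noticeable overhead, so under production-representative load a low-overhead sampling profiler is usually the better fit.

```python
import cProfile
import io
import pstats
from typing import Callable


def profile_top(fn: Callable[..., object], *args: object, top: int = 5) -> str:
    """Run fn(*args) under cProfile and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return out.getvalue()


# Hypothetical ingestion path standing in for the code under suspicion.
def tokenise(text: str) -> list:
    return text.split()


def normalise(tokens: list) -> list:
    return [t.lower().strip(".,") for t in tokens]


def ingest_turn(text: str) -> int:
    return len(normalise(tokenise(text)))


def ingest_many(n: int) -> None:
    for _ in range(n):
        ingest_turn("So, the action items from today's call are as follows.")


print(profile_top(ingest_many, 10_000))
```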

8. Budgets are enforced in CI

Bundle sizes, Lighthouse scores, worst-case handler latencies: these are measured in CI against committed thresholds. A PR that regresses a budget requires an explicit, reviewed waiver. Performance regressions that slip in once slip in a hundred times; automation is cheaper than vigilance.
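A minimal sketch of such a CI gate in Python, assuming the measurements were collected by earlier pipeline steps. The metric names, threshold values, and the waiver set (standing in for a reviewed waiver file) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool = False  # True for scores; False for sizes and latencies


def check_budgets(thresholds: List[Threshold], measured: Dict[str, float],
                  waivers: Set[str]) -> List[str]:
    """Return one failure line per regressed metric that has no waiver."""
    failures: List[str] = []
    for t in thresholds:
        value = measured.get(t.metric)
        if value is None:
            # A budget nobody measured is treated as a failure, not a silent pass.
            failures.append(f"{t.metric}: not measured")
            continue
        regressed = value < t.limit if t.higher_is_better else value > t.limit
        if regressed and t.metric not in waivers:
            failures.append(f"{t.metric}: measured {value}, budget {t.limit}")
    return failures


# Hypothetical committed thresholds and one CI run's measurements.
THRESHOLDS = [
    Threshold("bundle_kb", 250),
    Threshold("lighthouse_performance", 90, higher_is_better=True),
    Threshold("handler_p99_ms", 500),
]
measured = {"bundle_kb": 262, "lighthouse_performance": 93, "handler_p99_ms": 480}
failures = check_budgets(THRESHOLDS, measured, waivers=set())
print(failures)  # ['bundle_kb: measured 262, budget 250']
# A CI job would exit non-zero whenever `failures` is non-empty.
```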

How we apply this

Observability — the measurement surface for latency work.
Reliability — the SLO discipline that makes performance budgets enforceable.
Frontend — the client-side performance budgets.
Real-Time — the streaming-specific patterns we apply.

Anti-patterns we reject

Optimising on hunch. No profile, no optimisation.
"It is fast on my laptop." Dev latency is not production latency. Measure in the environment that matters.
Average-as-metric. A mean or a p50 hides the tail. Measure and alert on p95 and p99.
Unbounded queues. A queue without a max is a latency bomb.
Cache invalidation left to the reader. If the cache can serve stale data under a defined circumstance, that circumstance is documented. Otherwise it is a bug.
"We will fix performance later." If you ship slow, users will remember slow.
