Data Engineering

Events, streams, CQRS, event sourcing, and the data contracts that outlive any service.

TL;DR

Data outlives services. We treat every event we emit and every table we own as a long-term contract, shaped so downstream consumers — today and in three years — can work with it without archaeology. Events are append-only, schemas are versioned, and the log of what happened is preserved even when the current-state projection is rebuilt.

Why this matters

Services are replaced; data lives on. The user records, Meeting histories, and Transcriptions created by this year's Wordloop will still be in the database when the code that created them has been rewritten twice. The data contracts we set today — table shapes, event payloads, field semantics — are the single most durable thing we will produce. Getting the contract right once is cheap; changing it retroactively after the data has multiplied is brutal.

Our principles

1. Events are append-only and immutable

Once an event is emitted, it is never rewritten. Correction happens through compensating events (a "segment-deleted" event that references the original), not through mutation of the original. This is the discipline that lets downstream consumers trust the event log as a truthful history of the system.
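A minimal sketch of this discipline, assuming a hypothetical in-memory `EventLog` and illustrative event names such as `segment-created` (neither is taken from the Wordloop codebase). The log exposes only `append`, and a deletion is recorded as a new event that points back at the original.

```python
from dataclasses import dataclass
from typing import Optional
import itertools


@dataclass(frozen=True)  # frozen: attributes of an emitted event cannot be reassigned
class Event:
    event_id: int
    event_type: str
    payload: dict  # note: the dict itself is still mutable in this sketch
    refers_to: Optional[int] = None  # id of the event being corrected, if any


class EventLog:
    """Hypothetical append-only log: there is no update or delete operation."""

    def __init__(self) -> None:
        self._events: list[Event] = []
        self._ids = itertools.count(1)

    def append(self, event_type: str, payload: dict,
               refers_to: Optional[int] = None) -> Event:
        event = Event(next(self._ids), event_type, payload, refers_to)
        self._events.append(event)
        return event

    def history(self) -> tuple[Event, ...]:
        # Hand out a tuple so callers cannot reorder or drop entries.
        return tuple(self._events)


log = EventLog()
created = log.append("segment-created", {"segment": "s1", "text": "hello"})
# The correction is a compensating event, not an edit of `created`.
deleted = log.append("segment-deleted", {"segment": "s1"},
                     refers_to=created.event_id)
```

After both calls the history still contains the original `segment-created` event, followed by the compensating event that references it.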

2. Schemas are versioned and evolvable

Event payloads have explicit versions. New fields are additive; removed fields are deprecated with a deadline, not removed silently. Consumers can detect an old schema and handle it or refuse it — they are never surprised. This is the AsyncAPI discipline (API Design) applied to every stream.
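One way a consumer might handle this, sketched under assumptions: the `schema_version` field name, the `language` field, and the version numbers are hypothetical and only illustrate the handle-or-refuse rule.

```python
SUPPORTED_VERSIONS = {1, 2}  # assumption: versions this consumer understands


class UnsupportedSchemaVersion(Exception):
    """Raised when a payload carries a version this consumer cannot read."""


def read_transcription_event(event: dict) -> dict:
    version = event.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        # Refuse loudly instead of guessing at an unknown shape.
        raise UnsupportedSchemaVersion(
            f"cannot handle schema_version={version!r}"
        )
    return {
        "meeting_id": event["meeting_id"],
        "text": event["text"],
        # `language` is an additive field introduced in version 2;
        # version 1 payloads get an explicit default.
        "language": event.get("language", "unknown"),
    }


v1 = read_transcription_event(
    {"schema_version": 1, "meeting_id": "m1", "text": "hi"}
)
v2 = read_transcription_event(
    {"schema_version": 2, "meeting_id": "m1", "text": "hi", "language": "en"}
)
```

Because the new field is additive, the version 1 payload remains readable; a payload with an unknown version raises instead of being silently misread.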

3. Partition keys are chosen deliberately

Event topics partition by the identifier that matters for ordering — typically meeting_id — so that all events for a single Meeting flow through a single partition in sequence. Choosing a partition key casually is one of the most expensive mistakes in a data system; we treat it as a design decision that deserves review.
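The idea can be sketched as a stable hash of the key, assuming a hypothetical topic with 12 partitions. Real brokers ship their own partitioners; this only illustrates why every event for one Meeting lands in the same place.

```python
import hashlib

NUM_PARTITIONS = 12  # assumption for illustration


def partition_for(meeting_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Use a stable hash. Python's built-in hash() is salted per process for
    # strings, so it would route the same key differently across restarts.
    digest = hashlib.sha256(meeting_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


events = [
    {"meeting_id": "m-42", "type": "segment-added"},
    {"meeting_id": "m-42", "type": "segment-deleted"},
    {"meeting_id": "m-42", "type": "meeting-ended"},
]
# All events for one Meeting map to a single partition.
partitions = {partition_for(e["meeting_id"]) for e in events}
```

Note that the mapping depends on `num_partitions`, so changing the partition count later moves keys between partitions. That is one reason the choice deserves review up front.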

4. CQRS where it pays

For read-heavy surfaces with complex projections — the synthesis dashboard, the Meeting timeline — we maintain a read model separate from the write model. The write model owns truth; the read model owns query performance. We do not apply CQRS universally; we apply it where the read load and the write load have genuinely different shapes.
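A small sketch of the split, with hypothetical class names and a synchronous in-process subscription standing in for whatever transport connects the two sides in practice. The write model validates and records; the read model keeps a shape that is cheap to query.

```python
from collections import defaultdict
from typing import Callable


class MeetingWriteModel:
    """Owns truth: validates commands and records what happened."""

    def __init__(self) -> None:
        self.events: list[dict] = []
        self._subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def add_segment(self, meeting_id: str, speaker: str, text: str) -> None:
        if not text.strip():
            raise ValueError("segment text must not be empty")
        event = {"type": "segment-added", "meeting_id": meeting_id,
                 "speaker": speaker, "text": text}
        self.events.append(event)
        for handler in self._subscribers:
            handler(event)


class MeetingTimelineReadModel:
    """Owns query performance: a precomputed timeline per Meeting."""

    def __init__(self) -> None:
        self._timeline: dict[str, list[str]] = defaultdict(list)

    def apply(self, event: dict) -> None:
        if event["type"] == "segment-added":
            line = f'{event["speaker"]}: {event["text"]}'
            self._timeline[event["meeting_id"]].append(line)

    def timeline(self, meeting_id: str) -> list[str]:
        return list(self._timeline.get(meeting_id, []))


write_model = MeetingWriteModel()
read_model = MeetingTimelineReadModel()
write_model.subscribe(read_model.apply)
write_model.add_segment("m1", "ana", "Welcome")
write_model.add_segment("m1", "ben", "Thanks")
```

Because the read model is derived, it can be thrown away and rebuilt by replaying `write_model.events` into a fresh `MeetingTimelineReadModel`.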

5. Event sourcing is a tool, not a religion

For domains where the history of change is itself the product — audit logs, participation timelines — we store the event log as the primary artefact and derive current state from it. For domains where current state is what matters, we store current state and publish events as derivatives. Event sourcing every table "because it is purer" is overengineering.
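For the event-sourced case, current state is a fold over the log. The sketch below assumes hypothetical `participant-joined` and `participant-left` events for a participation timeline.

```python
from functools import reduce
from typing import Iterable


def apply_event(state: frozenset, event: dict) -> frozenset:
    """Pure transition function: previous state + one event -> next state."""
    if event["type"] == "participant-joined":
        return state | {event["user"]}
    if event["type"] == "participant-left":
        return state - {event["user"]}
    return state  # event types this projection does not know are ignored


def current_participants(events: Iterable[dict]) -> frozenset:
    return reduce(apply_event, events, frozenset())


timeline = [
    {"type": "participant-joined", "user": "ana"},
    {"type": "participant-joined", "user": "ben"},
    {"type": "participant-left", "user": "ana"},
]
now = current_participants(timeline)
# Replaying a prefix of the log answers "who was present at that point?"
after_two = current_participants(timeline[:2])
```

The same log answers both the current-state question and the historical one, which is the property that makes event sourcing worth its cost in these domains.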

6. Data contracts are documented, versioned, and owned

Every significant table and every published event has an owner, a documented schema, a migration history, and a compatibility policy. Consumers find this on the Database Reference and the Events Reference. Unowned tables and undocumented events are a ticking integration-debt clock.
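A contract record can be checked mechanically. The sketch below assumes a hypothetical `DataContract` shape and made-up table names and owners; it is not the format of the Database Reference or the Events Reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataContract:
    name: str
    owner: Optional[str]          # team accountable for the table or event
    schema_doc: Optional[str]     # where the documented schema lives
    compatibility: Optional[str]  # e.g. "backward"; the policy consumers rely on
    migrations: tuple[str, ...] = ()  # ordered migration history


def contract_gaps(contract: DataContract) -> list[str]:
    """Return the names of required contract fields that are missing."""
    required = ("owner", "schema_doc", "compatibility")
    return [name for name in required if not getattr(contract, name)]


meetings = DataContract(
    name="meetings",
    owner="team-meetings",
    schema_doc="docs/db/meetings",
    compatibility="backward",
    migrations=("0001_create", "0002_add_language"),
)
orphan = DataContract(name="tmp_export", owner=None,
                      schema_doc=None, compatibility=None)
```

A check like `contract_gaps` could run in CI so that an unowned table or undocumented event is flagged when it is introduced.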

7. Retention is a design decision

Every dataset we store has a retention policy: delete after N days, archive after M days, or keep indefinitely. Retention is decided when the dataset is created, reviewed when the regulatory surface changes (Privacy), and enforced by automation. "We will figure it out later" is the decision that becomes a compliance incident three years later.
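Automated enforcement can start from a policy object and a pure decision function. The names `RetentionPolicy` and `retention_action` and the example durations are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    archive_after: Optional[timedelta] = None
    delete_after: Optional[timedelta] = None
    # Both None means the dataset is kept indefinitely, as an explicit choice.


def retention_action(created_at: datetime, now: datetime,
                     policy: RetentionPolicy) -> str:
    """Decide what the automation should do with one record."""
    age = now - created_at
    if policy.delete_after is not None and age >= policy.delete_after:
        return "delete"
    if policy.archive_after is not None and age >= policy.archive_after:
        return "archive"
    return "keep"


policy = RetentionPolicy(archive_after=timedelta(days=90),
                         delete_after=timedelta(days=365))
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
recent = retention_action(now - timedelta(days=10), now, policy)
old = retention_action(now - timedelta(days=120), now, policy)
expired = retention_action(now - timedelta(days=400), now, policy)
```

Keeping the decision separate from the side effect makes the policy easy to test and to review when the regulatory surface changes.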

8. Backfills are a planned operation

Changing the shape of historical data — renaming a field, re-computing a derived column — is a project with a plan, a rollback, and a measurement. We do not backfill by running a script on a Tuesday and hoping. Backfills are rehearsed in staging and measured in production.
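A planned backfill might look like the sketch below: batched, idempotent, measurable, and with a dry-run mode for rehearsal. The function name, the in-memory rows, and the `speaker` to `speaker_name` rename are hypothetical.

```python
def backfill_rename(rows: list[dict], old: str, new: str,
                    batch_size: int = 2, dry_run: bool = True) -> dict:
    """Copy field `old` into `new`, batch by batch, and report what happened."""
    stats = {"scanned": 0, "changed": 0, "batches": 0}
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        stats["batches"] += 1
        for row in batch:
            stats["scanned"] += 1
            # Idempotent: rows that already have `new` are skipped on re-runs.
            if old in row and new not in row:
                stats["changed"] += 1
                if not dry_run:
                    # Keep `old` in place so rollback is just dropping `new`.
                    row[new] = row[old]
    return stats


rows = [
    {"id": 1, "speaker": "ana"},
    {"id": 2, "speaker": "ben"},
    {"id": 3, "speaker": "cy"},
]
rehearsal = backfill_rename(rows, "speaker", "speaker_name")  # dry run
applied = backfill_rename(rows, "speaker", "speaker_name", dry_run=False)
rerun = backfill_rename(rows, "speaker", "speaker_name", dry_run=False)
```

The dry run reports how many rows would change without touching them, and the returned counts give the measurement to compare between staging and production.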

How we apply this

Anti-patterns we reject

  • Silent schema changes. Renaming a column in a hot table without coordinating consumers. This is how outages start.
  • Mutable event logs. Going back and "fixing" a past event. The event is what happened; the correction is a new event.
  • Kitchen-sink "events" table. One table that accepts a JSON blob for every kind of event. The type system is the best friend of a data contract; do not throw it away.
  • Backfills in production without rehearsal. See principle 8: rehearse in staging, measure in production.
  • Retention by accident. Tables that grow forever because no one considered retention at creation time.

Further reading

  • Designing Data-Intensive Applications, Martin Kleppmann — the single best survey of the territory, including the chapters on derived data, stream processing, and batch processing.
  • Data Mesh, Zhamak Dehghani — the argument for treating data as a first-class product with owners.
  • Streaming Systems, Akidau, Chernyak, Lax — the deep treatment of time, watermarks, and windowing in stream processing.
  • Event Sourcing and CQRS, Vaughn Vernon (the relevant chapters of Implementing DDD) — a grounded, implementation-focused view.
