Audio

Binary audio transport — frame formats, chunk storage, ML forwarding, acknowledgement, and backpressure.

Audio chunks flow from the browser through Core to ML as binary WebSocket frames. This page covers the frame formats for both hops, chunk-based GCS storage, ML acknowledgement, and backpressure signalling. For the recording lifecycle (start/stop/resume commands and events), see Recording. For shared connection semantics, see Infrastructure.

Browser → Core: Binary Audio Frame

Audio chunks are sent as binary WebSocket frames using a length-prefixed metadata envelope followed by raw audio bytes.

uint32_be metadata_length
utf8_json metadata
raw_audio_bytes

Metadata schema:

{
  "type": "com.wordloop.recording.audio_chunk.v1",
  "id": "chunk-event-uuid",
  "traceparent": "00-...",
  "meeting_id": "meeting-uuid",
  "sequence": 1842,
  "started_at_ms": 184200,
  "duration_ms": 100,
  "mime_type": "audio/webm",
  "crc32": "hex-encoded-crc32"
}
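
The envelope above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the helper names are hypothetical, and it assumes the crc32 field covers only the raw audio bytes and is written as 8 lowercase hex digits, neither of which this page specifies for the browser frame.

```python
import json
import struct
import zlib


def encode_frame(metadata: dict, audio: bytes) -> bytes:
    """Pack as: uint32_be metadata_length | UTF-8 JSON metadata | raw audio."""
    meta_bytes = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    return struct.pack(">I", len(meta_bytes)) + meta_bytes + audio


def decode_frame(frame: bytes) -> tuple[dict, bytes]:
    """Split a frame into (metadata, audio), rejecting truncated input."""
    if len(frame) < 4:
        raise ValueError("frame shorter than the 4-byte length prefix")
    (meta_len,) = struct.unpack_from(">I", frame, 0)
    if len(frame) < 4 + meta_len:
        raise ValueError("metadata_length exceeds frame size")
    metadata = json.loads(frame[4:4 + meta_len].decode("utf-8"))
    return metadata, frame[4 + meta_len:]


def crc32_hex(audio: bytes) -> str:
    # Assumption: 8 lowercase hex digits over the audio bytes only.
    return format(zlib.crc32(audio) & 0xFFFFFFFF, "08x")
```

Because the length prefix counts bytes of encoded JSON rather than characters, the metadata is encoded to UTF-8 before its length is measured.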

Core verifies the CRC32 checksum, stores the chunk by sequence number in GCS, enriches the metadata with ml_session_id, forwards the frame to ML over the ML WebSocket, and records the highest contiguous sequence. Duplicate sequences are acknowledged but not re-stored.
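
One way to picture the verify, deduplicate, and track steps is the sketch below. It uses an in-memory dict as a stand-in for GCS, and several details are assumptions rather than documented behaviour: the class and method names are hypothetical, the first sequence number is taken as a parameter, and a chunk that fails its checksum is simply rejected.

```python
import zlib


class ChunkIngest:
    """Tracks stored sequences for one meeting (in-memory stand-in for GCS)."""

    def __init__(self, first_sequence: int = 0) -> None:
        # Assumption: the source does not say whether sequences start at 0 or 1.
        self.stored: dict[int, bytes] = {}
        self.highest_contiguous = first_sequence - 1

    def accept(self, sequence: int, crc32_hex: str, audio: bytes) -> str:
        actual = format(zlib.crc32(audio) & 0xFFFFFFFF, "08x")
        if actual != crc32_hex.lower():
            return "rejected"   # assumption: a corrupt chunk is not stored
        if sequence in self.stored:
            return "duplicate"  # acknowledged, but not re-stored
        self.stored[sequence] = audio
        # Advance the watermark across any run that is now gap-free.
        while self.highest_contiguous + 1 in self.stored:
            self.highest_contiguous += 1
        return "stored"
```

A gap leaves the watermark where it is; once the missing chunk is backfilled, the watermark jumps past every later chunk that has already arrived.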

Chunk-Based GCS Storage

Each audio chunk is stored as a separate GCS object keyed by sequence number: meetings/{id}/chunks/{seq:08d}.webm. WebM encodes its EBML header in the first chunk; subsequent chunks contain raw Cluster data. This structure enables gap recovery — any chunk missed due to a connectivity failure can be backfilled from OPFS by sequence number. At session end, Core composes the chunk objects into the final audio.webm using GCS Compose — hierarchically in groups of ≤32 for recordings that exceed GCS's 32-object compose limit.
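
The object naming and the grouping into compose rounds can be sketched as a pure planning function. Nothing here calls GCS; it only produces the list of compose steps. The function names and the naming of intermediate objects are hypothetical.

```python
def chunk_object_name(meeting_id: str, sequence: int) -> str:
    # Zero-padded to 8 digits so lexical order matches sequence order.
    return f"meetings/{meeting_id}/chunks/{sequence:08d}.webm"


def plan_compose(sources: list[str], final_name: str,
                 limit: int = 32) -> list[tuple[str, list[str]]]:
    """Return (destination, sources) compose steps in the order to run them."""
    if not sources:
        raise ValueError("nothing to compose")
    steps = []
    level = 0
    current = list(sources)
    while len(current) > limit:
        next_level = []
        for i in range(0, len(current), limit):
            group = current[i:i + limit]
            # Hypothetical naming scheme for intermediate objects.
            dest = f"{final_name}.part-{level}-{i // limit:06d}"
            steps.append((dest, group))
            next_level.append(dest)
        current = next_level
        level += 1
    steps.append((final_name, current))
    return steps
```

Groups are always consecutive slices, so the byte order of the final object follows the sequence order of the chunks. A real implementation would also need to delete the intermediate objects once the final compose succeeds.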

OPFS Shadow Buffer

Every audio chunk is simultaneously written to an always-on shadow buffer maintained by a dedicated Web Worker using the Origin Private File System (OPFS) createSyncAccessHandle() API. Each chunk carries a monotonically incrementing sequence number assigned in the browser. This buffer runs unconditionally — it captures audio regardless of Core or GCS connectivity. It is cleared only after Core confirms all chunks are safely in GCS.

OPFS Chunk Storage Format

Each chunk is stored in OPFS with an integrity envelope so corrupted chunks can be detected during gap recovery:

uint32_be crc32
uint32_be audio_length
raw_audio_bytes

The CRC32 is computed over the raw audio bytes. On read (during gap recovery), the reader verifies the CRC32 before uploading. Chunks that fail verification are skipped — the post-meeting batch transcription will handle any resulting audio gaps.
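
A minimal sketch of the integrity envelope follows. The real writer runs in a browser Web Worker; this version only illustrates the byte layout and the verify-on-read rule. It assumes the common CRC-32 variant implemented by zlib, which this page does not state explicitly, and the function names are hypothetical.

```python
import struct
import zlib
from typing import Optional

HEADER = struct.Struct(">II")  # uint32_be crc32, uint32_be audio_length


def wrap_chunk(audio: bytes) -> bytes:
    """Prefix the audio with its CRC32 and length."""
    return HEADER.pack(zlib.crc32(audio) & 0xFFFFFFFF, len(audio)) + audio


def read_chunk(record: bytes) -> Optional[bytes]:
    """Return the audio bytes, or None if the record is truncated or corrupt."""
    if len(record) < HEADER.size:
        return None
    crc, length = HEADER.unpack_from(record, 0)
    audio = record[HEADER.size:HEADER.size + length]
    if len(audio) != length or (zlib.crc32(audio) & 0xFFFFFFFF) != crc:
        return None  # caller skips this chunk during gap recovery
    return audio
```

Storing audio_length alongside the checksum lets the reader tell a truncated write apart from a flipped byte, although both cases end with the chunk being skipped.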

Core → ML: Binary Audio Frame

Core enriches the browser's binary frame with ml_session_id before forwarding it to ML. The binary framing is identical (length-prefixed metadata followed by raw audio), but the metadata differs in two fields: type becomes com.wordloop.ml.audio_chunk.v1, and ml_session_id is added.

uint32_be metadata_length
utf8_json metadata
raw_audio_bytes

Metadata schema:

{
  "type": "com.wordloop.ml.audio_chunk.v1",
  "id": "chunk-event-uuid",
  "traceparent": "00-...",
  "meeting_id": "meeting-uuid",
  "ml_session_id": "ml-session-uuid",
  "sequence": 1842,
  "started_at_ms": 184200,
  "duration_ms": 100,
  "mime_type": "audio/webm",
  "crc32": "hex-encoded-crc32"
}
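
The enrichment step can be sketched as a re-framing function. This is illustrative only: the function name is hypothetical, and it assumes every other metadata field (including id and traceparent) is carried through unchanged, which the two schemas suggest but this page does not state.

```python
import json
import struct

ML_CHUNK_TYPE = "com.wordloop.ml.audio_chunk.v1"


def enrich_for_ml(browser_frame: bytes, ml_session_id: str) -> bytes:
    """Rewrite a browser frame's metadata for the Core -> ML hop."""
    (meta_len,) = struct.unpack_from(">I", browser_frame, 0)
    metadata = json.loads(browser_frame[4:4 + meta_len].decode("utf-8"))
    audio = browser_frame[4 + meta_len:]
    # Assumption: all other fields pass through untouched.
    ml_metadata = {**metadata, "type": ML_CHUNK_TYPE, "ml_session_id": ml_session_id}
    meta_bytes = json.dumps(ml_metadata, separators=(",", ":")).encode("utf-8")
    # The length prefix must be recomputed because the metadata changed size.
    return struct.pack(">I", len(meta_bytes)) + meta_bytes + audio
```

The audio bytes are copied as-is, so the crc32 value from the browser remains valid on the ML side.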

ML acknowledges processed audio progress through AudioChunkAckEvent, not per-frame WebSocket acks. This avoids chatty acknowledgements while still letting Core detect lag.

ML → Core: AudioChunkAckEvent

Reports processed audio progress. Core uses this for diagnostics and backpressure decisions.

{
  "specversion": "1.0",
  "id": "event-uuid",
  "source": "wordloop-ml/ws",
  "type": "com.wordloop.ml.audio_chunk.ack.v1",
  "time": "2026-05-01T09:03:05Z",
  "traceparent": "00-...",
  "data": {
    "meeting_id": "meeting-uuid",
    "last_sequence_received": 1842,
    "last_sequence_processed": 1841
  }
}
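
As a rough illustration of how Core might turn an ack into a lag measurement, the sketch below compares the two ack fields with the last sequence Core has forwarded. The function name, the last_sequence_forwarded counter, and the two output labels are hypothetical; this page does not define how Core computes lag.

```python
def ack_lag(ack_data: dict, last_sequence_forwarded: int) -> dict:
    """Derive two lag figures from the data block of an AudioChunkAckEvent."""
    received = ack_data["last_sequence_received"]
    processed = ack_data["last_sequence_processed"]
    return {
        # Forwarded by Core but not yet seen by ML.
        "in_flight": last_sequence_forwarded - received,
        # Seen by ML but not yet processed.
        "unprocessed": received - processed,
    }
```

With the example event above and a Core that has forwarded up to sequence 1850, this yields 8 chunks in flight and 1 unprocessed.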

Backpressure

ML → Core: BackpressureEvent

Tells Core that ML is falling behind. Core continues storing audio to GCS and may degrade live insights while preserving the recording.

{
  "specversion": "1.0",
  "id": "event-uuid",
  "source": "wordloop-ml/ws",
  "type": "com.wordloop.ml.backpressure.v1",
  "time": "2026-05-01T09:05:00Z",
  "traceparent": "00-...",
  "data": {
    "meeting_id": "meeting-uuid",
    "reason": "provider_latency",
    "retry_after_ms": 1000,
    "queue_depth": 128
  }
}

ML → Core: BackpressureClearedEvent

Explicitly signals that ML has recovered from backpressure. Without this, Core must infer recovery from the absence of further BackpressureEvent messages or from AudioChunkAckEvent progress, which makes Core's state machine ambiguous.

{
  "specversion": "1.0",
  "id": "event-uuid",
  "source": "wordloop-ml/ws",
  "type": "com.wordloop.ml.backpressure_cleared.v1",
  "time": "2026-05-01T09:05:30Z",
  "traceparent": "00-...",
  "data": {
    "meeting_id": "meeting-uuid",
    "queue_depth": 0
  }
}
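
With an explicit cleared event, Core's per-meeting state reduces to a two-state toggle. The sketch below is one possible shape for it, with hypothetical class and method names; what Core actually does while degraded is outside its scope.

```python
BACKPRESSURE = "com.wordloop.ml.backpressure.v1"
CLEARED = "com.wordloop.ml.backpressure_cleared.v1"


class MlBackpressureState:
    """Remembers which meetings ML has reported as falling behind."""

    def __init__(self) -> None:
        self.degraded: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Apply one ML event and return whether the meeting is now degraded."""
        meeting_id = event["data"]["meeting_id"]
        if event["type"] == BACKPRESSURE:
            self.degraded.add(meeting_id)
        elif event["type"] == CLEARED:
            self.degraded.discard(meeting_id)
        # Other event types, such as acks, leave the state unchanged.
        return meeting_id in self.degraded
```

Audio storage to GCS is not gated on this state; only the live-insight path would consult it.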

Client-Side Backpressure

Core does not send an explicit backpressure event to the browser. Instead, the client monitors WebSocket.bufferedAmount on the Core-facing connection. If bufferedAmount exceeds a configurable pause threshold (default: 5 MB), the client stops sending MediaRecorder output over the WebSocket and keeps new chunks in the OPFS shadow buffer only. When bufferedAmount drops below the resume threshold (default: 1 MB), the client resumes sending. This uses the browser's native WebSocket flow control rather than adding a custom protocol-level backpressure mechanism.
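
The two thresholds form a hysteresis band, which keeps the client from flapping between sending and pausing when the buffer hovers near a single limit. The sketch below shows only that decision logic; the real client would read bufferedAmount from the browser WebSocket. The class name is hypothetical, and 1 MB is taken as 1024 * 1024 bytes, which this page does not specify.

```python
PAUSE_THRESHOLD = 5 * 1024 * 1024   # default pause threshold (5 MB, assumed MiB)
RESUME_THRESHOLD = 1 * 1024 * 1024  # default resume threshold (1 MB, assumed MiB)


class SendGate:
    """Decides whether the next chunk may be sent over the WebSocket."""

    def __init__(self, pause_at: int = PAUSE_THRESHOLD,
                 resume_at: int = RESUME_THRESHOLD) -> None:
        if resume_at >= pause_at:
            raise ValueError("resume threshold must be below pause threshold")
        self.pause_at = pause_at
        self.resume_at = resume_at
        self.paused = False

    def should_send(self, buffered_amount: int) -> bool:
        if self.paused and buffered_amount < self.resume_at:
            self.paused = False
        elif not self.paused and buffered_amount > self.pause_at:
            self.paused = True
        return not self.paused
```

Between the two thresholds the gate keeps whatever state it was already in, so a buffer sitting at 3 MB means "keep sending" on the way up and "stay paused" on the way down.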
