Observability
Unified systems observability through trace-first development.
At WordLoop, we embrace a unified approach to understanding our systems: Trace-First Development.
By treating observability as a single, contiguous stream of trace data, we empower our engineers to debug faster and understand system behavior more deeply. We maintain a single source of truth that inherently preserves the execution context of every action taken within our platform.
Wide, Structured Events
We track unit-of-work executions as single, wide events known as Spans.
When an operation or outbound HTTP call is wrapped in a span, it inherently records:
- Duration (Start and End times)
- Status (OK vs Error)
- High-Cardinality Context (Tenant IDs, User Tiers, exact input shapes)
Our observability backend aggregates these wide spans automatically to visualize system health. If an error rate spikes or latency increases, we never have to guess why the metric changed—we simply click the spike and instantly view the exact spans that generated it. The context is never severed from the symptom.
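The idea of a wide span can be sketched in a few lines of plain Python. This is an illustrative stand-in only (the event shape, attribute names like `tenant.id`, and the `EMITTED` list are assumptions for the example); in production the OpenTelemetry SDK records and exports spans for us:

```python
import time
import json
from contextlib import contextmanager

EMITTED = []  # stand-in for the telemetry export pipeline

@contextmanager
def span(name, **attributes):
    """Record one unit of work as a single wide event.

    Illustrative sketch only; in production the OpenTelemetry SDK
    does this, not hand-rolled code.
    """
    event = {"name": name, "attributes": dict(attributes)}
    start = time.monotonic()
    try:
        yield event["attributes"]  # caller can attach more context mid-flight
        event["status"] = "OK"
    except Exception as exc:
        event["status"] = "ERROR"
        event["attributes"]["exception.message"] = str(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        EMITTED.append(event)

# One wide event carries duration, status, and high-cardinality
# context together, so the context is never severed from the symptom.
with span("transcribe", **{"tenant.id": "t-123", "user.tier": "pro"}) as attrs:
    attrs["audio.size_bytes"] = 1024

print(json.dumps(EMITTED[0]))
```

Because every attribute lives on the same event as the duration and status, slicing an error spike by tenant or input shape is a single query rather than a log-correlation exercise.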
OpenTelemetry as the Standard
We utilize OpenTelemetry (OTel) across all parts of the WordLoop platform—from the frontend browser to the Go API, down to the deepest Python ML pipeline execution.
- Vendor Neutrality: We emit standard OTel data natively. Our choice of observability backend is a deployment configuration, not a code change.
- Identity Everywhere (W3C Baggage): Our services do not act in isolation. A user clicking "Transcribe" generates a request that cascades across the system. The `enduser.id` and other critical context parameters must ride along automatically via headers and metadata.
- Trace Context Propagation: Every inter-service hop—whether over HTTP or through Pub/Sub queues—must carry the `traceparent` header to stitch the entire distributed execution graph together.
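To make the wire format concrete, here is a minimal sketch of the two headers involved. The helper names are hypothetical; real services should rely on the OTel propagators to inject these automatically rather than building them by hand:

```python
import secrets
from urllib.parse import quote

def make_traceparent(trace_id=None, span_id=None):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # 01 = sampled flag

def make_baggage(entries):
    """Build a W3C baggage header from key/value context pairs."""
    return ",".join(f"{key}={quote(value)}" for key, value in entries.items())

def outbound_headers(traceparent, context):
    # In practice the OTel propagators inject these on every outbound
    # call; the format is spelled out here so it is visible.
    return {"traceparent": traceparent, "baggage": make_baggage(context)}

headers = outbound_headers(make_traceparent(), {"enduser.id": "u-42"})
print(headers)
```

The downstream service extracts both headers, continues the trace under the same trace ID, and sees `enduser.id` without that service ever having handled the original login.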
Tail-Based Sampling & Financial Responsibility
Capturing 100% of telemetry data is an anti-pattern at scale; telemetry ingest costs can quickly outpace compute costs. We act with financial responsibility by leveraging Tail-Based Sampling at the collector level.
Rather than making a blind decision to drop traces at the source (Head-Based Sampling), our system evaluates the entire trace after the request finishes.
- Errors & Latency (100% sampled): If a trace contains an error, exception, or breaches a latency threshold, the collector guarantees it is passed to our backend so developers have the data they need to debug.
- Happy Paths (5% sampled): If a trace runs perfectly without anomalies, 95% of them are safely discarded at the networking edge, saving immense bandwidth and storage costs without sacrificing visibility into failure states.
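In the OpenTelemetry Collector, this policy is expressed with the `tail_sampling` processor. The sketch below assumes our stated thresholds; the policy names and the 500 ms latency cutoff are illustrative, and a trace is kept if any policy matches:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # hold spans until the trace completes
    policies:
      - name: keep-errors       # 100% of traces containing an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # 100% of traces over the latency threshold
        type: latency
        latency:
          threshold_ms: 500
      - name: happy-path-5pct   # 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Because the error and latency policies fire unconditionally, the probabilistic policy only determines the fate of traces that matched neither, which is exactly the 100%/5% split described above.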
Build It In, Don't Bolt It On
Observability is a feature. Just as we write unit tests for business logic, we actively instrument code with spans, tags, and context as we write it. If a developer cannot see exactly what their code is doing in production without resorting to print statements, the feature is not finished.