Observability
How telemetry, events, and tracing are propagated across the Wordloop platform using a Trace-First approach.
Instead of emitting fragmented logs, metrics, and traces, we generate high-cardinality, wide events (Spans) using OpenTelemetry (OTel). These spans serve as the single source of truth for the health, performance, and behavior of the entire platform.
Tracing Architecture
We utilize W3C Trace Context headers to propagate traces across every service boundary, ensuring that identity and context are never severed from the symptom.
- App (Next.js): Generates the root span for user interactions, authenticates via Clerk, and injects `clerk_user_id` into OTel Baggage as `enduser.id`.
- Core (Go): Uses `otel/sdk/go` to trace HTTP handlers, Postgres queries (via pgx), and Pub/Sub publishing. It automatically reads W3C Baggage from incoming requests and propagates it via Pub/Sub attributes.
- ML (Python): Uses `opentelemetry-python` to extract spans and identity Baggage from incoming Pub/Sub messages, trace ML pipelines, and propagate context when calling Core.
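Mechanically, this propagation rides on two HTTP headers: `traceparent` (W3C Trace Context) and `baggage`. The pure-Python sketch below shows what actually travels across each hop; the header values are illustrative, and production code relies on the OTel SDK's propagators rather than hand-rolled parsing:

```python
# Sketch of W3C Trace Context + Baggage parsing, illustrating what the
# OTel propagators do automatically at every service boundary.
# Values are illustrative; real code uses the SDK propagators.

def parse_traceparent(header: str) -> dict:
    """Split 'version-trace_id-parent_id-flags' into its fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": flags == "01",
    }

def parse_baggage(header: str) -> dict:
    """Parse 'key1=val1,key2=val2' baggage entries into a dict."""
    entries = (item.split("=", 1) for item in header.split(","))
    return {k.strip(): v.strip() for k, v in entries}

headers = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "baggage": "enduser.id=user_2abc123",  # injected by App from clerk_user_id
}

ctx = parse_traceparent(headers["traceparent"])
bag = parse_baggage(headers["baggage"])
```

Because `baggage` travels alongside `traceparent`, the `enduser.id` entry survives every hop, including the Pub/Sub boundary where it is carried as message attributes.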
Span-Derived Metrics
We do not manually instrument and roll up traditional RED (Rate, Errors, Duration) metrics at runtime. Emitting isolated metrics destroys the context necessary for debugging.
Instead, our system relies on dynamic aggregations of our wide spans. Because every span contains the exact duration, status code, and rich metadata (tenant IDs, roles), our observability backend continuously calculates and visualizes RED metrics derived directly from the trace stream. If an aggregate error rate spikes, engineers can simply click the spike to see the exact traces that generated it.
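Each RED metric is simply a fold over the span stream. The hypothetical sketch below models this with plain dicts (field names and window size are assumptions for illustration; the real aggregation runs in our observability backend, not in application code):

```python
# Hypothetical sketch: deriving RED metrics from a stream of wide spans.
# Every span already carries duration, status, and rich metadata, so Rate,
# Errors, and Duration are pure aggregations -- nothing is emitted separately.

spans = [
    {"route": "/api/words", "duration_ms": 12.0, "error": False, "tenant": "t1"},
    {"route": "/api/words", "duration_ms": 250.0, "error": True, "tenant": "t2"},
    {"route": "/api/words", "duration_ms": 30.0, "error": False, "tenant": "t1"},
]

window_s = 60  # assumed aggregation window the spans were collected over

rate = len(spans) / window_s                          # Rate: requests/second
error_ratio = sum(s["error"] for s in spans) / len(spans)  # Errors
durations = sorted(s["duration_ms"] for s in spans)
p50 = durations[len(durations) // 2]                  # Duration: median latency

# Because spans keep their metadata, a spike can be sliced immediately,
# e.g. which tenants produced the errors behind an error-rate spike:
failing_tenants = [s["tenant"] for s in spans if s["error"]]
```

This is why "click the spike to see the traces" works: the aggregate and the underlying traces are the same data, viewed at different resolutions.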
Logging
To ensure structural consistency, all logs are written as structured JSON and natively integrate the OpenTelemetry context.
- Go Logging: Implemented via `slog` with an OpenTelemetry handler.
- Python Logging: Implemented via `structlog`, naturally wrapping the OTel context.
Every log emitted within the scope of a request automatically inherits the `trace_id` and `span_id`, allowing developers to pivot from any application log to its parent trace.
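In effect, every log line is a JSON object that includes the active span's identifiers. A stdlib-only sketch of the shape (field names follow common OTel logging-bridge conventions; the real records are produced by the `slog` and `structlog` handlers, which read the IDs from the active span rather than taking them as arguments):

```python
import json

# Stdlib-only sketch of a structured log record that inherits the enclosing
# span's identifiers. Real code never passes trace_id/span_id by hand; the
# OTel-aware handler injects them from the active context.

def make_log_record(message: str, level: str,
                    trace_id: str, span_id: str, **fields) -> str:
    record = {
        "level": level,
        "message": message,
        "trace_id": trace_id,  # inherited from the enclosing span
        "span_id": span_id,
        **fields,
    }
    return json.dumps(record)

line = make_log_record(
    "word list saved", "INFO",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    enduser_id="user_2abc123",  # hypothetical baggage-derived field
)
```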
Telemetry Destinations & Sampling
Our services act purely as OTLP (OpenTelemetry Protocol) emitters. They never communicate directly with the final observability storage backend. Data routing and sampling are centrally managed.
Local Development (.NET Aspire)
Locally, all services export OTLP data to the .NET Aspire Dashboard.
- Run `./dev dash obs` (or start it automatically via `./dev start infra`).
- Access the UI at http://localhost:18888.
- You can view Traces, Metrics, and Structured Logs across all containers in real time. Since `enduser.id` Baggage is propagated, you can search for a user's exact ID to trace their entire session timeline end-to-end.
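Under the hood, the SDKs find the dashboard via the standard OTLP environment variables. The fragment below is illustrative only: the variable names are the standard ones every OTel SDK reads, but the endpoint, protocol, and service name shown here are assumptions; `./dev` wires up the real values.

```shell
# Illustrative only -- actual values are injected by ./dev.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"  # assumed local OTLP port
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="core"  # assumed service name
```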
Production Pipeline & Tail-Based Sampling
In production, SDKs do not push directly to Google Cloud. We deploy instances of the OpenTelemetry Collector Gateway to act as an intermediary buffer.
Because we employ Tail-Based Sampling to control ingest costs, the Collector buffers the entire distributed trace. Once the trace is complete, the Collector applies our sampling rules:
- 100% Sampling for Errors & High Latency: If any span anywhere in the trace breaches our latency threshold or contains an error, the entire trace is preserved and exported to Google Cloud.
- 5% Sampling for Happy Paths: If the request succeeded without anomalies, we drop 95% of them at the Collector level to save ingest and storage costs without sacrificing visibility into system failures.
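The two rules amount to a single per-trace decision the Collector makes once all spans have arrived. A simplified sketch (the threshold value and span fields are assumptions; the real rules live in the Collector's tail-sampling configuration, not in application code):

```python
import random

# Simplified model of the tail-based sampling decision.
LATENCY_THRESHOLD_MS = 1000.0  # assumed threshold; real value lives in Collector config
HAPPY_PATH_KEEP = 0.05         # keep 5% of anomaly-free traces

def keep_trace(spans: list[dict], rng: random.Random = random.Random()) -> bool:
    """Decide whether a fully buffered trace is exported or dropped."""
    # Rule 1: any error or slow span anywhere preserves the entire trace.
    if any(s["error"] or s["duration_ms"] > LATENCY_THRESHOLD_MS for s in spans):
        return True
    # Rule 2: happy paths are probabilistically sampled at 5%.
    return rng.random() < HAPPY_PATH_KEEP
```

The key property is that the decision is made per trace, not per span: a single failing span anywhere in the call graph rescues every span in that trace, so failures are never seen in fragments.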