
Observability

How telemetry, events, and tracing are propagated across the Wordloop platform using a Trace-First approach.

Instead of emitting fragmented logs, metrics, and traces, we generate high-cardinality, wide events (Spans) using OpenTelemetry (OTel). These spans serve as the single source of truth for the health, performance, and behavior of the entire platform.
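To make "wide event" concrete, here is an illustrative sketch of the shape of a single span record. The field names follow OpenTelemetry semantic conventions where they exist; the tenant and role attributes are hypothetical examples of high-cardinality fields, not a definitive schema.

```python
# Illustrative shape of one "wide event" span: a single record carrying
# everything needed to debug a request, instead of scattered logs/metrics.
wide_span = {
    "name": "POST /api/decks",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "duration_ms": 182.4,
    "status": "OK",
    "attributes": {
        "http.request.method": "POST",        # OTel semantic convention
        "http.response.status_code": 201,
        "enduser.id": "user_2abc123",         # injected from Clerk via Baggage
        "tenant.id": "acme-co",               # hypothetical high-cardinality field
        "user.role": "editor",                # hypothetical high-cardinality field
    },
}

def is_error(span: dict) -> bool:
    """A span counts as an error if its status is anything other than OK."""
    return span["status"] != "OK"

print(is_error(wide_span))  # False
```

Because the identity and tenant attributes live on the span itself, any aggregate built from these records can always be broken back down to the individual requests behind it.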

Tracing Architecture

We utilize W3C Trace Context headers to propagate traces across every service boundary, ensuring that identity and context are never severed from the symptom.

Trace context flows through three services:
  • App (Next.js): Generates the root span for user interactions, authenticates via Clerk, and injects clerk_user_id into OTel Baggage as enduser.id.
  • Core (Go): Uses the OpenTelemetry Go SDK (go.opentelemetry.io/otel) to trace HTTP handlers, Postgres queries (via pgx), and Pub/Sub publishing. It automatically reads W3C Baggage from incoming requests and propagates it via Pub/Sub message attributes.
  • ML (Python): Uses opentelemetry-python to extract trace context and identity Baggage from incoming Pub/Sub messages, trace ML pipelines, and propagate context when calling Core.

Span-Derived Metrics

We do not manually instrument and roll up traditional RED (Rate, Errors, Duration) metrics at runtime. Emitting isolated metrics destroys the context necessary for debugging.

Instead, our system relies on dynamic aggregations of our wide spans. Because every span contains the exact duration, status code, and rich metadata (tenant IDs, roles), our observability backend continuously calculates and visualizes RED metrics derived directly from the trace stream. If an aggregate error rate spikes, engineers can simply click the spike to see the exact traces that generated it.
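The aggregation itself is straightforward once spans are the input. A simplified sketch of deriving RED metrics from a window of span records (the backend does this continuously; thresholds and field names here mirror the hypothetical span shape, not a real query):

```python
def red_metrics(spans: list[dict], window_s: float) -> dict:
    """Derive Rate / Errors / Duration from raw spans, not pre-aggregated counters."""
    durations = sorted(s["duration_ms"] for s in spans)
    errors = [s for s in spans if s["status"] != "OK"]
    n = len(spans)
    return {
        "rate_per_s": n / window_s,
        "error_ratio": len(errors) / n if n else 0.0,
        "p50_ms": durations[n // 2],
        "p99_ms": durations[min(n - 1, int(n * 0.99))],
        # Because each span keeps its metadata, the exact traces behind any
        # spike are immediately available for drill-down:
        "error_trace_ids": [s["trace_id"] for s in errors],
    }

spans = [
    {"trace_id": "t1", "duration_ms": 12.0, "status": "OK"},
    {"trace_id": "t2", "duration_ms": 250.0, "status": "ERROR"},
    {"trace_id": "t3", "duration_ms": 18.0, "status": "OK"},
]
m = red_metrics(spans, window_s=60.0)
```

The last field is the whole point: the aggregate and the underlying traces are the same data, so "click the spike" is just a filter, not a join across disconnected systems.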

Logging

To ensure structural consistency, all logs are written as structured JSON and natively integrate the OpenTelemetry context.

  • Go Logging: Implemented via slog with an OpenTelemetry handler.
  • Python Logging: Implemented via structlog naturally wrapping the OTel context.

Every log emitted within the scope of a request automatically inherits the trace_id and span_id, allowing developers to find any application log by looking at its parent trace.
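The mechanism can be sketched with the stdlib alone: the active span's IDs live in request-scoped context, and the log formatter injects them into every JSON line. (In our services the OTel integrations for slog and structlog do this; the context variable and formatter below are a hand-rolled illustration, not our actual wiring.)

```python
import contextvars
import json
import logging

# The active span's IDs, carried in a context variable. The OTel SDK manages
# this for real; it is recreated here only to show the inheritance mechanism.
current_span = contextvars.ContextVar("current_span", default=None)

class TraceJSONFormatter(logging.Formatter):
    """Emit structured JSON, injecting trace_id/span_id from the active context."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        span = current_span.get()
        if span:
            entry["trace_id"] = span["trace_id"]
            entry["span_id"] = span["span_id"]
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(TraceJSONFormatter())
log = logging.getLogger("wordloop")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Inside a request, the span context is already set, so every log line
# emitted here automatically carries the IDs of its parent trace.
current_span.set({"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                  "span_id": "00f067aa0ba902b7"})
log.info("deck saved")
```

The application code never mentions the trace: correlation comes for free from the ambient context, which is exactly what makes "find the log's parent trace" reliable.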

Telemetry Destinations & Sampling

Our services act purely as OTLP (OpenTelemetry Protocol) emitters. They never communicate directly with the final observability storage backend. Data routing and sampling are centrally managed.

Local Development (.NET Aspire)

Locally, all services export OTLP data to the .NET Aspire Dashboard.

  1. Run ./dev dash obs (or start it automatically via ./dev start infra).
  2. Access the UI at http://localhost:18888.
  3. You can view Traces, Metrics, and Structured Logs across all containers in real-time. Since enduser.id Baggage is propagated, you can search for a user's exact ID to trace their entire session timeline end-to-end.

Production Pipeline & Tail-Based Sampling

In production, SDKs do not push directly to Google Cloud. We deploy instances of the OpenTelemetry Collector Gateway to act as an intermediary buffer.

Because we employ Tail-Based Sampling to keep ingest costs under control, the Collector buffers the entire distributed trace. Once the trace is complete, the Collector applies our sampling rules:

  • 100% Sampling for Errors & High Latency: If any span anywhere in the trace breaches our latency threshold or contains an error, the entire trace is preserved and exported to Google Cloud.
  • 5% Sampling for Happy Paths: If the request succeeded without anomalies, we drop 95% of them at the Collector level to save ingest and storage costs without sacrificing visibility into system failures.
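The decision logic above reduces to a small function over the complete trace. A sketch, assuming a hypothetical 2000 ms latency threshold (the doc does not state the actual value) and the 5% happy-path rate:

```python
import random

LATENCY_THRESHOLD_MS = 2000  # hypothetical threshold; the real value may differ
HAPPY_PATH_RATE = 0.05       # keep 5% of clean traces

def keep_trace(spans: list[dict], rng=random) -> bool:
    """Tail-based decision over a *complete* trace: unlike head-based sampling,
    the Collector decides only after every span in the trace has arrived."""
    if any(s["status"] != "OK" for s in spans):
        return True  # 100% of traces containing an error, anywhere
    if any(s["duration_ms"] > LATENCY_THRESHOLD_MS for s in spans):
        return True  # 100% of traces breaching the latency threshold
    return rng.random() < HAPPY_PATH_RATE  # probabilistic 5% of happy paths

# A trace is kept if *any* of its spans is slow or failed:
assert keep_trace([{"status": "OK", "duration_ms": 10.0},
                   {"status": "ERROR", "duration_ms": 3.0}])
```

This is why the decision must wait for the whole trace: a fast, successful root span can still belong to a trace whose downstream span failed, and a head-based sampler would have already dropped it.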
