Trace Decoration for Fun and Profit

Written by Christian Koch | Dec 22, 2025 4:18:24 PM

It's 11 PM. PagerDuty fires. Checkout latency is spiking.

You open a slow trace. Phoenix shows the route. Ecto shows 12 queries. You see that it was slow. You don't see why. Large order? Retry? Cold cache? The trace doesn't say.

This is the gap between infrastructure telemetry and business context. Decoration bridges it.

This is the third post in our observability series. Part 1 covered the three pillars. Part 2 introduced domain instrumentation and context propagation. This post is about enriching spans so they carry meaning, not just timing.

Terms

Span: A named, timed operation. The atom of distributed tracing.

Attributes: Key-value metadata attached to a span. Static context: user_id: 123, item_count: 47, tenant: "acme".

Events: Timestamped moments within a span. Dynamic markers: "cache miss", "retry attempted", "payment started".

Status: The outcome. Unset (the default), OK, or Error, with a description when things fail.

Decoration: Adding attributes, events, and status to spans. Taking "something happened" to "here's what, why, and for whom."

Trace: The tree of spans recorded for one request or action. It covers only the code that was instrumented, not everything that ran.
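To make these terms concrete, here is a plain-data sketch of what a single span carries. SpanSketch and its functions are hypothetical and exist only for illustration. They are not the OpenTelemetry SDK's internal representation or its API.

```elixir
# Hypothetical model of a span, mirroring the terms above.
# This is NOT the OpenTelemetry SDK's data structure or API.
defmodule SpanSketch do
  defstruct name: nil,
            start_ns: nil,
            end_ns: nil,
            attributes: %{},
            events: [],
            status: :unset

  # Span: start a named, timed operation.
  def start(name) do
    %__MODULE__{name: name, start_ns: System.monotonic_time(:nanosecond)}
  end

  # Attributes: key-value metadata describing the span as a whole.
  def put_attribute(span, key, value) do
    %{span | attributes: Map.put(span.attributes, key, value)}
  end

  # Events: timestamped markers recorded inside the span, in order.
  def add_event(span, name, attrs \\ %{}) do
    event = %{name: name, at_ns: System.monotonic_time(:nanosecond), attributes: attrs}
    %{span | events: span.events ++ [event]}
  end

  # Status: the outcome, with a description on failure.
  def set_error(span, description), do: %{span | status: {:error, description}}
  def set_ok(span), do: %{span | status: :ok}

  # Ending the span fixes its duration.
  def finish(span), do: %{span | end_ns: System.monotonic_time(:nanosecond)}

  def duration_ns(%__MODULE__{start_ns: s, end_ns: e}), do: e - s
end

span =
  SpanSketch.start("checkout.submit")
  |> SpanSketch.put_attribute("app.item_count", 47)
  |> SpanSketch.add_event("payment_started")
  |> SpanSketch.set_error("card_declined")
  |> SpanSketch.finish()
```

The point of the model is the split of responsibilities. Attributes answer "what was this?", events answer "when did each step happen?", and status answers "did it work?".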

The Abstraction Gap

Auto-instrumentation operates at the infrastructure layer. Phoenix knows routes. Ecto knows queries. Neither knows your domain.

Your business logic lives above: orders, shipments, invoices, tenants. At 2 AM, the question that matters isn't which HTTP verb ran. It's why this customer's order failed.

Decoration pushes domain context into spans so traces speak your language.

Setup

The open_telemetry_decorator library wraps a function in a span with a single @decorate line. Add it to your dependencies alongside the OpenTelemetry SDK and exporter:

# mix.exs
defp deps do
  [
    {:opentelemetry, "~> 1.4"},
    {:opentelemetry_exporter, "~> 1.7"},
    {:open_telemetry_decorator, "~> 1.5"}
  ]
end

Then configure an attribute namespace, for example in config/config.exs, so your custom attributes don't collide with OpenTelemetry's semantic conventions:

config :o11y, :attribute_namespace, "app"
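In effect, the namespace acts as a prefix on your custom attribute keys. That is why the attribute appears as app.item_count later in this post rather than plain item_count. The following is a minimal sketch of that mapping. NamespaceSketch is hypothetical and does not reflect how O11y implements it internally.

```elixir
defmodule NamespaceSketch do
  # Hypothetical helper, not O11y's code: prefixes every attribute key
  # with the namespace, so :item_count becomes "app.item_count".
  def namespaced(attrs, namespace) when is_map(attrs) do
    Map.new(attrs, fn {key, value} -> {"#{namespace}.#{key}", value} end)
  end
end

attrs = NamespaceSketch.namespaced(%{user_id: 123, item_count: 47}, "app")
```

Keeping your own keys under one prefix also makes them easy to tell apart from semantic-convention keys such as http.route when you search in your tracing backend.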

Decorating a Function

defmodule MyApp.Checkout do
  use OpenTelemetryDecorator

  @decorate with_span("checkout.submit", include: [:user_id, :item_count, :total])
  def submit(user, cart) do
    user_id = user.id
    item_count = length(cart.items)
    total = Cart.total(cart)

    O11y.add_event("inventory_check")
    :ok = Inventory.reserve(cart.items)

    O11y.add_event("payment_started")
    {:ok, charge} = Payments.charge(user, total)
    O11y.add_event("payment_complete", %{charge_id: charge.id})

    Orders.create(user, cart, charge)
  end
end

The include option captures the named variables as span attributes, and events mark phases. When a step fails, record the error on the span:

case Payments.charge(user, total) do
  {:ok, charge} ->
    {:ok, charge}
  {:error, reason} ->
    O11y.set_error(inspect(reason))
    {:error, :payment_failed}
end

What to Capture

Identifiers: user_id, order_id, tenant_id. The nouns.

Quantities: item_count, batch_size, retry_count. The scale.

States: subscription_tier, feature_flag, cache_hit. The context.

Outcomes: result, error_type, fallback_used. The answers.

Attributes describe the span as a whole. Events describe moments within it. Use attributes for "what was this request?" Use events for "where did the time go?"
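Events are timestamps, not durations, so "where did the time go?" is answered by the gaps between consecutive events. Here is a minimal sketch of that arithmetic. It uses a hypothetical PhaseSketch module and plain maps with made-up millisecond offsets, which is not the shape any exporter actually produces.

```elixir
defmodule PhaseSketch do
  # Turns a span's events into phase durations. Each phase is labelled by
  # the marker that opens it and lasts until the next marker (or span end).
  def phases(span_start, span_end, events) do
    sorted = Enum.sort_by(events, & &1.at)
    opening = %{name: "span_start", at: span_start}
    closing = %{name: "span_end", at: span_end}

    ([opening | sorted] ++ [closing])
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [from, to] -> {from.name, to.at - from.at} end)
  end
end

# Made-up millisecond offsets for the checkout.submit span; input order
# does not matter because events are sorted by timestamp first.
events = [
  %{name: "payment_started", at: 805},
  %{name: "inventory_check", at: 5},
  %{name: "payment_complete", at: 905}
]

phases = PhaseSketch.phases(0, 950, events)
# => [{"span_start", 5}, {"inventory_check", 800},
#     {"payment_started", 100}, {"payment_complete", 45}]
```

Trace viewers typically draw events as marks on the span's timeline, so you can usually read these gaps visually. The sketch just makes the subtraction explicit.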

Back to 11 PM

Same alert. Same spike. Different trace.

Now you see app.item_count: 127. The inventory_check event lands 800ms before payment_started: 127 items, 127 stock lookups. Mystery solved in 10 seconds.

GROUP BY app.item_count confirms it: every slow checkout has 50+ items. You've found your optimization target without touching a log file.
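The GROUP BY itself runs in your tracing backend, in whatever query language it offers. The sketch below shows the same idea in Elixir over a handful of hypothetical, made-up span maps: bucket spans by app.item_count and compare the slowest duration in each bucket. GroupBySketch and the span shape are assumptions for illustration only.

```elixir
defmodule GroupBySketch do
  # Splits spans at an item-count threshold and returns the slowest
  # duration (ms) seen in each bucket.
  def max_latency_by_size(spans, threshold) do
    spans
    |> Enum.group_by(fn span ->
      if span.attributes["app.item_count"] >= threshold, do: :large, else: :small
    end)
    |> Map.new(fn {bucket, group} ->
      {bucket, group |> Enum.map(& &1.duration_ms) |> Enum.max()}
    end)
  end
end

# Made-up spans with the namespaced attribute the decorator would record.
spans = [
  %{duration_ms: 120, attributes: %{"app.item_count" => 3}},
  %{duration_ms: 95, attributes: %{"app.item_count" => 12}},
  %{duration_ms: 910, attributes: %{"app.item_count" => 127}},
  %{duration_ms: 640, attributes: %{"app.item_count" => 58}}
]

by_size = GroupBySketch.max_latency_by_size(spans, 50)
```

None of this works unless item_count was recorded on the span in the first place. The aggregation is only as good as the decoration.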

The Return

Decorated traces compound:

Search becomes useful. Find spans where app.tenant = "acme" and status = error.
Aggregations become meaningful. P99 latency for orders over $500. Retry rate by payment provider.
Debugging becomes fast. Open a trace, see the order ID, customer tier, item count, and timeline. No log correlation.
Alerting becomes precise. Fire when app.retry_count > 3, not just "errors happened."
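Both the search and the alert condition in that list reduce to predicates over span attributes and status. The sketch below runs them over hypothetical in-memory span maps. SearchSketch, the tenant names, and the {:error, description} status shape are all assumptions. In practice, your tracing backend evaluates these queries for you.

```elixir
defmodule SearchSketch do
  # Search: spans where app.tenant equals the given tenant and the
  # status is an error.
  def errors_for_tenant(spans, tenant) do
    Enum.filter(spans, fn span ->
      span.attributes["app.tenant"] == tenant and match?({:error, _}, span.status)
    end)
  end

  # Alert: true if any span's app.retry_count exceeds the threshold.
  # Spans without the attribute count as zero retries.
  def should_alert?(spans, max_retries) do
    Enum.any?(spans, fn span ->
      Map.get(span.attributes, "app.retry_count", 0) > max_retries
    end)
  end
end

spans = [
  %{status: :ok, attributes: %{"app.tenant" => "acme", "app.retry_count" => 0}},
  %{status: {:error, "card_declined"}, attributes: %{"app.tenant" => "acme", "app.retry_count" => 4}},
  %{status: {:error, "timeout"}, attributes: %{"app.tenant" => "globex", "app.retry_count" => 1}}
]

acme_errors = SearchSketch.errors_for_tenant(spans, "acme")
alert_fired = SearchSketch.should_alert?(spans, 3)
```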

Undecorated traces are timestamps. Decorated traces are answers.
