In Why You Need Telemetry, I made the case for observability. Metrics tell you something's wrong. Logs tell you why. Traces show you the path a request took through your system.
Since then, we've shipped OpenTelemetry across two production applications. The experience taught us something the getting-started guides skip over: getting traces is easy. Getting useful traces requires two patterns that aren't obvious from the documentation.
A Quick Refresher on Traces
A trace represents a single operation as it moves through your system. It's composed of spans—named, timed blocks that represent units of work. A parent span might be an HTTP request. Child spans might be database queries, external API calls, or function executions within that request.
Each span can carry attributes: key-value pairs that add context. A database span might include the SQL query. An HTTP span might include the status code.
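In Elixir (the language of the examples below), creating a span with an attribute takes a few lines. A minimal sketch using the core API; run_query/0 stands in for real work:

require OpenTelemetry.Tracer, as: Tracer

# Opens a span named "db.query", runs the block, and ends the span,
# recording the query text as an attribute along the way.
Tracer.with_span "db.query" do
  Tracer.set_attributes(%{"db.statement" => "SELECT * FROM requisitions"})
  run_query()
end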
When you add OpenTelemetry libraries for your web framework and database, you get auto-instrumentation. These libraries hook into your stack and generate spans automatically. No code changes required.
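In an Elixir app, that hookup is typically one setup call per library at application start. A sketch assuming the opentelemetry_phoenix and opentelemetry_ecto packages and an Ecto repo emitting the default [:my_app, :repo] telemetry prefix; exact options vary by package version:

# In Application.start/2, before the rest of the app begins serving traffic
OpentelemetryPhoenix.setup()
OpentelemetryEcto.setup([:my_app, :repo])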
This is where most guides stop. It's also where the real work begins.
Pattern 1: Instrument Your Domain, Not Just Your Infrastructure
Auto-instrumentation captures infrastructure: HTTP request boundaries, database query timing, external service calls. This is useful for performance analysis. It's not useful for debugging business logic.
Here's the problem. When something goes wrong in production, you rarely ask "how long did this database query take?" You ask "what happened to order 12345?" or "why did this user's request fail?"
Auto-instrumentation can't answer these questions. It doesn't know about orders or users or business operations. It only knows about HTTP and SQL.
Consider what a trace looks like with auto-instrumentation alone:
POST /gql → 200 OK (245ms)
└── SELECT * FROM requisitions (12ms)
The request succeeded. A query ran. That's all you know.
Now the same request with domain instrumentation—custom spans and attributes that reflect your application's business logic:
POST /gql → 200 OK (245ms)
├── user_id=123, account_id=456
├── graphql.operation=updateRequisitionStatus
└── requisitions.set_status
    ├── requisition_id=789
    ├── previous_status=draft → target_status=submitted
    └── [event] status_changed
Same request. Far more information. You can now search your traces for a specific requisition ID. You can see state transitions. You can answer "what happened?" without grepping logs or querying the database.
How to Add Domain Instrumentation
The mechanics are straightforward. When you enter a significant business operation, create a span. Attach attributes that provide context. Record events for important moments within the operation.
# set_attributes/1 and add_event/2 delegate to OpenTelemetry.Tracer;
# @decorate trace/1 is a decorator macro that opens a span around the
# function body and ends it when the function returns.
@decorate trace("requisitions.set_status")
def set_status(scope, requisition, status) do
  previous_status = requisition.status

  set_attributes(%{
    "requisition_id" => requisition.id,
    "account_id" => requisition.account_id,
    "previous_status" => previous_status,
    "target_status" => status
  })

  # ... perform the status change ...

  add_event("status_changed", %{
    "from" => previous_status,
    "to" => status
  })
end
Three things make the difference:
Span names should reflect business operations, not technical endpoints. requisitions.set_status tells you more than POST /gql.
Attributes should include entity IDs and state. When debugging, you'll search by these values. requisition_id=789 lets you find every trace involving that entity.
Events mark moments within a span. A single operation might have multiple important moments: validation passed, external service called, state changed. Events capture these without creating separate spans.
You don't need to instrument everything. Start with your critical business operations—the paths where bugs cost you money or customers.
Pattern 2: Propagate Context Across Async Boundaries
Here's a scenario. A user submits a form. Your application validates the input, saves to the database, and enqueues a background job to send a confirmation email. The job runs, calls an email service, and fails.
With auto-instrumentation, you get two traces: one for the HTTP request, one for the background job. They're not connected. When the email fails, you can't easily answer "which user action triggered this?"
This happens because of async boundaries—places where your code hands off work to another process. Background job queues, spawned tasks, message brokers, webhook dispatches. Each async boundary starts a new trace unless you explicitly carry context forward.
Trace context is the mechanism that connects spans across boundaries. It's a set of identifiers (trace ID, span ID, and some flags) that tell the tracing system "this work is part of that earlier operation."
When you enqueue a background job, the job runs in a separate process, possibly on a separate machine, possibly minutes later. It has no way to know it's related to the original HTTP request unless you tell it.
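Concretely, OpenTelemetry's default propagation format is W3C Trace Context: a single traceparent value that looks like this:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The four dash-separated fields are the version, the 32-hex-character trace ID, the 16-hex-character parent span ID, and sampling flags. Propagation just means getting this value to the other side of the boundary and reading it back.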
How to Propagate Context
The pattern has two steps: inject context when crossing the boundary, extract it on the other side.
When enqueueing a job, inject the current trace context into the job's metadata:
defp add_trace_context(changeset) do
  meta = Changeset.get_field(changeset, :meta, %{})

  # Inject the current trace context. The default carrier is a list of
  # {key, value} tuples (e.g. the "traceparent" header), so convert it
  # to a map before merging into the job's metadata.
  trace_headers = Map.new(:otel_propagator_text_map.inject([]))

  Changeset.change(changeset, %{meta: Map.merge(meta, trace_headers)})
end
When the job executes, extract that context before doing work:
def handle_job(job) do
  # Extract the trace context from job metadata. The default carrier
  # getter expects a list of {key, value} tuples, so convert the map.
  :otel_propagator_text_map.extract(Map.to_list(job.meta))

  # Any spans created from here on are children of the original trace.
  do_work(job)
end
The inject call serializes the current trace context into key-value pairs (the traceparent value shown earlier). The extract call deserializes them and makes the result the active context. Any spans created after extraction appear as children of the original request's trace.
This pattern applies everywhere your code goes async:
- Background jobs: Inject into job metadata at enqueue time
- Spawned processes: Capture context before spawning, attach it inside the new process (see the sketch below)
- Message queues: Include context in message headers
- Webhooks: Pass trace headers to downstream services
Each unhandled boundary is a gap in your traces—a place where the story breaks.
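For webhooks and message queues, the same inject call from the job example works unchanged on outgoing headers. For spawned processes, no serialization is needed: capture a context handle in the parent and attach it in the child. A minimal sketch using Task; OpenTelemetry.Ctx.get_current/0 and attach/1 are the real API, do_work/0 is a stand-in:

# Capture the current trace context in the parent process...
ctx = OpenTelemetry.Ctx.get_current()

Task.async(fn ->
  # ...and attach it in the child, so spans created here
  # join the parent's trace instead of starting a new one.
  OpenTelemetry.Ctx.attach(ctx)
  do_work()
end)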
Why This Matters
Auto-instrumentation is necessary but not sufficient. It tells you that requests happened, queries ran, services responded. It doesn't tell you what your application did or why.
Domain instrumentation closes this gap. It captures the business context—entity IDs, state transitions, operation names—that you actually need when debugging.
Context propagation ensures that traces tell complete stories. When a background job fails, you can trace it back to the user action that started the chain.
Together, these patterns transform traces from performance metrics into debugging tools. The next time something goes wrong at 2 AM, you'll spend less time correlating logs and more time fixing the problem.