Observability: Distributed Tracing

Other Resources

OpenTelemetry Instrumentation Guides

What is Distributed Tracing?

Services rarely run in isolation. They call other services, and a single user request might touch 5, 10, or even 20 of them. Distributed tracing tracks that request's entire journey, showing you exactly where time is spent and where things go wrong. Think of it like GPS tracking for your requests - you can see the complete path from start to finish, including any detours or traffic jams along the way.

Without distributed tracing, debugging microservices is like trying to solve a mystery with half the clues missing. You might know something went wrong, but good luck figuring out where or why.

Distributed tracing gives you:

  • End-to-end visibility: See exactly how requests flow through your system
  • Performance insights: Spot slow services, database calls, or network issues
  • Error context: When something breaks, see exactly where and what caused it
  • Dependency mapping: Understand which services depend on what

For tracing to work, each service needs to pass the trace context to the next service. This happens automatically with most modern tracing libraries - they add headers to outgoing HTTP requests, messages published to queues, and so on.

Trace Context

OpenTelemetry uses two core identifiers to correlate distributed requests:

  • trace_id: A unique identifier for the entire request journey (32 lowercase hex characters)
  • span_id: A unique identifier for each operation within that trace (16 lowercase hex characters)

These are your correlation IDs -- no need to invent separate ones. They tie logs, traces, and metrics together across all services involved in a request.

OpenTelemetry follows the W3C Trace Context standard for propagation. Services pass trace context via the traceparent HTTP header using the format:

{version}-{trace_id}-{span_id}-{trace_flags}

Example:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

When a service receives this header, it continues the same trace by creating child spans with the same trace_id but new span_id values. This creates the parent-child relationship that builds your complete trace tree.
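To make the mechanics concrete, here is a small standard-library Python sketch that parses a traceparent header and builds the one a service would send downstream (the function names are illustrative; real services should use their OpenTelemetry SDK's propagator rather than hand-rolling this):

```python
import re

# W3C Trace Context format: {version}-{trace_id}-{span_id}-{trace_flags}
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<trace_flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Split a traceparent header into its four fields, or None if malformed."""
    match = TRACEPARENT_RE.match(header)
    return match.groupdict() if match else None

def child_traceparent(parent: dict, new_span_id: str) -> str:
    """Header sent to the next service: same trace_id, new span_id."""
    return f"{parent['version']}-{parent['trace_id']}-{new_span_id}-{parent['trace_flags']}"

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(child_traceparent(ctx, "b7ad6b7169203331"))
```

Note that the trace_id is preserved end to end while each hop mints a fresh span_id; that is exactly what builds the parent-child tree.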

For more details on OpenTelemetry traces, see the OpenTelemetry Traces documentation.

What to Trace

Always Trace

  • External calls: HTTP requests, database queries, message publishing
  • Service boundaries: Incoming requests and outgoing responses
  • Critical business operations: Payment processing, user authentication, etc.

Consider Tracing

  • Expensive operations: File I/O, complex calculations, third-party API calls
  • Error-prone areas: Places where things often go wrong
  • Performance bottlenecks: Operations you're trying to optimize

Defining "Expensive Operations"

What counts as "expensive" varies by operation type. These thresholds are industry-standard guidelines to help you decide when custom tracing adds value:

Operation Type               Threshold      Source
User-facing API response     >200ms         Google SRE Workbook
Database query               >100ms         PostgreSQL Wiki
Cache operation (Redis)      >10ms          Redis Latency Docs
External API call            >500ms         AWS Well-Architected
Memory-intensive operation   >50MB          Industry practice
All errors                   Always trace   OpenTelemetry Best Practices

These are starting points, not rigid rules. Adjust based on your service's performance profile and user expectations. A background batch job might have different thresholds than a real-time API.
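One way to apply these thresholds in code is a small lookup helper. This is a sketch using the values from the table above; the names and structure are illustrative, not a real API:

```python
# Duration (in milliseconds) above which an operation counts as "expensive"
# and is worth a custom span. Values mirror the guideline table above.
EXPENSIVE_MS = {
    "api_response": 200,
    "db_query": 100,
    "cache_op": 10,
    "external_api": 500,
}

def worth_tracing(op_type: str, duration_ms: float, is_error: bool = False) -> bool:
    """Errors are always traced; otherwise compare against the per-type threshold."""
    if is_error:
        return True
    threshold = EXPENSIVE_MS.get(op_type)
    return threshold is not None and duration_ms > threshold

print(worth_tracing("db_query", 250))                    # True: slower than 100ms guideline
print(worth_tracing("cache_op", 3))                      # False: fast cache hit
print(worth_tracing("api_response", 50, is_error=True))  # True: errors always traced
```

In practice you would tune the dictionary per service rather than hard-coding the industry defaults.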

Don't Over-Trace

Tracing adds overhead, so be selective. You don't need to trace every function call - focus on the operations that matter for debugging and performance monitoring.

Implementation with OpenTelemetry

OpenTelemetry is the industry standard for distributed tracing. It works with all major programming languages and integrates with most observability platforms.

Auto-Instrumentation

Most languages offer auto-instrumentation that automatically traces common operations like:

  • HTTP requests (both incoming and outgoing)
  • Database calls
  • Message queue operations
  • Popular framework operations

Start with auto-instrumentation - it gives you 80% of what you need with minimal effort.
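In Python, for example, auto-instrumentation can be enabled with no code changes via the opentelemetry-instrument wrapper (a sketch; check the OpenTelemetry Python docs for the package names and flags matching your version):

```shell
# Install the distro plus instrumentations auto-detected from your dependencies
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app under the agent; HTTP, database, and framework
# operations are traced automatically
opentelemetry-instrument \
    --traces_exporter otlp \
    --service_name my-service \
    python app.py
```

Other languages offer equivalents, such as the Java agent JAR or the Node.js auto-instrumentations package.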

Custom Instrumentation

For business-specific operations, add custom spans around important code sections. This is where you'll trace things like:

  • Complex business logic
  • Internal service calls
  • Custom integrations

Sampling

Don't trace every single request - it's expensive and usually unnecessary. Use sampling to trace a representative subset:

  • Head-based sampling: Decide at the start of a trace
  • Tail-based sampling: Decide after seeing the complete trace (useful for keeping all error traces)

Start with a 10% sampling rate and adjust based on your traffic volume and needs.

Sampling Rate Adjustment Criteria:

  • 10% sampling: Default for traffic under 1,000 requests per minute
  • 1-5% sampling: For traffic between 1,000-10,000 requests per minute
  • 0.1-1% sampling: For traffic above 10,000 requests per minute
  • 100% error traces: Always capture traces that contain errors regardless of sampling rate
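Head-based sampling is usually implemented by hashing or comparing the trace_id against a ratio, so every service makes the same keep/drop decision for a given trace. Below is a stdlib sketch of that idea plus the rate guidelines above (the always-keep-errors rule shown here only works where the error is already known, i.e. tail-based; the real OpenTelemetry equivalent of the ratio check is the TraceIdRatioBased sampler):

```python
MAX_TRACE_ID = 2 ** 128 - 1  # trace_id is 32 hex chars = 128 bits

def pick_rate(requests_per_minute: int) -> float:
    """Map traffic volume to a sampling rate, per the guidelines above."""
    if requests_per_minute < 1_000:
        return 0.10
    if requests_per_minute <= 10_000:
        return 0.05
    return 0.01

def should_sample(trace_id: str, rate: float, has_error: bool = False) -> bool:
    """Deterministic: the same trace_id and rate always yield the same decision."""
    if has_error:
        return True  # always keep error traces
    return int(trace_id, 16) <= MAX_TRACE_ID * rate

# A low trace_id falls inside the 10% window; a high one does not
print(should_sample("0" * 31 + "1", 0.10))                  # True
print(should_sample("f" * 32, 0.10))                        # False
print(should_sample("f" * 32, 0.10, has_error=True))        # True
```

Because the decision derives from the trace_id itself, no coordination between services is needed: each hop independently reaches the same verdict.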

Review Cadence:

  • Review sampling rates monthly under normal operations
  • After a release, review sampling effectiveness daily for 5 days to ensure you're capturing enough data to identify issues without overwhelming your tracing infrastructure

Span Naming

  • Use descriptive, consistent names: GET /users/{id} not GET /users/12345
  • Include the operation type: db.query.select_user
  • Keep names stable - don't include dynamic values
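A common pitfall is leaking dynamic values into span names, which explodes name cardinality. A small normalizer sketches the fix (the regexes are illustrative; most web frameworks expose the route template directly, which is the better source):

```python
import re

UUID_RE = (
    r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
    r"[0-9a-f]{4}-[0-9a-f]{12}"
)

def normalize_span_name(method: str, path: str) -> str:
    """Replace dynamic path segments with placeholders so names stay stable."""
    path = re.sub(UUID_RE, "/{uuid}", path)   # UUID segments first
    path = re.sub(r"/\d+", "/{id}", path)     # then plain numeric IDs
    return f"{method} {path}"

print(normalize_span_name("GET", "/users/12345"))           # GET /users/{id}
print(normalize_span_name("GET", "/orders/12345/items/7"))  # GET /orders/{id}/items/{id}
```

With stable names, "GET /users/{id}" aggregates into one meaningful latency series instead of one series per user.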

How Tracing and Logging Work Together

Tracing and logging serve different purposes and complement each other:

Traces are for normal operations. When everything works as expected, traces capture the complete request flow: which services were called, how long each step took, what dependencies were involved. This is your "everything worked" data.

Logs are for anomalies. When something unusual happens - a business rule violation, an approaching resource limit, an unexpected error - logs surface it for human attention.

This is why we set production log level to Warn instead of Info. We don't need to log every successful request because traces already capture that information. Logs focus on deviations from normal operation.

Trace context ties them together. When a trace is active, your logs should include the trace_id and span_id fields. This correlation mechanism lets you:

  1. Find the relevant log entry (the anomaly that needs attention)
  2. Pull up the associated trace using the trace_id (the full context of what happened)
  3. Understand both what went wrong and why

Most OpenTelemetry SDKs automatically inject trace_id and span_id into your logging context when using structured logging. See the Observability: Logging guide for details on structured log format and trace correlation.
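If your SDK does not inject these automatically, the idea can be sketched with stdlib logging alone (here the trace context comes from a contextvar; in real code it would be read from the active OpenTelemetry span):

```python
import contextvars
import logging

# In real code this would come from the active OpenTelemetry span context
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))

class TraceContextFilter(logging.Filter):
    """Copy trace_id/span_id from the context into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.WARNING)

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.warning("payment declined")  # emitted with trace_id and span_id attached
```

Every log line now carries the identifiers needed to jump straight from the anomaly to its full trace.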

This approach gives you complete observability while keeping costs reasonable. You're not paying to store logs for millions of successful requests when traces already capture that data more efficiently.
