
Observability: Logging

What to Log

Logging is one of those things that seems simple until you're debugging a production incident at 2 AM and realize your logs are useless.

Good logging requires consistency - in field names, detail levels, and structure. Without it, debugging becomes expensive and automation becomes impossible.

Some people advocate "log everything!" That sounds great until you're drowning in noise and paying for storage you don't need. Instead, be intentional about what you log and what you don't.

Context is Everything

Compare these two log entries:

  • user login failed
  • {"user_id":"jdoe", "event":"login", "level":"warn", "timestamp":"2025-11-11T14:23:45Z", "trace_id":"4bf92f3577b34da6a3ce929d0e0e4736", "span_id":"00f067aa0ba902b7", "error":"invalid_password"}

The second one tells a story. You know who, what, when, and why. The first one just tells you something went wrong somewhere for someone.

Adding context helps you piece together what happened. But here's the catch: without unique identifiers, you'll get lost trying to track actions across services. In a microservice architecture, correlation is everything.

The OpenTelemetry trace_id serves as the primary correlation identifier - it's shared across all spans in a distributed trace, allowing you to track a request's journey through your entire system. The span_id identifies the specific operation within that trace. Together, they link your logs to distributed traces for complete request visibility.
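To make this concrete, here is a minimal sketch of emitting such an entry using only Python's standard library. The field names mirror the example entry above; the randomly generated IDs stand in for values a real tracer (for example, an OpenTelemetry SDK) would supply:

```python
import json
import os
import time

def make_log(event, level, trace_id, span_id, **fields):
    """Build a structured log record carrying W3C-format trace context."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "event": event,
        "trace_id": trace_id,  # 32 lowercase hex chars, shared by all spans in a trace
        "span_id": span_id,    # 16 lowercase hex chars, unique per operation
    }
    record.update(fields)
    return json.dumps(record)

# Illustrative only: a real tracer generates and propagates these IDs.
trace_id = os.urandom(16).hex()  # 16 random bytes -> 32 hex chars
span_id = os.urandom(8).hex()    # 8 random bytes  -> 16 hex chars

print(make_log("login", "warn", trace_id, span_id,
               user_id="jdoe", error="invalid_password"))
```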

Recommendations

1. Log to stdout

For containerized applications, always log to stdout. Let your container orchestration platform handle log collection and routing. Don't try to manage log files inside containers.
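As an illustrative sketch, here is Python's standard logging module configured to write to stdout; the logger name `payments` is hypothetical:

```python
import logging
import sys

# Send all log output to stdout so the container runtime can collect and
# route it; never write to files inside the container.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

logger = logging.getLogger("payments")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("service started")  # appears on stdout, not in a log file
```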

2. Use Structured Logging

Treat logs as operational data, not just debugging output. Use JSON or another structured format that you can query and analyze.

Structured logs enable:

  • Filtering by specific fields
  • Automated alerting
  • Performance analysis
  • Trend detection
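One way to get there with only the standard library is a custom JSON formatter. This sketch is illustrative, not a standard API: the logger name `orders` and the convention of passing structured data under a `fields` key are assumptions.

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    converter = time.gmtime  # timestamps in UTC, matching the trailing "Z"

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "severity": record.levelname.lower(),
            "component": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields the caller passed via `extra`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("orders")  # hypothetical service name
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order processed",
            extra={"fields": {"order_id": "o-123", "duration_ms": 42}})
```

With every entry being a single JSON line, your log pipeline can filter on `severity`, group by `component`, or aggregate `duration_ms` without any parsing heuristics.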

3. Use the Right Log Levels

Not everything deserves the same level of attention. Here's how we use log levels:

  • Debug - Low-level diagnostic information for development and troubleshooting
  • Info - Normal business events (user logged in, order processed, etc.)
  • Warn - System working correctly, but something unusual happened that might need attention
  • Error - Something failed, but we're attempting recovery
  • Critical - Fatal errors that prevent the application from starting or functioning

Production default: Set production logging to Warn level.

Why Warn instead of Info? Because we use distributed tracing to capture normal operational flow. Traces show you the complete request journey with timing, dependencies, and context. They're designed for "everything worked" scenarios.

Logs at Warn and above surface the exceptions - things that deviate from normal operation and actually need human attention. This keeps your logs focused on actionable information instead of drowning in noise about successful operations.

The combination gives you complete observability:

  • Traces (with sampling) capture detailed request flows and performance data
  • Trace context (trace_id and span_id) ties logs and traces together when you need to investigate
  • Warn/Error/Critical logs highlight anomalies and failures that need attention

You get full visibility without paying to store logs for millions of successful requests.
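A minimal sketch of the production default in Python, assuming a `LOG_LEVEL` environment variable as the override mechanism (that variable name is a local convention, not a logging-module feature; Python spells the level `WARNING`):

```python
import logging
import os
import sys

# Default production verbosity is WARNING; operators can lower it
# (e.g. LOG_LEVEL=DEBUG) when actively troubleshooting.
level_name = os.environ.get("LOG_LEVEL", "WARNING").upper()
logging.basicConfig(
    stream=sys.stdout,
    level=getattr(logging, level_name, logging.WARNING),
)

logging.info("user logged in")          # suppressed at the WARNING default
logging.warning("retry budget low")     # emitted
```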

4. Include Standard Fields

Every log entry should include:

  • timestamp (UTC in ISO 8601 format)
  • severity (debug, info, warn, error, critical)
  • action (what was happening)
  • component (which service/module)
  • method (which function)
  • trace_id (32 lowercase hex characters - when trace context is active)
  • span_id (16 lowercase hex characters - when trace context is active)
  • error info (if applicable)
  • input/output values (sanitized - no PII/PHI)
  • duration (for operations that take time)

The trace_id and span_id fields enable correlation with distributed traces per the OpenTelemetry/W3C Trace Context standard. Include these fields whenever trace context is active to link logs with their corresponding trace spans. See the distributed tracing documentation for full trace context details.
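A hedged sketch of a helper that assembles this envelope and checks the trace-context formats; the function name and the exact validation strategy are illustrative:

```python
import json
import re
import time

TRACE_ID_RE = re.compile(r"^[0-9a-f]{32}$")  # W3C Trace Context format
SPAN_ID_RE = re.compile(r"^[0-9a-f]{16}$")

def standard_entry(severity, action, component, method,
                   trace_id=None, span_id=None, **extra):
    """Assemble the standard field set; trace fields appear only when
    trace context is active (i.e. the caller passes them)."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "severity": severity,
        "action": action,
        "component": component,
        "method": method,
    }
    if trace_id is not None:
        assert TRACE_ID_RE.fullmatch(trace_id), "trace_id: 32 lowercase hex chars"
        entry["trace_id"] = trace_id
    if span_id is not None:
        assert SPAN_ID_RE.fullmatch(span_id), "span_id: 16 lowercase hex chars"
        entry["span_id"] = span_id
    entry.update(extra)  # error info, sanitized values, duration, ...
    return json.dumps(entry)
```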

A Warning About Warnings

Don't overuse the warn level. Reserve it for things that actually need investigation to keep the system healthy - like invalid request content or approaching resource limits.

If it doesn't need investigation, log it as info. Too many warnings create noise, and eventually people stop paying attention to them.

Resource Limit Thresholds

When logging warnings about resource limits, use these graduated severity thresholds:

  • 75% utilization = Warn - Approaching the limit; investigate soon
  • 80% utilization = Error - Limit dangerously close; action needed
  • 90% utilization = Critical - Imminent failure risk

Apply these thresholds to:

  • Memory - Container or process memory limit
  • CPU - CPU quota measured over 5-minute average
  • Disk - Volume capacity for persistent storage
  • Database connections - Connection pool size
  • File descriptors - Process ulimit for open files
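The graduated thresholds above can be sketched as a small mapping function (names are illustrative):

```python
def utilization_severity(used, limit):
    """Map resource utilization to the graduated severities above.
    Returns None below the 75% warn threshold (nothing to log)."""
    pct = used / limit * 100
    if pct >= 90:
        return "critical"
    if pct >= 80:
        return "error"
    if pct >= 75:
        return "warn"
    return None

# Example: a connection pool of 100 with 83 connections in use
# crosses the 80% error threshold.
```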

Logging vs. Alerting: Log immediately when crossing these thresholds—this captures transient spikes for forensic analysis. However, only trigger alerts or page on-call if the condition is sustained for 5+ minutes. Brief spikes that resolve quickly shouldn't wake anyone up.
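One possible sketch of the log-immediately, alert-if-sustained rule, using an injectable clock for testability; the class and parameter names are assumptions:

```python
import time

class SustainedAlert:
    """Log every threshold crossing immediately, but report an
    alert-worthy condition only after it persists for `hold_seconds`."""

    def __init__(self, hold_seconds=300, clock=time.monotonic):
        self.hold = hold_seconds
        self.clock = clock     # injectable for testing
        self.since = None      # when the current breach began

    def observe(self, breached):
        """Call on every sample; returns True when it is time to page."""
        now = self.clock()
        if not breached:
            self.since = None  # transient spike resolved; reset the timer
            return False
        if self.since is None:
            self.since = now   # first crossing: log here, but don't page yet
        return now - self.since >= self.hold
```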

What Information to Capture

Use common sense, but follow these guidelines when deciding what to log.

Always Log These Events

  • Application startup and initialization
  • Configuration parameters (version, build tag, environment)
  • Configuration errors
  • All error and critical level events (never suppress these)
  • Shutdown requests and graceful shutdowns
  • Configuration changes (especially at runtime)

Always Include

  • When - Timestamp in UTC (ISO 8601 format)
  • What - Event name and severity level
  • Where - Hostname/container ID and component/service name
  • How long - Duration for any operation that takes meaningful time

Include When Relevant

  • Database operations - Connection info, query types (not full queries), result counts
  • Message queue operations - Topics, queues, producer/consumer IDs, message counts
  • Errors - Error type, message, stack trace (sanitized)
  • User actions - Internal user ID (UUIDs, database IDs), session ID (never log the personal information these IDs reference: usernames, emails, SSNs, etc.)

Sanitizing Stack Traces

When logging stack traces, "sanitized" means removing information that could expose system internals or sensitive data while keeping the diagnostic value.

Remove from stack traces:

  • Absolute file paths - Use relative paths from project root instead (e.g., src/handlers/auth.go:42 instead of /home/jsmith/company-app/src/handlers/auth.go:42)
  • Usernames and home directory paths - Replace with generic markers (e.g., /home/jsmith/ becomes $HOME/)
  • Environment variable values that appear in stack context
  • Memory addresses - These provide minimal debugging value and can expose internal state

Keep in stack traces:

  • Function and method names
  • Line numbers
  • Error messages
  • Relative file paths from project root
  • Call hierarchy showing the execution flow

Additional guidance:

  • Truncate stack traces to 50 frames maximum to prevent log bloat
  • Consider using structured logging libraries that handle sanitization automatically
  • For Go, libraries like github.com/pkg/errors provide stack traces that are easier to sanitize programmatically
  • Always test your sanitization logic to ensure it doesn't accidentally remove critical debugging information
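A rough sketch of this sanitization in Python; the regular expressions are illustrative and would need tuning for your runtime's real trace format:

```python
import re

MAX_FRAMES = 50  # truncate to prevent log bloat

def sanitize_frame(line):
    """Scrub one stack-trace line. Patterns are illustrative."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<addr>", line)  # memory addresses
    line = re.sub(r"/home/[^/\s]+", "$HOME", line)    # usernames/home dirs
    line = re.sub(r"\$HOME/[\w.-]+/", "", line)       # reduce to project-relative path
    return line

def sanitize_trace(frames):
    """Sanitize and truncate a list of stack-trace lines."""
    return [sanitize_frame(frame) for frame in frames[:MAX_FRAMES]]
```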

Never Include

This is non-negotiable:

  • PII/PHI - Names, addresses, SSNs, medical records, phone numbers
  • Secrets - Passwords, API keys, tokens, certificates
  • Sensitive business data - Account numbers, credit cards, financial details
  • Full request/response bodies - They often contain sensitive data
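As an illustrative guard, here is a deny-list redaction pass over a payload before it reaches the logger. The key set is an assumption and deliberately non-exhaustive; in practice an allow-list of known-safe fields is usually safer, since deny-lists miss newly added sensitive keys.

```python
# Keys that must never reach logs (illustrative, non-exhaustive).
SENSITIVE_KEYS = {
    "password", "api_key", "token", "ssn", "credit_card",
    "email", "name", "address", "phone",
}

def redact(payload):
    """Mask sensitive fields in a dict before it is logged."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }
```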

For comprehensive data protection guidelines, see Data Protection.

When in doubt, don't log it. You can always add more logging later, but you can't un-log sensitive data that's already in production.

Operational Log Review

Logs are only valuable if someone looks at them. On-call engineers should review logs daily as part of standard operational duties.

Daily review focus areas:

  • Error and Critical logs - Investigate all errors and critical events, even if they didn't trigger alerts
  • Warning patterns - Look for repeated warnings that might indicate emerging problems
  • Anomalies - Unusual patterns in log volume, timing, or content
  • Resource warnings - Memory, CPU, disk, or connection pool warnings that could indicate capacity issues

This proactive review often catches problems before they become incidents. It also builds operational awareness of normal vs. abnormal system behavior.

How to review efficiently:

  • Start with the highest severity logs (Critical, then Error, then Warn)
  • Look for patterns across services, not just individual log entries
  • Use your log aggregation tool's filtering and grouping features to identify trends
  • Document any findings or concerns in your team's runbook or incident tracking system

Daily log review takes 15-30 minutes and significantly improves system reliability by catching issues early. Teams that practice this consistently see far fewer midnight and early morning fires—problems get addressed during business hours before they escalate.

