Observability: Logging¶
What to Log¶
Logging is one of those things that seems simple until you're debugging a production incident at 2 AM and realize your logs are useless.
Good logging requires consistency - in field names, detail levels, and structure. Without it, debugging becomes expensive and automation becomes impossible.
Some people advocate "log everything!" That sounds great until you're drowning in noise and paying for storage you don't need. Instead, be intentional about what you log and what you don't.
Context is Everything¶
Compare these two log entries:
```
user login failed
```

```json
{"user_id":"jdoe", "event":"login", "level":"warn", "timestamp":"2025-11-11T14:23:45Z", "trace_id":"4bf92f3577b34da6a3ce929d0e0e4736", "span_id":"00f067aa0ba902b7", "error":"invalid_password"}
```
The second one tells a story. You know who, what, when, and why. The first one just tells you something went wrong somewhere for someone.
Adding context helps you piece together what happened. But here's the catch: without unique identifiers, you'll get lost trying to track actions across services. In a microservice architecture, correlation is everything.
The OpenTelemetry trace_id serves as the primary correlation identifier - it's shared across all spans in a distributed trace, allowing you to track a request's journey through your entire system. The span_id identifies the specific operation within that trace. Together, they link your logs to distributed traces for complete request visibility.
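To make the ID formats concrete, here is a minimal Go sketch using the standard `log/slog` package. It fabricates random W3C-format identifiers with `crypto/rand` purely for illustration; a real service would read `trace_id` and `span_id` from the active OpenTelemetry span context rather than generating them itself.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"os"
)

// newTraceID returns a random 32-character lowercase hex string in the
// W3C Trace Context format. Illustrative only: in production these
// values come from the active span context, not from rand.
func newTraceID() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// newSpanID returns a random 16-character lowercase hex span identifier.
func newSpanID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Attach the correlation IDs once; every entry written through
	// reqLogger now carries them automatically.
	reqLogger := logger.With(
		slog.String("trace_id", newTraceID()),
		slog.String("span_id", newSpanID()),
	)

	reqLogger.Warn("login failed",
		slog.String("user_id", "jdoe"),
		slog.String("event", "login"),
		slog.String("error", "invalid_password"),
	)
}
```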
Recommendations¶
1. Log to stdout¶
For containerized applications, always log to stdout. Let your container orchestration platform handle log collection and routing. Don't try to manage log files inside containers.
2. Use Structured Logging¶
Treat logs as operational data, not just debugging output. Use JSON or another structured format that you can query and analyze.
Structured logs enable:
- Filtering by specific fields
- Automated alerting
- Performance analysis
- Trend detection
3. Use the Right Log Levels¶
Not everything deserves the same level of attention. Here's how we use log levels:
- Debug - Low-level diagnostic information for development and troubleshooting
- Info - Normal business events (user logged in, order processed, etc.)
- Warn - System working correctly, but something unusual happened that might need attention
- Error - Something failed, but we're attempting recovery
- Critical - Fatal errors that prevent the application from starting or functioning
Production default: Set production logging to Warn level.
Why Warn instead of Info? Because we use distributed tracing to capture normal operational flow. Traces show you the complete request journey with timing, dependencies, and context. They're designed for "everything worked" scenarios.
Logs at Warn and above surface the exceptions - things that deviate from normal operation and actually need human attention. This keeps your logs focused on actionable information instead of drowning in noise about successful operations.
The combination gives you complete observability:
- Traces (with sampling) capture detailed request flows and performance data
- Trace context (`trace_id` and `span_id`) ties logs and traces together when you need to investigate
- Warn/Error/Critical logs highlight anomalies and failures that need attention
You get full visibility without paying to store logs for millions of successful requests.
4. Include Standard Fields¶
Every log entry should include:
- timestamp (UTC in ISO 8601 format)
- severity (debug, info, warn, error, critical)
- action (what was happening)
- component (which service/module)
- method (which function)
- trace_id (32 lowercase hex characters - when trace context is active)
- span_id (16 lowercase hex characters - when trace context is active)
- error info (if applicable)
- input/output values (sanitized - no PII/PHI)
- duration (for operations that take time)
The trace_id and span_id fields enable correlation with distributed traces per the OpenTelemetry/W3C Trace Context standard. Include these fields whenever trace context is active to link logs with their corresponding trace spans. See the distributed tracing documentation for full trace context details.
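Putting the standard fields together, a single entry might look like the following. All field values here are illustrative:

```json
{
  "timestamp": "2025-11-11T14:23:45Z",
  "severity": "error",
  "action": "charge_card",
  "component": "payment-service",
  "method": "ProcessPayment",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "error": "gateway_timeout",
  "duration_ms": 5012
}
```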
A Warning About Warnings¶
Don't overuse the warn level. Reserve it for things that actually need investigation to keep the system healthy - like invalid request content or approaching resource limits.
If it doesn't need investigation, log it as info. Too many warnings create noise, and eventually people stop paying attention to them.
Resource Limit Thresholds¶
When logging warnings about resource limits, use these graduated severity thresholds:
- 75% utilization = `Warn` - Approaching limit, investigate soon
- 80% utilization = `Error` - Critical threshold, action needed
- 90% utilization = `Critical` - Imminent failure risk
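The graduated mapping above reduces to a small, easily tested function. This Go sketch assumes utilization is expressed as a ratio in [0.0, 1.0]; `utilizationLevel` is an illustrative name:

```go
package main

import "fmt"

// utilizationLevel maps a resource utilization ratio to the graduated
// severities above: 75% -> warn, 80% -> error, 90% -> critical.
// Below 75%, nothing needs to be logged, signalled by "".
func utilizationLevel(ratio float64) string {
	switch {
	case ratio >= 0.90:
		return "critical"
	case ratio >= 0.80:
		return "error"
	case ratio >= 0.75:
		return "warn"
	default:
		return ""
	}
}

func main() {
	for _, r := range []float64{0.50, 0.76, 0.85, 0.95} {
		fmt.Printf("%.0f%% utilization -> %q\n", r*100, utilizationLevel(r))
	}
}
```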
Apply these thresholds to:
- Memory - Container or process memory limit
- CPU - CPU quota measured over 5-minute average
- Disk - Volume capacity for persistent storage
- Database connections - Connection pool size
- File descriptors - Process ulimit for open files
Logging vs. Alerting: Log immediately when crossing these thresholds—this captures transient spikes for forensic analysis. However, only trigger alerts or page on-call if the condition is sustained for 5+ minutes. Brief spikes that resolve quickly shouldn't wake anyone up.
What Information to Capture¶
Use common sense, but follow these guidelines when deciding what to log.
Always Log These Events¶
- Application startup and initialization
- Configuration parameters (version, build tag, environment)
- Configuration errors
- All `error` and `critical` level events (never suppress these)
- Shutdown requests and graceful shutdowns
- Configuration changes (especially at runtime)
Always Include¶
- When - Timestamp in UTC (ISO 8601 format)
- What - Event name and severity level
- Where - Hostname/container ID and component/service name
- How long - Duration for any operation that takes meaningful time
Include When Relevant¶
- Database operations - Connection info, query types (not full queries), result counts
- Message queue operations - Topics, queues, producer/consumer IDs, message counts
- Errors - Error type, message, stack trace (sanitized)
- User actions - Internal user ID (UUIDs, database IDs), session ID (never log the personal information these IDs reference: usernames, emails, SSNs, etc.)
Sanitizing Stack Traces¶
When logging stack traces, "sanitized" means removing information that could expose system internals or sensitive data while keeping the diagnostic value.
Remove from stack traces:
- Absolute file paths - Use relative paths from project root instead (e.g., `src/handlers/auth.go:42` instead of `/home/jsmith/company-app/src/handlers/auth.go:42`)
- Usernames and home directory paths - Replace with generic markers (e.g., `/home/jsmith/` becomes `$HOME/`)
/home/jsmith/becomes$HOME/) - Environment variable values that appear in stack context
- Memory addresses - These provide minimal debugging value and can expose internal state
Keep in stack traces:
- Function and method names
- Line numbers
- Error messages
- Relative file paths from project root
- Call hierarchy showing the execution flow
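The remove/keep lists above can be sketched as a per-frame sanitizer. Everything here is illustrative: the `projectRoot` constant is a hypothetical build-time value (a real service might derive it from `runtime/debug` build info), and the patterns cover only the cases listed above:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// projectRoot is a hypothetical absolute prefix stripped to make
// paths relative; derive it at build time in a real service.
const projectRoot = "/home/jsmith/company-app/"

var (
	homeDir = regexp.MustCompile(`/home/[^/]+/`) // usernames in paths
	hexAddr = regexp.MustCompile(`0x[0-9a-f]+`)  // memory addresses
)

// sanitizeFrame applies the remove-list to one stack frame while
// preserving function names, relative paths, and line numbers.
func sanitizeFrame(frame string) string {
	frame = strings.ReplaceAll(frame, projectRoot, "") // absolute -> relative
	frame = homeDir.ReplaceAllString(frame, "$$HOME/") // generic marker
	frame = hexAddr.ReplaceAllString(frame, "0x?")     // drop addresses
	return frame
}

func main() {
	fmt.Println(sanitizeFrame("/home/jsmith/company-app/src/handlers/auth.go:42 +0x1a4"))
}
```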
Additional guidance:
- Truncate stack traces to 50 frames maximum to prevent log bloat
- Consider using structured logging libraries that handle sanitization automatically
- For Go, libraries like
github.com/pkg/errorsprovide stack traces that are easier to sanitize programmatically - Always test your sanitization logic to ensure it doesn't accidentally remove critical debugging information
Never Include¶
This is non-negotiable:
- PII/PHI - Names, addresses, SSNs, medical records, phone numbers
- Secrets - Passwords, API keys, tokens, certificates
- Sensitive business data - Account numbers, credit cards, financial details
- Full request/response bodies - They often contain sensitive data
For comprehensive data protection guidelines, see Data Protection.
When in doubt, don't log it. You can always add more logging later, but you can't un-log sensitive data that's already in production.
Operational Log Review¶
Logs are only valuable if someone looks at them. On-call engineers should review logs daily as part of standard operational duties.
Daily review focus areas:
- Error and Critical logs - Investigate all errors and critical events, even if they didn't trigger alerts
- Warning patterns - Look for repeated warnings that might indicate emerging problems
- Anomalies - Unusual patterns in log volume, timing, or content
- Resource warnings - Memory, CPU, disk, or connection pool warnings that could indicate capacity issues
This proactive review often catches problems before they become incidents. It also builds operational awareness of normal vs. abnormal system behavior.
How to review efficiently:
- Start with the highest severity logs (Critical, then Error, then Warn)
- Look for patterns across services, not just individual log entries
- Use your log aggregation tool's filtering and grouping features to identify trends
- Document any findings or concerns in your team's runbook or incident tracking system
Daily log review takes 15-30 minutes and significantly improves system reliability by catching issues early. Teams that practice this consistently see far fewer midnight and early morning fires—problems get addressed during business hours before they escalate.