Observability: Metrics¶
What Are Metrics?¶
Metrics are numerical measurements taken over time that tell you how your system is performing right now and how it's trending.
Think of metrics as your dashboard - they answer questions like:
- How many requests per second are we handling?
- What's the 95th percentile response time?
- How much memory are we using?
- What's our error rate?
- How deep is the queue?
Unlike logs (which capture discrete events) and traces (which follow individual requests), metrics aggregate data into time series. They're designed for dashboards, alerts, and spotting trends.
When to Expose Metrics¶
Not every application needs metrics. Here's when you should expose them and when you shouldn't.
Expose Metrics For:¶
Hosted services - APIs, web applications, microservices running 24/7
- Request rates and response times
- Error rates and types
- Resource utilization
- Business KPIs
Worker processes - Queue consumers, batch jobs, background processors
- Queue depth and processing rate
- Job duration and success/failure rates
- Throughput and backlog
- Resource consumption
Databases and data stores - Connection pools, caches, message queues
- Connection pool usage
- Query performance
- Cache hit rates
- Lock contention
Don't Bother For:¶
CLI tools - They run for seconds and then exit. Nothing to monitor.
PoCs and prototypes - They're temporary. Focus on proving the concept first.
One-off scripts - They're not long-running processes.
Development/test utilities - They're not production workloads.
The rule of thumb: if it runs continuously and serves production traffic, it needs metrics.
What to Measure¶
Don't try to measure everything - focus on what matters. Two proven approaches help you decide what to instrument.
The RED Method (Request-Driven Services)¶
For services that handle requests (APIs, web apps), measure:
- Rate - Requests per second (how busy are we?)
- Errors - Failed requests per second (what's breaking?)
- Duration - Response time distribution (how fast are we?)
These three metrics give you a clear picture of service health.
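As a rough illustration of what a RED tracker accumulates, here is a minimal in-process sketch in plain Python. In production you would use a metrics client library rather than hand-rolling this; the class and method names here are invented for the example.

```python
class RedMetrics:
    """Minimal in-process RED tracker (sketch only; a real service
    would use a metrics client library instead)."""

    def __init__(self):
        self.requests = 0    # Rate: total requests (divide by window for req/s)
        self.errors = 0      # Errors: failed requests
        self.durations = []  # Duration: observed latencies in seconds

    def observe(self, duration_s, ok=True):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations.append(duration_s)

    def p95(self):
        # Nearest-rank 95th percentile of observed durations
        if not self.durations:
            return 0.0
        ordered = sorted(self.durations)
        idx = max(0, int(round(0.95 * len(ordered))) - 1)
        return ordered[idx]

red = RedMetrics()
red.observe(0.120, ok=True)
red.observe(0.480, ok=False)
red.observe(0.090, ok=True)
print(red.requests, red.errors, red.p95())
```

Real systems report durations as histograms rather than raw samples, as covered in the Prometheus section below.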
The USE Method (Resource Monitoring)¶
For system resources (CPU, memory, disk, network), measure:
- Utilization - Percentage of resource capacity used
- Saturation - How much work is queued waiting for the resource
- Errors - Resource-related errors (OOM kills, disk errors, etc.)
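A USE snapshot can be approximated with just the standard library; the sketch below uses disk usage for utilization and load average per CPU as a saturation proxy (real setups typically export these via an agent such as node_exporter, and the error signal would come from kernel counters or logs, which aren't shown here).

```python
import os
import shutil

def use_snapshot(path="/"):
    """Illustrative USE measurements using only the standard library."""
    total, used, free = shutil.disk_usage(path)
    utilization = used / total                   # Utilization: fraction of disk in use
    load1, _, _ = os.getloadavg()                # 1-minute load average (Unix only)
    saturation = load1 / (os.cpu_count() or 1)   # Load per CPU; > 1.0 means work is queued
    return {"disk_utilization": utilization, "cpu_saturation": saturation}

snap = use_snapshot()
print(snap)
```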
Business Metrics¶
Don't forget the metrics that matter to the business:
- Active users or sessions
- Transactions processed
- Revenue per hour
- Conversion rates
- Feature usage
These help you understand if technical performance translates to business value.
Prometheus Implementation¶
Prometheus has become the de facto standard for metrics in containerized environments. Here's how to implement it correctly.
The /metrics Endpoint¶
Expose metrics at /metrics in Prometheus exposition format:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 567
# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 450
http_request_duration_seconds_bucket{le="1.0"} 980
http_request_duration_seconds_sum 523.4
http_request_duration_seconds_count 1000
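To make the format concrete, here is a hand-rolled sketch that renders a counter family in the exposition format shown above. This is for illustration only; in practice you would use an official client library, which handles escaping, registries, and the HTTP endpoint for you.

```python
def render_counter(name, help_text, samples):
    """Render a counter family in Prometheus exposition format.
    samples: list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

text = render_counter(
    "http_requests_total",
    "Total number of HTTP requests",
    [({"method": "GET", "status": "200"}, 1234),
     ({"method": "POST", "status": "201"}, 567)],
)
print(text)
```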
Metric Types¶
Choose the right type for what you're measuring:
Counter - A value that only goes up (requests handled, errors encountered)
- Use for: request counts, error counts, bytes transferred
- Never use for: things that can decrease
Gauge - A value that can go up or down (current memory usage, active connections)
- Use for: temperatures, queue depth, current resource usage
- Can be set to any value at any time
Histogram - Samples observations and counts them in buckets (request duration, response size)
- Use for: response times, request sizes
- Automatically provides count, sum, and buckets from which quantiles can be estimated server-side
Summary - Similar to histogram but calculates quantiles on the client side
- Use sparingly - histograms are usually better
- More expensive computationally
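The histogram type is the least intuitive of the four, so here is a minimal sketch of the Prometheus bucket model: each `le` bucket counts observations less than or equal to its bound, and bucket counts are cumulative. This is a teaching toy, not a client library.

```python
import bisect

class Histogram:
    """Minimal histogram mirroring the Prometheus model (sketch)."""

    def __init__(self, buckets=(0.1, 0.5, 1.0)):
        self.bounds = sorted(buckets)
        self.counts = [0] * len(self.bounds)  # per-bucket; made cumulative on read
        self.total = 0   # the implicit le="+Inf" bucket
        self.sum = 0.0

    def observe(self, value):
        self.total += 1
        self.sum += value
        i = bisect.bisect_left(self.bounds, value)
        if i < len(self.bounds):
            self.counts[i] += 1  # values above the largest bound only hit +Inf

    def cumulative(self):
        out, running = [], 0
        for bound, c in zip(self.bounds, self.counts):
            running += c
            out.append((bound, running))
        return out

h = Histogram()
for v in (0.05, 0.2, 0.3, 0.7, 2.0):
    h.observe(v)
print(h.cumulative(), h.total, h.sum)
```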
Naming Conventions¶
Follow Prometheus naming conventions to keep metrics consistent:
- Use snake_case for metric names
- Start with the application or domain name: http_requests_total, queue_depth
- End counters with _total: http_requests_total, errors_total
- End duration metrics with _seconds: request_duration_seconds
- End size metrics with appropriate units: response_size_bytes
Good names: api_http_requests_total, worker_job_duration_seconds, cache_hits_total
Bad names: RequestCount (use snake_case), duration (too vague, missing unit), http_requests (counters should end with _total)
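These conventions are easy to enforce with a lint check at registration time. The helper below is hypothetical, not something Prometheus itself provides; the suffix rules are conventions, not enforced by the server.

```python
import re

# snake_case: lowercase start, then lowercase letters, digits, underscores
NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def check_metric_name(name, kind):
    """Hypothetical lint helper for the naming conventions above.
    Returns a list of problems (empty list means the name is fine)."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("use snake_case (lowercase letters, digits, underscores)")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counters should end with _total")
    if kind == "duration" and not name.endswith("_seconds"):
        problems.append("duration metrics should end with _seconds")
    return problems

print(check_metric_name("api_http_requests_total", "counter"))  # []
print(check_metric_name("RequestCount", "counter"))
```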
Labels: Use Wisely¶
Labels let you slice metrics by dimensions (method, status, endpoint). But be careful - every unique combination of labels creates a new time series.
Good label usage: a small, fixed set of values, such as HTTP method, status code, or a bounded set of endpoints.
Dangerous label usage (cardinality explosion): unbounded values, such as a user ID. If you have 10,000 users, you just created 10,000 time series for one metric. That's expensive and will overwhelm your Prometheus server.
Cardinality rules:
- Labels should have bounded, finite values
- Never use IDs or unbounded strings as labels
- Typical safe labels: method, status code, endpoint (limited set), service name
- Dangerous labels: user IDs, email addresses, session IDs, request IDs
How Metrics Complement Logs and Traces¶
Metrics, logs, and traces each serve different purposes. Use them together for complete observability.
Metrics answer "What's happening now?"
- Current request rate, error rate, latency
- Resource utilization trends
- System health at a glance
- Trigger alerts when thresholds are crossed
Logs answer "What went wrong?"
- Specific errors and exceptions
- Anomalies that need investigation
- Business rule violations
- Audit trail of important events
Traces answer "Where did it go?"
- Complete request journey through services
- Which service is slow?
- Where did the error originate?
- Dependencies and timing breakdown
Example Investigation Workflow¶
- Alert fires (from metrics): "Error rate > 5%"
- Check metrics dashboard: Which endpoint? What status codes?
- Search logs: What errors are being logged? Any patterns?
- Pull traces: For failed requests, what's the full story?
- Root cause found: Database connection timeout on specific query
Each layer provides different insight. You need all three.
Common Pitfalls¶
Cardinality Explosion¶
This is the #1 way to kill your metrics system.
The problem: Every unique label combination creates a new time series. Add a label with 1,000,000 possible values? That's 1,000,000 time series from a single metric -- and your Prometheus server will not be happy about it.
The solution:
- Keep label cardinality bounded (status codes: ~20 values, not user IDs: 1,000,000 values)
- Use sampling or aggregation for high-cardinality data
- Monitor your metrics system's resource usage
Over-Instrumenting¶
Don't measure everything just because you can. Every metric has a cost:
- Memory in your application
- Network bandwidth to send metrics
- Storage in Prometheus
- Query time on dashboards
Start with RED/USE, then add business metrics. Add more only when you have specific questions to answer.
Coverage vs. Depth:
While you should avoid over-instrumenting individual services, every production service needs core metrics. See Scale & High Availability for why monitoring coverage matters. The goal is complete coverage with focused instrumentation, not metrics everywhere or metrics nowhere.
Under-Instrumenting¶
The opposite problem: you don't have the metrics you need when something breaks.
Critical metrics you can't skip:
- Request rate and error rate (for services)
- Response time distribution (p50, p95, p99)
- Resource usage (CPU, memory, connections)
- Queue depth (for async systems)
If you're on call and can't answer "Is the service healthy?" from your dashboards, you're under-instrumented.
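Percentiles like p95 and p99 are typically estimated from histogram buckets, not computed exactly. The sketch below interpolates within the bucket that contains the target rank, in the spirit of PromQL's `histogram_quantile()`; it is a simplified illustration and treats the last bucket bound as the highest finite bound.

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets via linear
    interpolation (simplified sketch of the PromQL approach).
    buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            bucket_count = count - prev_count
            if bucket_count == 0:
                return bound
            # Assume observations are uniformly spread within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / bucket_count
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Buckets borrowed from the exposition-format example earlier
print(histogram_quantile(0.95, [(0.1, 100), (0.5, 450), (1.0, 980)]))
```

The trade-off: accuracy depends on bucket layout, so choose bucket bounds around the latencies you actually alert on.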
Forgetting About Storage Costs¶
Metrics aren't free. With default Prometheus settings:
- 1,000 active time series = ~1-2 MB/hour of storage
- 1,000,000 active time series = ~1-2 GB/hour of storage
High cardinality labels can quickly make metrics expensive. Monitor your time series count and set retention policies appropriately.