Observability: Heartbeats, Readiness, and Liveness

What is a Heartbeat?

Every service needs a way to answer the question "Are you okay?" That's what heartbeat endpoints do.

But a good heartbeat does more than just respond with a 200. It checks that the service can actually do its job - can it reach the database? Are its dependencies available? Is it configured correctly?

This information serves multiple purposes:

  • Auto-healing - Kubernetes can restart unhealthy containers
  • Auto-scaling and routing - Only send traffic to healthy instances
  • Troubleshooting - Ops teams can quickly identify what's broken
  • Alerting - Automated monitoring can catch problems before users notice

The heartbeat endpoint doubles as readiness and liveness checks in Kubernetes, so design it carefully.
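
To illustrate what "more than a 200" can mean in practice, here is a minimal sketch of running a set of dependency checks and timing each one. The names CheckResult, run_check, check_database, and check_configuration are hypothetical and not part of any standard; the two check functions are stand-ins for whatever a real service would probe.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CheckResult:
    status: str         # "OK" or "Critical"
    detail: str         # empty when the check passed
    duration_ms: float  # how long the check took, in milliseconds


def run_check(check: Callable[[], None]) -> CheckResult:
    """Run one check; any exception marks it Critical (an assumption of this sketch)."""
    started = time.perf_counter()
    try:
        check()
        status, detail = "OK", ""
    except Exception as exc:
        status, detail = "Critical", str(exc)
    duration_ms = (time.perf_counter() - started) * 1000.0
    return CheckResult(status, detail, duration_ms)


def check_database() -> None:
    # Hypothetical stand-in: a real check might run "SELECT 1" with a short timeout.
    pass


def check_configuration() -> None:
    # Hypothetical stand-in that simulates a missing required setting.
    raise RuntimeError("required setting DATABASE_URL is not set")


CHECKS: Dict[str, Callable[[], None]] = {
    "database": check_database,
    "configuration": check_configuration,
}

results = {name: run_check(check) for name, check in CHECKS.items()}
for name, result in results.items():
    print(name, result.status, result.detail)
```

Catching every exception keeps one broken dependency from crashing the heartbeat itself, so the endpoint can still report what failed.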

Unified Endpoint Design

While Kubernetes supports separate readiness and liveness probes, we intentionally point both at a single /ops/health endpoint. This simplifies implementation while still giving Kubernetes the information it needs through HTTP status codes (200 for healthy, 503 for unhealthy). Kubernetes treats any status code from 200 through 399 as a probe success and anything else as a failure: a failing readiness probe takes the pod out of Service routing, and a failing liveness probe restarts the container.
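
The status code decision can be sketched as a small pure function. This is only an illustration: it assumes that Critical is the only status reported as 503 and that OK, Warning, and NotSet all return 200, which the text above does not spell out. The names UNHEALTHY_STATUSES, http_status_for, and render_health are hypothetical.

```python
import json
from typing import Tuple

# Assumption: only Critical counts as unhealthy for probe purposes.
UNHEALTHY_STATUSES = {"Critical"}


def http_status_for(status: str) -> int:
    """Map a health Status value to the HTTP status code Kubernetes will see."""
    return 503 if status in UNHEALTHY_STATUSES else 200


def render_health(payload: dict) -> Tuple[int, str]:
    """Return the (HTTP status code, JSON body) pair for a health payload."""
    return http_status_for(payload["Status"]), json.dumps(payload)


code, body = render_health({"Status": "Critical", "Message": "database unreachable"})
print(code, body)
```

Keeping the mapping in one place makes it easy to change later, for example if a team decides that Warning should also fail the readiness probe.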

How to Define the Health Check

Every service exposes a heartbeat endpoint at /ops/health. Here's the standard OpenAPI spec we use.

openapi: 3.0.3
info:
  title: Service Heartbeat API
  description: Standard health check endpoint for all services
  version: 1.0.0
paths:
  /ops/health:
    get:
      summary: Get service health status
      description: Returns the current health status of the service and its dependencies
      responses:
        "200":
          description: Service health information
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/HealthResponse"
        "503":
          description: Service unavailable or critical health issue
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/HealthResponse"
components:
  schemas:
    HealthResponse:
      type: object
      required:
        - Status
        - URL
        - Machine
        - UtcDateTime
      properties:
        Status:
          type: string
          enum: [NotSet, OK, Warning, Critical]
          description: Overall health status
        Message:
          type: string
          description: Optional details about current health condition
        URL:
          type: string
          format: uri
          description: The actual endpoint URL that responded
        Machine:
          type: string
          description: Hostname or container name (never IP address)
        UtcDateTime:
          type: string
          format: date-time
          description: ISO 8601 timestamp when response was generated
        RequestDuration:
          type: number
          minimum: 0
          description: How long the health check took (milliseconds)
        Dependencies:
          type: array
          description: Health status of direct dependencies
          items:
            $ref: "#/components/schemas/DependencyHealth"
    DependencyHealth:
      type: object
      required:
        - Status
        - URL
        - UtcDateTime
      properties:
        Status:
          type: string
          enum: [NotSet, OK, Warning, Critical]
        URL:
          type: string
          format: uri
        UtcDateTime:
          type: string
          format: date-time
        RequestDuration:
          type: number
          minimum: 0
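
As a rough illustration of a response that satisfies this schema, the sketch below builds a HealthResponse body in Python. The function name build_health_response and the example URLs are hypothetical. It assumes the hostname reported by the operating system is an acceptable value for Machine.

```python
import json
import socket
from datetime import datetime, timezone
from typing import List, Optional


def utc_now() -> str:
    """Current time as an ISO 8601 string with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat()


def build_health_response(
    status: str,
    url: str,
    dependencies: Optional[List[dict]] = None,
    message: Optional[str] = None,
    duration_ms: Optional[float] = None,
) -> dict:
    # Required fields per the schema: Status, URL, Machine, UtcDateTime.
    body = {
        "Status": status,
        "URL": url,
        "Machine": socket.gethostname(),  # hostname, never an IP address
        "UtcDateTime": utc_now(),
    }
    # Optional fields are included only when a value is available.
    if message is not None:
        body["Message"] = message
    if duration_ms is not None:
        body["RequestDuration"] = duration_ms
    if dependencies is not None:
        body["Dependencies"] = dependencies
    return body


# Hypothetical service and dependency URLs, for illustration only.
dependency = {
    "Status": "OK",
    "URL": "https://inventory.example.internal/ops/health",
    "UtcDateTime": utc_now(),
    "RequestDuration": 12.5,
}
response = build_health_response(
    status="OK",
    url="https://orders.example.internal/ops/health",
    dependencies=[dependency],
    duration_ms=18.0,
)
print(json.dumps(response, indent=2))
```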

Health Status Determination

The health status enum maps to resource utilization thresholds defined in Observability: Logging:

Resource Utilization    Health Status
< 75%                   OK
75% - 80%               Warning
> 80%                   Critical

Apply these thresholds when checking memory, CPU, disk, database connections, and file descriptors. The overall health status should reflect the worst status among all checked resources and dependencies.
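
The thresholds and the "worst status wins" rule can be sketched as follows. The function names and the SEVERITY ordering are hypothetical; in particular, ranking NotSet below OK is an assumption of this sketch, since the table above does not cover NotSet.

```python
from typing import Iterable

# Assumed severity order, lowest to highest. NotSet ranking lowest is a guess.
SEVERITY = {"NotSet": 0, "OK": 1, "Warning": 2, "Critical": 3}


def status_for_utilization(percent: float) -> str:
    """Classify a utilization percentage using the thresholds in the table."""
    if percent < 75:
        return "OK"
    if percent <= 80:
        return "Warning"
    return "Critical"


def worst_status(statuses: Iterable[str]) -> str:
    """Return the most severe status, or NotSet when nothing was checked."""
    return max(statuses, key=SEVERITY.__getitem__, default="NotSet")


# Illustrative utilization readings, not real measurements.
readings = {"memory": 62.0, "cpu": 78.5, "disk": 40.0}
resource_statuses = {name: status_for_utilization(p) for name, p in readings.items()}
overall = worst_status(resource_statuses.values())
print(resource_statuses, overall)
```

Dependency statuses can be passed through worst_status alongside the resource statuses, so that a single Critical dependency is enough to make the overall status Critical.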

