Docker Health Checks: Ensuring Container Reliability in Production

A container that appears to be running is not necessarily a container that is working. Your web server process might be alive but stuck in an infinite loop. Your database container might show as "Up 3 days" while the actual database engine crashed internally hours ago. This is the fundamental problem that Docker health checks solve: they let you define what "healthy" actually means for your application, going far beyond simple process monitoring.

In production environments, health checks are the foundation of reliable container infrastructure. They enable orchestrators to make intelligent decisions about routing traffic, replacing failed containers, and rolling out updates safely. This guide covers everything you need to know to implement effective health checks across your container fleet.

Understanding the HEALTHCHECK Instruction

The HEALTHCHECK instruction in a Dockerfile tells Docker how to test whether your container is still working correctly. When a health check is configured, Docker periodically runs the specified command inside the container and uses the exit code to determine the container's health status.

There are three possible health states:

  • starting — The container has started but hasn't passed its first health check yet
  • healthy — The health check command returned exit code 0
  • unhealthy — The health check command returned a non-zero exit code for the configured number of consecutive retries (use exit code 1 for failure; Docker reserves exit code 2, so avoid it)

The basic syntax in a Dockerfile looks like this:

HEALTHCHECK [OPTIONS] CMD command

The available options control timing and retry behavior:

# Full syntax with all options
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --start-interval=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Here is what each option does:

Option            Default  Description
--interval        30s      Time between health check executions
--timeout         30s      Maximum time a check can run before being considered failed
--start-period    0s       Grace period for containers that need initialization time
--start-interval  5s       Interval between checks during the start period (Docker 25+)
--retries         3        Consecutive failures needed before marking unhealthy

Tip: The --start-period is critical for applications with long startup times, like Java applications or databases. During this period, failed health checks don't count toward the retry limit, but a successful check marks the container as healthy immediately.
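To build intuition for how these options interact, here is a toy simulation of the state transitions described above — a sketch, not Docker's actual implementation, assuming exactly one check per interval:

```shell
# Toy model of Docker's health-state transitions (not Docker's real code).
# Args: number of checks inside the start period, retries, and a result
# string such as "FFS" (F = failed check, S = successful check).
simulate() {
  start_checks=$1; retries=$2; results=$3
  state="starting"; streak=0; i=0
  while [ -n "$results" ]; do
    r=$(printf '%.1s' "$results")   # first character of the result string
    results=${results#?}
    i=$((i + 1))
    if [ "$r" = "S" ]; then
      state="healthy"; streak=0
    elif [ "$i" -gt "$start_checks" ] || [ "$state" != "starting" ]; then
      # failures only count once the start period is over,
      # or once the container has already been marked healthy
      streak=$((streak + 1))
      [ "$streak" -ge "$retries" ] && state="unhealthy"
    fi
  done
  echo "$state"
}

simulate 3 3 FFS    # failures inside the start period are forgiven -> healthy
simulate 0 3 FFF    # three consecutive failures -> unhealthy
```

The second invocation shows why --retries matters: a single flaky check never flips the state on its own.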

Writing Effective Health Check Commands

The simplest health checks use curl or wget to hit a local endpoint, but there are many approaches depending on your application type.

HTTP-Based Health Checks

For web applications and APIs, hitting a dedicated health endpoint is the most common pattern:

# Simple HTTP check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

# Using wget (useful for Alpine-based images without curl)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/healthz || exit 1

Your health endpoint should test actual application readiness, not just return 200 OK. A good health endpoint checks database connections, cache availability, and critical dependencies:

// Node.js health endpoint example
app.get('/health', async (req, res) => {
  try {
    // Test database connection
    await db.query('SELECT 1');
    // Test Redis connection
    await redis.ping();
    res.status(200).json({ status: 'healthy', uptime: process.uptime() });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: err.message });
  }
});

TCP-Based Health Checks

For services that don't expose HTTP endpoints, you can check TCP connectivity:

# Check if a port is accepting connections
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD nc -z localhost 5432 || exit 1

# Alternative using a bash built-in (requires bash in the image)
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD bash -c '</dev/tcp/localhost/6379' || exit 1

Database-Specific Health Checks

Databases deserve health checks that verify the engine is actually processing queries:

# PostgreSQL
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD pg_isready -U postgres -d mydb || exit 1

# MySQL / MariaDB
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD mysqladmin ping -h localhost -u root --password="$MYSQL_ROOT_PASSWORD" || exit 1

# Redis
HEALTHCHECK --interval=15s --timeout=5s --retries=3 \
  CMD redis-cli ping | grep -q PONG || exit 1

# MongoDB
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
  CMD mongosh --eval "db.adminCommand('ping')" --quiet || exit 1

Custom Health Check Scripts

For complex health requirements, use a dedicated script:

#!/bin/bash
# healthcheck.sh — Custom health check script

# Check if main process is running
if ! pgrep -x "myapp" > /dev/null; then
  echo "Main process not running"
  exit 1
fi

# Check disk space (fail if less than 10% free)
DISK_USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
if [ "$DISK_USAGE" -gt 90 ]; then
  echo "Disk usage critical: ${DISK_USAGE}%"
  exit 1
fi

# Check if app responds within acceptable latency
RESPONSE_TIME=$(curl -o /dev/null -s -w '%{time_total}' http://localhost:8080/health)
if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
  echo "Response time too high: ${RESPONSE_TIME}s"
  exit 1
fi

echo "All checks passed"
exit 0

Reference it in your Dockerfile:

COPY healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/healthcheck.sh
HEALTHCHECK --interval=30s --timeout=15s --start-period=45s --retries=3 \
  CMD /usr/local/bin/healthcheck.sh

Health Checks in Docker Compose

Docker Compose lets you define health checks directly in your compose file, which is useful when you don't control the Dockerfile or want to override built-in health checks:

services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:80/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  api:
    build: ./api
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s

  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

A critical pattern is using health check conditions in service dependencies to ensure services start in the correct order:

services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 30s

  api:
    build: ./api
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      DATABASE_URL: postgres://postgres:secret@postgres:5432/mydb

Tip: The condition: service_healthy directive is the correct way to handle startup ordering in Docker Compose. It replaces the old depends_on behavior that only waited for the container to start, not for the service to be ready.
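If you are stuck on tooling that predates condition: service_healthy, the same effect can be approximated with a polling wrapper in the dependent service's entrypoint. A minimal sketch — wait_for is a hypothetical helper, not something Docker provides:

```shell
#!/bin/sh
# wait_for: retry a command until it succeeds or attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command...>
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up after $attempts attempt(s)" >&2
  return 1
}

# In a real entrypoint you might run: wait_for 30 2 pg_isready -h postgres
wait_for 3 1 true   # demo with a command that succeeds immediately
```

The exit code propagates, so an entrypoint can bail out cleanly when the dependency never comes up.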

Orchestrator Integration

Health checks become even more powerful when integrated with orchestration platforms. Docker Swarm uses health checks natively for service management.

Docker Swarm

In Swarm mode, health checks directly influence service behavior:

# Deploy a service with health check configuration
docker service create \
  --name web \
  --replicas 3 \
  --health-cmd "curl -f http://localhost:8080/health || exit 1" \
  --health-interval 15s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 30s \
  --update-delay 10s \
  --update-failure-action rollback \
  myapp:latest

During rolling updates, Swarm waits for new containers to become healthy before stopping old ones. If a new container fails its health check, Swarm can automatically roll back the update. This makes health checks the backbone of zero-downtime deployments.
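Outside Swarm, a deploy script can emulate this gate by polling docker inspect until the replacement container reports healthy. A sketch, with the container name and timeout as placeholders:

```shell
# Block until a container reports healthy, or give up after a deadline.
# Usage: wait_healthy <container> <deadline_seconds>
wait_healthy() {
  name=$1; deadline=$2; elapsed=0
  while [ "$elapsed" -lt "$deadline" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  return 1
}

# e.g.: wait_healthy web-v2 120 && switch_traffic_to web-v2
```

Only after the gate passes should the script stop the old container — the same ordering Swarm enforces for you.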

Load Balancer Integration

When running behind a reverse proxy like Traefik or nginx, health checks determine traffic routing. An unhealthy container gets removed from the load balancer pool until it recovers:

services:
  web:
    image: myapp:latest
    deploy:
      replicas: 3
      labels:
        - "traefik.enable=true"
        - "traefik.http.services.web.loadbalancer.healthcheck.path=/health"
        - "traefik.http.services.web.loadbalancer.healthcheck.interval=10s"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3

Restart Policies and Health Checks

Docker's restart policies work alongside health checks, but they do not trigger restarts based on health status. The restart policy (--restart) applies only when the main container process exits. Health checks inform orchestrators and monitoring tools, but standalone Docker never restarts a container merely because it is unhealthy.

To get automatic restarts on health check failure, you have several options:

# Option 1: Use Docker Swarm (recommended for production)
# Swarm automatically replaces unhealthy tasks
docker service create --name myapp --replicas 1 \
  --health-cmd "curl -f http://localhost/health" \
  --restart-condition on-failure \
  myapp:latest

# Option 2: Use autoheal container
# This sidecar monitors and restarts unhealthy containers
docker run -d --name autoheal \
  --restart always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal

# Option 3: Use a monitoring tool like usulnet
# usulnet provides a dashboard that shows container health
# and lets you configure automatic restart policies

Tip: With usulnet, you get a visual overview of container health states across all your Docker hosts. The dashboard clearly shows which containers are healthy, unhealthy, or in starting state, so you can respond to issues quickly without running manual docker inspect commands.

Debugging Unhealthy Containers

When a container is marked unhealthy, you need to quickly diagnose the root cause. Docker provides several tools for this.

Inspect Health Check Logs

Docker stores the output of the last five health check executions:

# View health check status and recent results
docker inspect --format='{{json .State.Health}}' mycontainer | jq .

# Output example:
{
  "Status": "unhealthy",
  "FailingStreak": 5,
  "Log": [
    {
      "Start": "2025-02-06T10:30:00.123456Z",
      "End": "2025-02-06T10:30:02.456789Z",
      "ExitCode": 1,
      "Output": "curl: (7) Failed to connect to localhost port 8080"
    }
  ]
}

Check Container Events

# Watch health-related events in real time
docker events --filter event=health_status

# Filter for a specific container
docker events --filter container=mycontainer --filter event=health_status

Common Causes of Health Check Failures

  1. Missing tools in the image: If your health check uses curl but your image is based on a minimal distribution that doesn't include it, the check will always fail. Use wget for Alpine-based images or install the necessary tools.
  2. Wrong port or endpoint: The health check runs inside the container, so use the internal port, not the host-mapped port. If your app listens on port 3000 internally but you map it to 8080, the health check should target port 3000.
  3. Insufficient start period: Applications with slow startup (JVM-based apps, apps with database migrations) need an adequate --start-period. Without it, the container may be marked unhealthy before it finishes initializing.
  4. Timeout too short: If your health endpoint makes downstream calls, the timeout must account for the total response time. A health check that queries the database needs more time than one that returns a static response.
  5. Resource constraints: If the container is under heavy load, the health check process itself may not get enough CPU time to execute within the timeout.
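For the first failure mode, either switch to a tool the base image already ships (busybox wget on Alpine) or install one at build time. A Dockerfile fragment for an Alpine-based image:

```dockerfile
# Install curl so the health check command actually exists in the image
RUN apk add --no-cache curl
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```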

Quick Debugging Workflow

# Step 1: Check the current health status
docker inspect --format='{{.State.Health.Status}}' mycontainer

# Step 2: View the health check log output
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' mycontainer

# Step 3: Manually run the health check command
docker exec mycontainer curl -f http://localhost:8080/health

# Step 4: Check if the health check tool exists
docker exec mycontainer which curl

# Step 5: Check container logs for application errors
docker logs --tail 50 mycontainer

Best Practices for Production Health Checks

Keep Health Checks Lightweight

Health check commands consume resources every time they run. Avoid heavy operations inside health checks:

# Bad: Running a full database query
HEALTHCHECK CMD psql -U postgres -c "SELECT count(*) FROM large_table"

# Good: Minimal connectivity check
HEALTHCHECK CMD pg_isready -U postgres

Use Separate Liveness and Readiness Endpoints

Adopt the Kubernetes pattern even if you're not running Kubernetes. A liveness endpoint tells you the process is alive, while a readiness endpoint tells you it can handle traffic:

// /healthz — Liveness: is the process running?
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// /readyz — Readiness: can it handle requests?
app.get('/readyz', async (req, res) => {
  const dbReady = await checkDatabase();
  const cacheReady = await checkRedis();
  if (dbReady && cacheReady) {
    res.status(200).json({ ready: true });
  } else {
    res.status(503).json({ ready: false, db: dbReady, cache: cacheReady });
  }
});

Avoid External Dependencies in Health Checks

A health check should test whether the container itself is healthy, not whether external services are reachable. If your health check fails because an upstream API is down, your container will be marked unhealthy and potentially restarted, which won't fix the upstream issue and will make the situation worse.
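One way to enforce that rule in a health script: local conditions are fatal, upstream reachability is logged but never fails the check. A sketch — READY_FILE and UPSTREAM_URL are illustrative names, not established conventions:

```shell
#!/bin/sh
# Health script where only local failures mark the container unhealthy.
READY_FILE="${READY_FILE:-/tmp/app.ready}"            # hypothetical marker the app writes
UPSTREAM_URL="${UPSTREAM_URL:-http://upstream:8080}"  # hypothetical dependency

check_local() {
  [ -f "$READY_FILE" ]
}

check_upstream() {
  # short timeout so a slow upstream can't eat the health check budget
  curl -fsS --max-time 2 "$UPSTREAM_URL/health" >/dev/null 2>&1
}

health() {
  if ! check_local; then
    echo "local check failed"
    return 1
  fi
  if ! check_upstream; then
    # log it, but do not fail the container for an upstream problem
    echo "warning: upstream unreachable"
  fi
  echo "healthy"
  return 0
}

health || echo "container would be marked unhealthy"
```

Restarting this container when the upstream is down would churn a perfectly good instance; the warning line surfaces the problem without triggering that churn.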

Configure Appropriate Timing

Consider your application's characteristics when setting health check timers:

  • Fast-starting services (nginx, Redis): interval=10s, timeout=5s, start-period=5s
  • Standard web apps (Node.js, Go): interval=15s, timeout=5s, start-period=15s
  • Slow-starting services (Java, databases): interval=30s, timeout=10s, start-period=60s
  • Background workers: interval=60s, timeout=10s, start-period=30s
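A useful sanity check on any of these profiles: the worst-case time to detect a failure is roughly interval × retries, plus up to timeout per failing check. A quick sketch using the values above:

```shell
# Rough worst-case failure detection time: interval * retries.
# Each failing check can additionally take up to `timeout` seconds.
detect_seconds() {
  interval=$1; retries=$2
  echo $((interval * retries))
}

echo "standard web app: $(detect_seconds 15 3)s"   # 45s
echo "slow starter:     $(detect_seconds 30 3)s"   # 90s
```

If 90 seconds of undetected failure is too long for a service, tighten the interval rather than the retries — dropping retries to 1 turns every transient blip into an unhealthy state.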

Don't Disable Health Checks

While Docker allows HEALTHCHECK NONE to disable inherited health checks, avoid this in production. If a parent image defines a health check, either keep it or replace it with a more appropriate one rather than removing it entirely.

Monitoring Health Status at Scale

When managing dozens or hundreds of containers, you need centralized health monitoring. Running docker inspect on individual containers doesn't scale.

Tools like usulnet provide a real-time dashboard that aggregates health status across all containers and hosts. You can see at a glance which services are degraded and take action from a single interface. This visibility is essential for production environments where a single unhealthy container can cascade into a larger outage if not addressed promptly.

Effective health check implementation is one of the simplest yet most impactful improvements you can make to your container infrastructure. Start with basic HTTP checks, evolve to custom scripts as your requirements grow, and always ensure your orchestrator and monitoring tools are configured to act on health status changes.