Docker Container Monitoring: Tools, Metrics, and Setup Guide

Containers are ephemeral by design. They start, run, crash, restart, and scale up and down. Without monitoring, you're flying blind: you won't know that your API container has been using 95% of its memory limit for the last hour, or that your database container has restarted 12 times today, or that network latency between services has spiked.

This guide covers everything you need to set up production-grade Docker monitoring: what metrics to track, which tools to use, how to configure them, and how to set up alerts so you know about problems before your users do.

What to Monitor

Docker monitoring happens at three levels: container metrics, host metrics, and application metrics. You need all three for a complete picture.

Container Metrics

Metric | Why it matters | Alert when
CPU usage (%) | High CPU means the container is under load or stuck in a loop | > 80% sustained for 5+ minutes
Memory usage (bytes) | Approaching the limit means an OOM kill is imminent | > 85% of the memory limit
Memory limit | Know what the cap is | No limit set (production containers should always have limits)
Network I/O (bytes in/out) | Unusual spikes indicate attacks, data exfiltration, or misconfiguration | > 2x the normal baseline
Disk I/O (reads/writes) | High disk I/O can bottleneck the host | Sustained high I/O causing latency
Restart count | Frequent restarts indicate instability | > 3 restarts in 1 hour
Container state | Is it running, paused, or dead? | Any production container not in the "running" state
Health check status | Is the container actually serving requests? | Status changes to "unhealthy"
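As a toy illustration (this is not any monitoring tool's API, just the table translated into code), the thresholds above are simple to encode:

```python
def check_container(name, cpu_pct, mem_used, mem_limit, restarts_last_hour, state):
    """Evaluate the table's suggested thresholds for one container.

    cpu_pct should already be a sustained average (e.g. over 5 minutes);
    mem_used/mem_limit are bytes; mem_limit is None when no limit is set.
    """
    alerts = []
    if cpu_pct > 80:
        alerts.append(f"{name}: CPU above 80% ({cpu_pct:.1f}%)")
    if mem_limit is None:
        alerts.append(f"{name}: no memory limit set")
    elif mem_used / mem_limit * 100 > 85:
        alerts.append(f"{name}: memory above 85% of limit")
    if restarts_last_hour > 3:
        alerts.append(f"{name}: {restarts_last_hour} restarts in the last hour")
    if state != "running":
        alerts.append(f"{name}: state is '{state}', expected 'running'")
    return alerts
```

A real monitoring system evaluates rules like these continuously against stored time series rather than one snapshot, which is exactly what the Prometheus setup later in this guide does.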

Host Metrics

  • Disk space — Docker images, volumes, and build cache consume disk. Alert at 80% usage.
  • Total CPU/Memory — the aggregate across all containers plus host processes.
  • Docker daemon status — if the daemon goes down, all containers stop.
  • Available file descriptors — each container consumes file descriptors; exhausting them causes failures across the host.

Application Metrics

Container metrics tell you how many resources your app consumes. Application metrics tell you whether it's actually working:

  • Request rate and latency (p50, p95, p99)
  • Error rate (5xx responses)
  • Queue depth (for worker containers)
  • Database connection pool utilization
  • Custom business metrics
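In production you would expose these with the official Prometheus client library for your language. Purely to show what an application metrics endpoint looks like on the wire, here is a stdlib-only Python sketch (the metric name app_requests_total and the server layout are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

REQUEST_COUNT = {"value": 0}  # toy counter; real apps track labels per endpoint/status

def render_metrics():
    # Prometheus text exposition format: HELP/TYPE comments, then "name value" lines.
    return (
        "# HELP app_requests_total Total HTTP requests handled.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT['value']}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":  # the path Prometheus scrapes
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:  # every other path counts as "application" traffic
            REQUEST_COUNT["value"] += 1
            self.send_response(200)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port=0):
    """Start the server on a background thread; returns (server, bound_port)."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

Point Prometheus at the /metrics path of each app container and these series show up alongside the container metrics from cAdvisor.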

Quick Start: docker stats

Before you set up a full monitoring stack, try Docker's built-in command for real-time metrics:

# Real-time stats for all containers
docker stats

# Stats for specific containers
docker stats myapp db redis

# One-shot (no streaming)
docker stats --no-stream

# Custom format
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

Output looks like:

NAME     CPU %   MEM USAGE / LIMIT     NET I/O           BLOCK I/O
myapp    2.35%   145.2MiB / 512MiB     1.2GB / 890MB     50MB / 10MB
db       0.85%   380.4MiB / 1GiB       500MB / 1.1GB     2.3GB / 1.8GB
redis    0.12%   28.1MiB / 256MiB      200MB / 180MB     0B / 512KB

docker stats is useful for quick checks but doesn't store historical data, generate graphs, or send alerts. For that, you need a proper monitoring stack.
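It is scriptable, though: with --format, the output becomes easy to parse. A Python sketch, assuming the tab-separated placeholders shown above (the size-suffix table covers Docker's usual binary and decimal units):

```python
import re

_UNITS = {"B": 1, "kB": 1000, "KB": 1000, "MB": 1000**2, "GB": 1000**3,
          "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

def parse_size(s):
    """Convert a docker stats size string like '145.2MiB' to bytes."""
    m = re.fullmatch(r"([\d.]+)\s*([A-Za-z]+)", s.strip())
    if not m or m.group(2) not in _UNITS:
        raise ValueError(f"unparseable size: {s!r}")
    return float(m.group(1)) * _UNITS[m.group(2)]

def parse_stats_line(line):
    """Parse one line of:
    docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'
    """
    name, cpu, mem = line.rstrip("\n").split("\t")
    used, limit = (parse_size(p) for p in mem.split(" / "))
    return {"name": name,
            "cpu_pct": float(cpu.rstrip("%")),
            "mem_used": used,
            "mem_limit": limit,
            "mem_pct": used / limit * 100}
```

A cron job feeding these records into a log or alerting script is a workable stopgap, but it still lacks history and graphs, which is where the stack below comes in.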

The Prometheus + Grafana + cAdvisor Stack

The industry standard for container monitoring is the Prometheus ecosystem. Here's the architecture:

  • cAdvisor — collects container metrics from the Docker daemon and exposes them in Prometheus format
  • Node Exporter — collects host-level metrics (CPU, memory, disk, network)
  • Prometheus — scrapes metrics from cAdvisor and Node Exporter, stores time-series data, evaluates alert rules
  • Grafana — visualizes metrics with dashboards and charts
  • Alertmanager — routes alerts from Prometheus to Slack, email, PagerDuty, etc.

Full Docker Compose Setup

# docker-compose.monitoring.yml
version: "3.8"

services:
  # Collects container metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "127.0.0.1:8080:8080"
    networks:
      - monitoring

  # Collects host metrics
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /:/host:ro,rslave
    ports:
      - "127.0.0.1:9100:9100"
    networks:
      - monitoring

  # Time-series database and alerting engine
  prometheus:
    image: prom/prometheus:v2.50.1
    container_name: prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"
    networks:
      - monitoring
    depends_on:
      - cadvisor
      - node-exporter

  # Visualization
  grafana:
    image: grafana/grafana:10.3.3
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "127.0.0.1:3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  # Alert routing
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    ports:
      - "127.0.0.1:9093:9093"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    name: monitoring

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Container metrics from cAdvisor
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    metric_relabel_configs:
      # Drop high-cardinality metrics to save storage
      - source_labels: [__name__]
        regex: 'container_tasks_state'
        action: drop

  # Host metrics from Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Docker daemon metrics (enable in daemon.json)
  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']

To enable Docker daemon metrics, add the following to /etc/docker/daemon.json, then restart the daemon (sudo systemctl restart docker):

{
  "metrics-addr": "127.0.0.1:9323",
  "experimental": true
}
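After the restart, curl http://127.0.0.1:9323/metrics should return metrics in the Prometheus text format. If you ever need to spot-check a value from a script rather than through Prometheus, the format is simple enough to parse by hand (stdlib-only sketch; the metric name used in the example is illustrative):

```python
def metric_value(exposition_text, metric_name):
    """Return the first value of a metric from Prometheus text exposition,
    e.g. the output of: curl -s http://127.0.0.1:9323/metrics

    Matches both bare metric names and names carrying a {label,...} set.
    """
    for line in exposition_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name_part, _, value = line.rpartition(" ")
        if name_part == metric_name or name_part.startswith(metric_name + "{"):
            return float(value)
    return None
```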

Alert Rules

# prometheus/alert-rules.yml
groups:
  - name: container-alerts
    rules:
      # Container is down
      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container {{ $labels.name }} has been down for more than 1 minute."

      # High memory usage
      - alert: ContainerHighMemory
        # The "> 0" guard skips containers whose limit is reported as 0 (no limit set),
        # which would otherwise divide to +Inf and always fire
        expr: (container_memory_usage_bytes{name=~".+"} / (container_spec_memory_limit_bytes{name=~".+"} > 0)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high memory usage ({{ $value | humanize }}%)"
          description: "Container {{ $labels.name }} is using {{ $value | humanize }}% of its memory limit."

      # High CPU usage
      - alert: ContainerHighCPU
        expr: (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"

      # Container restarting
      - alert: ContainerRestarting
        expr: increase(container_restart_count[1h]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is restarting frequently"
          description: "Container {{ $labels.name }} has restarted {{ $value }} times in the last hour."

  - name: host-alerts
    rules:
      # Disk space low
      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host disk space low ({{ $value }}% remaining)"

      # High host memory
      - alert: HostHighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host memory usage above 90%"
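These expressions are ordinary arithmetic over gauge values. Reproduced outside PromQL with made-up numbers, the two memory rules reduce to:

```python
def container_mem_pct(usage_bytes, limit_bytes):
    # Mirrors ContainerHighMemory:
    # (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
    return usage_bytes / limit_bytes * 100

def host_mem_pct(avail_bytes, total_bytes):
    # Mirrors HostHighMemory: (1 - MemAvailable / MemTotal) * 100
    return (1 - avail_bytes / total_bytes) * 100
```

What "for: 5m" adds on top is persistence: the condition must hold on every evaluation for five minutes before the alert fires, which filters out brief spikes.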

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack'

  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
      repeat_interval: 1h

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#critical-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

Grafana Dashboard Provisioning

Auto-configure Grafana to connect to Prometheus on startup:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

For dashboards, import the community Docker dashboard (ID: 193) from Grafana.com, or the cAdvisor dashboard (ID: 14282). These provide pre-built visualizations for all the key container metrics.

Useful Prometheus Queries (PromQL)

Here are PromQL queries you'll use frequently when building dashboards or debugging issues:

# CPU usage per container (percentage of one core)
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100

# Memory usage per container
container_memory_usage_bytes{name=~".+"}

# Memory usage as percentage of limit
(container_memory_usage_bytes{name=~".+"} / container_spec_memory_limit_bytes{name=~".+"}) * 100

# Network received bytes per second
rate(container_network_receive_bytes_total{name=~".+"}[5m])

# Network transmitted bytes per second
rate(container_network_transmit_bytes_total{name=~".+"}[5m])

# Disk reads per second
rate(container_fs_reads_total{name=~".+"}[5m])

# Number of running containers
count(container_last_seen{name=~".+"})

# Top 5 containers by CPU usage
topk(5, rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100)

# Top 5 containers by memory usage
topk(5, container_memory_usage_bytes{name=~".+"})
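One detail worth internalizing: rate() turns a cumulative counter into a per-second average over the window, which is what makes the CPU query a percentage of one core. A simplified stdlib illustration of the same calculation (real rate() additionally handles counter resets and extrapolates to the window boundaries):

```python
def cpu_percent(samples):
    """Approximate rate(container_cpu_usage_seconds_total[w]) * 100 from raw
    (timestamp_seconds, cumulative_cpu_seconds) samples inside the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0) * 100

# e.g. 30 CPU-seconds consumed over a 300-second window is 10% of one core
```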

Lightweight Alternatives

The full Prometheus + Grafana stack is powerful but might be overkill for a single-server setup. Here are lighter alternatives:

Built-in Docker Management Platform Monitoring

Docker management platforms like usulnet include built-in resource monitoring. You get CPU, memory, network, and disk metrics for every container directly in the management UI, no additional tools needed. This is often sufficient for small to medium deployments.

Glances

Glances is a system monitoring tool with Docker support. It's lighter than the full Prometheus stack and provides a web UI:

docker run -d \
  --name glances \
  --restart=unless-stopped \
  -p 61208:61208 \
  -e GLANCES_OPT="-w" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  nicolargo/glances:latest

ctop

For a terminal-based experience, ctop provides a top-like interface for containers:

# Install (check https://github.com/bcicen/ctop/releases for the latest version; release assets are versioned)
sudo wget https://github.com/bcicen/ctop/releases/download/v0.7.7/ctop-0.7.7-linux-amd64 -O /usr/local/bin/ctop
sudo chmod +x /usr/local/bin/ctop

# Run
ctop

Monitoring Best Practices

  1. Monitor the monitoring. If Prometheus goes down, you won't get alerts. Set up a simple external check (like Uptime Kuma or a cron + curl) to verify your monitoring stack is running.
  2. Set retention policies. Prometheus data grows fast. Set --storage.tsdb.retention.time=30d to automatically expire old data.
  3. Use recording rules for frequently computed queries. This pre-computes the result and makes dashboards load faster.
  4. Don't alert on everything. Alert fatigue is real. Only alert on conditions that require human intervention. Use dashboards for everything else.
  5. Label consistently. Use consistent labels across containers (environment, service, team) so you can filter and group metrics effectively.
  6. Monitor resource limits, not just usage. A container using 200 MB of memory is fine if the limit is 1 GB, but critical if the limit is 256 MB.
  7. Set up log aggregation alongside metrics. Metrics tell you something is wrong. Logs tell you why. Consider adding Loki (by Grafana Labs) to your monitoring stack for log aggregation.

Quick win: If you're just getting started with monitoring and don't want to set up the full Prometheus stack, deploy usulnet for built-in container monitoring alongside management. You can always add Prometheus + Grafana later when you need historical data and custom dashboards.

Conclusion

Docker monitoring isn't a nice-to-have; it's a requirement for any production deployment. Start with docker stats to understand your baselines, then graduate to the Prometheus + Grafana stack as your needs grow. The key is to start somewhere: even basic monitoring is infinitely better than none.

Set up alerts for the conditions that actually matter (container down, high memory, disk space), build dashboards for the metrics you check regularly, and resist the urge to alert on everything. Your on-call rotation will thank you.