Docker Container Monitoring: Tools, Metrics, and Setup Guide
Containers are ephemeral by design. They start, run, crash, restart, and scale up and down. Without monitoring, you're flying blind: you won't know that your API container has been using 95% of its memory limit for the last hour, or that your database container has restarted 12 times today, or that network latency between services has spiked.
This guide covers everything you need to set up production-grade Docker monitoring: what metrics to track, which tools to use, how to configure them, and how to set up alerts so you know about problems before your users do.
What to Monitor
Docker monitoring happens at three levels: container metrics, host metrics, and application metrics. You need all three for a complete picture.
Container Metrics
| Metric | Why It Matters | Alert When |
|---|---|---|
| CPU usage (%) | High CPU means your container is under load or stuck in a loop | > 80% sustained for 5+ minutes |
| Memory usage (bytes) | Approaching the limit means OOM kill is imminent | > 85% of memory limit |
| Memory limit | Know what the cap is | No limit set (production should always have limits) |
| Network I/O (bytes in/out) | Unusual spikes indicate attacks, data exfiltration, or misconfiguration | > 2x normal baseline |
| Disk I/O (reads/writes) | High disk I/O can bottleneck the host | Sustained high I/O causing latency |
| Restart count | Frequent restarts indicate instability | > 3 restarts in 1 hour |
| Container state | Is it running, paused, or dead? | Any production container not in "running" state |
| Health check status | Is the container actually serving requests? | Status changes to "unhealthy" |
Host Metrics
- Disk space — Docker images, volumes, and build cache consume disk. Alert at 80% usage.
- Total CPU/Memory — the aggregate across all containers plus host processes.
- Docker daemon status — if the daemon goes down, all containers stop.
- Available file descriptors — each container uses file descriptors; running out crashes things.
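The host-level numbers above can be spot-checked from a shell before any monitoring stack exists; a minimal sketch, assuming a default Linux install (the Docker data root path is the default and may differ on your host):

```shell
#!/bin/sh
# Free space where Docker stores images and volumes (default data root)
df -h /var/lib/docker 2>/dev/null || df -h /

# Docker's own breakdown: images, containers, local volumes, build cache
command -v docker >/dev/null 2>&1 && docker system df || true

# Is the daemon up? (systemd hosts)
command -v systemctl >/dev/null 2>&1 && systemctl is-active docker || true

# Allocated vs. maximum file descriptors (Linux)
[ -r /proc/sys/fs/file-nr ] && cat /proc/sys/fs/file-nr || true
```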
Application Metrics
Container metrics tell you how many resources your app is consuming. Application metrics tell you whether it's actually working:
- Request rate and latency (p50, p95, p99)
- Error rate (5xx responses)
- Queue depth (for worker containers)
- Database connection pool utilization
- Custom business metrics
Quick Start: docker stats
Before you set up a full monitoring stack, Docker's built-in command gives you real-time metrics:
# Real-time stats for all containers
docker stats
# Stats for specific containers
docker stats myapp db redis
# One-shot (no streaming)
docker stats --no-stream
# Custom format
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
Output looks like:
NAME CPU % MEM USAGE / LIMIT NET I/O BLOCK I/O
myapp 2.35% 145.2MiB / 512MiB 1.2GB / 890MB 50MB / 10MB
db 0.85% 380.4MiB / 1GiB 500MB / 1.1GB 2.3GB / 1.8GB
redis 0.12% 28.1MiB / 256MiB 200MB / 180MB 0B / 512KB
docker stats is useful for quick checks but doesn't store historical data, generate graphs, or send alerts. For that, you need a proper monitoring stack.
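As a stopgap before the full stack is running, you can get rough history by appending one-shot snapshots from cron; a sketch (the script path in the comment and the log location are assumptions, not conventions):

```shell
#!/bin/sh
# Intended for cron, e.g.: */5 * * * * /usr/local/bin/stats-snapshot.sh  (hypothetical path)
LOG="${STATS_LOG:-/var/log/docker-stats.log}"

# Skip gracefully on hosts without Docker or a writable log directory
command -v docker >/dev/null 2>&1 || exit 0
[ -w "$(dirname "$LOG")" ] || exit 0

# Append a UTC timestamp and a one-shot snapshot of every container
{
  date -u +%Y-%m-%dT%H:%M:%SZ
  docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'
} >> "$LOG" || true
```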
The Prometheus + Grafana + cAdvisor Stack
The industry standard for container monitoring is the Prometheus ecosystem. Here's the architecture:
- cAdvisor — collects per-container resource metrics (read from the kernel's cgroups) and exposes them in Prometheus format
- Node Exporter — collects host-level metrics (CPU, memory, disk, network)
- Prometheus — scrapes metrics from cAdvisor and Node Exporter, stores time-series data, evaluates alert rules
- Grafana — visualizes metrics with dashboards and charts
- Alertmanager — routes alerts from Prometheus to Slack, email, PagerDuty, etc.
Full Docker Compose Setup
# docker-compose.monitoring.yml
version: "3.8"
services:
# Collects container metrics
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.1
container_name: cadvisor
restart: unless-stopped
privileged: true
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
ports:
- "127.0.0.1:8080:8080"
networks:
- monitoring
# Collects host metrics
node-exporter:
image: prom/node-exporter:v1.7.0
container_name: node-exporter
restart: unless-stopped
command:
- '--path.rootfs=/host'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /:/host:ro,rslave
ports:
- "127.0.0.1:9100:9100"
networks:
- monitoring
# Time-series database and alerting engine
prometheus:
image: prom/prometheus:v2.50.1
container_name: prometheus
restart: unless-stopped
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
- prometheus_data:/prometheus
ports:
- "127.0.0.1:9090:9090"
networks:
- monitoring
depends_on:
- cadvisor
- node-exporter
# Visualization
grafana:
image: grafana/grafana:10.3.3
container_name: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=changeme
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
ports:
- "127.0.0.1:3000:3000"
networks:
- monitoring
depends_on:
- prometheus
# Alert routing
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
restart: unless-stopped
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- alertmanager_data:/alertmanager
ports:
- "127.0.0.1:9093:9093"
networks:
- monitoring
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
networks:
monitoring:
name: monitoring
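With the config files in place (written in the sections that follow), the stack comes up with one command; a launch-and-verify sketch, assuming the compose file sits in the current directory:

```shell
#!/bin/sh
# Skip gracefully where Docker or the compose file is unavailable
command -v docker >/dev/null 2>&1 || exit 0
[ -f docker-compose.monitoring.yml ] || exit 0

# Start the whole monitoring stack in the background
docker compose -f docker-compose.monitoring.yml up -d

# Give the services a moment, then ask Prometheus about its scrape targets
sleep 5
TARGETS=$(curl -fsS http://127.0.0.1:9090/api/v1/targets)
echo "$TARGETS" | grep -o '"health":"[a-z]*"' | sort | uniq -c
```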
Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert-rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Container metrics from cAdvisor
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
metric_relabel_configs:
# Drop high-cardinality metrics to save storage
- source_labels: [__name__]
regex: 'container_tasks_state'
action: drop
# Host metrics from Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
# Docker daemon metrics (enable in daemon.json)
- job_name: 'docker'
static_configs:
- targets: ['host.docker.internal:9323']
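Because the compose file starts Prometheus with --web.enable-lifecycle, you can validate this file and hot-reload it without restarting the container; a sketch using the container paths from the compose setup:

```shell
#!/bin/sh
# Skip gracefully where the prometheus container is not running
command -v docker >/dev/null 2>&1 || exit 0
docker ps --format '{{.Names}}' 2>/dev/null | grep -qx prometheus || exit 0

# promtool ships inside the official image; lint the config before reloading
CHECK=$(docker exec prometheus promtool check config /etc/prometheus/prometheus.yml)
echo "$CHECK"

# --web.enable-lifecycle makes Prometheus re-read its config on this endpoint
curl -fsS -X POST http://127.0.0.1:9090/-/reload
```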
To enable Docker daemon metrics, add to /etc/docker/daemon.json:
{
"metrics-addr": "127.0.0.1:9323",
"experimental": true
}
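After editing daemon.json, restart the daemon and confirm the endpoint answers; a sketch that assumes a systemd host (on other platforms, restart Docker however your OS does it):

```shell
#!/bin/sh
# Skip gracefully on hosts without systemd or a running Docker service
command -v systemctl >/dev/null 2>&1 || exit 0
systemctl is-active docker >/dev/null 2>&1 || exit 0

# Apply the new daemon.json, then confirm the metrics endpoint answers
sudo systemctl restart docker
METRICS=$(curl -fsS http://127.0.0.1:9323/metrics | head -n 5)
echo "$METRICS"
```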
Alert Rules
# prometheus/alert-rules.yml
groups:
- name: container-alerts
rules:
# Container is down
- alert: ContainerDown
expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
for: 1m
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} is down"
description: "Container {{ $labels.name }} has been down for more than 1 minute."
# High memory usage
- alert: ContainerHighMemory
expr: (container_memory_usage_bytes{name=~".+"} / container_spec_memory_limit_bytes{name=~".+"}) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high memory usage ({{ printf \"%.1f\" $value }}%)"
description: "Container {{ $labels.name }} is using {{ printf \"%.1f\" $value }}% of its memory limit."
# High CPU usage
- alert: ContainerHighCPU
expr: (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high CPU usage"
# Container restarting
- alert: ContainerRestarting
expr: increase(container_restart_count[1h]) > 3
for: 0m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} is restarting frequently"
description: "Container {{ $labels.name }} has restarted {{ printf \"%.0f\" $value }} times in the last hour."
- name: host-alerts
rules:
# Disk space low
- alert: HostDiskSpaceLow
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 20
for: 5m
labels:
severity: warning
annotations:
summary: "Host disk space low ({{ printf \"%.1f\" $value }}% remaining)"
# High host memory
- alert: HostHighMemory
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Host memory usage above 90%"
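Rule files are easy to break with an indentation mistake; the promtool bundled in the Prometheus image can lint them before Prometheus loads them. A sketch, using the container paths from the compose file:

```shell
#!/bin/sh
# Skip gracefully where the prometheus container is not running
command -v docker >/dev/null 2>&1 || exit 0
docker ps --format '{{.Names}}' 2>/dev/null | grep -qx prometheus || exit 0

# Lint the alert rules; promtool reports each group and rule it parsed
RESULT=$(docker exec prometheus promtool check rules /etc/prometheus/alert-rules.yml)
echo "$RESULT"
```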
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'slack'
routes:
- match:
severity: critical
receiver: 'slack-critical'
repeat_interval: 1h
receivers:
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#monitoring'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'slack-critical'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#critical-alerts'
title: 'CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
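To verify the routing end to end without waiting for a real incident, you can push a synthetic alert straight into Alertmanager's v2 API; a sketch (the alert name is made up):

```shell
#!/bin/sh
# Skip gracefully if Alertmanager is not reachable locally
curl -fsS --max-time 2 http://127.0.0.1:9093/-/ready >/dev/null 2>&1 || exit 0

# Fire a throwaway warning-severity alert; it should arrive in #monitoring
curl -fsS -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"SyntheticTest","severity":"warning"},"annotations":{"description":"Synthetic test alert - safe to ignore"}}]' \
  && SENT=yes
echo "${SENT:-no}"
```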
Grafana Dashboard Provisioning
Auto-configure Grafana to connect to Prometheus on startup:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
For dashboards, import the community Docker dashboard (ID: 193) from Grafana.com, or the cAdvisor dashboard (ID: 14282). These provide pre-built visualizations for all the key container metrics.
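Dashboards can be provisioned from files the same way as datasources; a sketch of a dashboard provider config (the file name and dashboards path are assumptions — any directory mounted into the container works):

```yaml
# grafana/provisioning/dashboards/default.yml  (hypothetical file name)
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      # JSON dashboard files dropped here are loaded automatically
      path: /var/lib/grafana/dashboards
```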
Useful Prometheus Queries (PromQL)
Here are PromQL queries you'll use frequently when building dashboards or debugging issues:
# CPU usage per container (percentage of one core)
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100
# Memory usage per container
container_memory_usage_bytes{name=~".+"}
# Memory usage as percentage of limit
(container_memory_usage_bytes{name=~".+"} / container_spec_memory_limit_bytes{name=~".+"}) * 100
# Network received bytes per second
rate(container_network_receive_bytes_total{name=~".+"}[5m])
# Network transmitted bytes per second
rate(container_network_transmit_bytes_total{name=~".+"}[5m])
# Disk reads per second
rate(container_fs_reads_total{name=~".+"}[5m])
# Number of running containers
count(container_last_seen{name=~".+"})
# Top 5 containers by CPU usage
topk(5, rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100)
# Top 5 containers by memory usage
topk(5, container_memory_usage_bytes{name=~".+"})
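The same queries can be run from scripts through Prometheus's HTTP API, which is handy for ad-hoc checks or wiring into external health probes; a sketch:

```shell
#!/bin/sh
# Skip gracefully if Prometheus is not reachable locally
curl -fsS --max-time 2 http://127.0.0.1:9090/-/ready >/dev/null 2>&1 || exit 0

# Run a PromQL query over the HTTP API: how many containers report metrics?
RESPONSE=$(curl -fsS -G http://127.0.0.1:9090/api/v1/query \
  --data-urlencode 'query=count(container_last_seen{name=~".+"})')
echo "$RESPONSE"
```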
Lightweight Alternatives
The full Prometheus + Grafana stack is powerful but might be overkill for a single-server setup. Here are lighter alternatives:
Built-in Docker Management Platform Monitoring
Docker management platforms like usulnet include built-in resource monitoring. You get CPU, memory, network, and disk metrics for every container directly in the management UI, no additional tools needed. This is often sufficient for small to medium deployments.
Glances
Glances is a system monitoring tool with Docker support. It's lighter than the full Prometheus stack and provides a web UI:
docker run -d \
--name glances \
--restart=unless-stopped \
-p 61208:61208 \
-e GLANCES_OPT="-w" \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
nicolargo/glances:latest
ctop
For a terminal-based experience, ctop provides a top-like interface for containers:
# Install (check https://github.com/bcicen/ctop/releases for the newest version)
sudo wget https://github.com/bcicen/ctop/releases/download/v0.7.7/ctop-0.7.7-linux-amd64 -O /usr/local/bin/ctop
sudo chmod +x /usr/local/bin/ctop
# Run
ctop
Monitoring Best Practices
- Monitor the monitoring. If Prometheus goes down, you won't get alerts. Set up a simple external check (like Uptime Kuma or a cron + curl) to verify your monitoring stack is running.
- Set retention policies. Prometheus data grows fast. Set --storage.tsdb.retention.time=30d to automatically expire old data.
- Use recording rules for frequently computed queries. This pre-computes the result and makes dashboards load faster.
- Don't alert on everything. Alert fatigue is real. Only alert on conditions that require human intervention. Use dashboards for everything else.
- Label consistently. Use consistent labels across containers (environment, service, team) so you can filter and group metrics effectively.
- Monitor resource limits, not just usage. A container using 200 MB of memory is fine if the limit is 1 GB, but critical if the limit is 256 MB.
- Set up log aggregation alongside metrics. Metrics tell you something is wrong. Logs tell you why. Consider adding Loki (by Grafana Labs) to your monitoring stack for log aggregation.
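The recording-rule advice above can be sketched as an extra rule file (the file name is hypothetical; add it to rule_files in prometheus.yml alongside alert-rules.yml):

```yaml
# prometheus/recording-rules.yml  (hypothetical file name)
groups:
  - name: container-recording
    interval: 30s
    rules:
      # Pre-compute per-container CPU percentage so dashboards query one cheap series
      - record: container:cpu_usage_seconds:rate5m_percent
        expr: rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100
      # Pre-compute memory as a percentage of the configured limit
      - record: container:memory_usage:percent_of_limit
        expr: (container_memory_usage_bytes{name=~".+"} / container_spec_memory_limit_bytes{name=~".+"}) * 100
```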
Conclusion
Docker monitoring isn't a nice-to-have; it's a requirement for any production deployment. Start with docker stats to understand your baselines, then graduate to the Prometheus + Grafana stack as your needs grow. The key is to start somewhere: even basic monitoring is infinitely better than none.
Set up alerts for the conditions that actually matter (container down, high memory, disk space), build dashboards for the metrics you check regularly, and resist the urge to alert on everything. Your on-call rotation will thank you.