Monitoring Docker Swarm: Metrics, Logging and Alerting at Scale
You cannot operate what you cannot observe. This principle becomes especially acute in Docker Swarm, where services are distributed across multiple nodes, tasks are rescheduled automatically, and container logs are scattered across the cluster. Without a monitoring stack, you are debugging production by SSH-ing into nodes and running docker logs one container at a time.
This guide builds a complete observability stack for Docker Swarm using Prometheus (metrics), Grafana (visualization), Loki (logging), and Alertmanager (alerting). All components are deployed as Swarm services, dog-fooding the very infrastructure they monitor.
The Monitoring Architecture
| Component | Purpose | Deploy Mode | Scrape Targets |
|---|---|---|---|
| Prometheus | Metrics collection and storage | Replicated (1) | cAdvisor, node-exporter, Docker daemon |
| cAdvisor | Container resource metrics | Global (every node) | N/A (scraped by Prometheus) |
| node-exporter | Host-level metrics | Global (every node) | N/A (scraped by Prometheus) |
| Grafana | Dashboards and visualization | Replicated (1) | N/A (queries Prometheus/Loki) |
| Loki | Log aggregation | Replicated (1) | N/A (receives from Promtail) |
| Promtail | Log collection agent | Global (every node) | N/A (ships to Loki) |
| Alertmanager | Alert routing and deduplication | Replicated (1) | N/A (receives from Prometheus) |
Deploying the Monitoring Stack
Here is the complete stack file for the monitoring infrastructure:
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    configs:
      - source: prometheus_config
        target: /etc/prometheus/prometheus.yml
      - source: alert_rules
        target: /etc/prometheus/alert_rules.yml
    volumes:
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      resources:
        limits:
          cpus: "2.0"
          memory: 4G
        reservations:
          cpus: "0.5"
          memory: 1G

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
        reservations:
          cpus: "0.1"
          memory: 64M

  node-exporter:
    image: prom/node-exporter:v1.7.0
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.25"
          memory: 128M
        reservations:
          cpus: "0.05"
          memory: 32M

  grafana:
    image: grafana/grafana:10.4.0
    environment:
      GF_SECURITY_ADMIN_USER: admin
      # Grafana reads secrets from files via the double-underscore __FILE suffix
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_password
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
    configs:
      - source: grafana_datasources
        target: /etc/grafana/provisioning/datasources/datasources.yml
    secrets:
      - grafana_password
    ports:
      - "3000:3000"
    networks:
      - monitoring
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  loki:
    image: grafana/loki:2.9.5
    command: -config.file=/etc/loki/loki.yml
    configs:
      - source: loki_config
        target: /etc/loki/loki.yml
    volumes:
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - monitoring
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "1.0"
          memory: 1G

  promtail:
    image: grafana/promtail:2.9.5
    command: -config.file=/etc/promtail/promtail.yml
    configs:
      - source: promtail_config
        target: /etc/promtail/promtail.yml
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.25"
          memory: 128M

  alertmanager:
    image: prom/alertmanager:v0.27.0
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    configs:
      - source: alertmanager_config
        target: /etc/alertmanager/alertmanager.yml
    volumes:
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    networks:
      - monitoring
    deploy:
      replicas: 1

networks:
  monitoring:
    driver: overlay
    attachable: true

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  alertmanager_data:

configs:
  prometheus_config:
    file: ./prometheus/prometheus.yml
  alert_rules:
    file: ./prometheus/alert_rules.yml
  grafana_datasources:
    file: ./grafana/datasources.yml
  loki_config:
    file: ./loki/loki.yml
  promtail_config:
    file: ./promtail/promtail.yml
  alertmanager_config:
    file: ./alertmanager/alertmanager.yml

secrets:
  grafana_password:
    external: true
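The configs: section above references local files, so the stack directory needs a matching layout before you deploy. A sketch of creating it (directory names are only a suggestion; they must match the file: paths you use):

```shell
#!/bin/sh
set -e
# Create the directory layout referenced by the configs: section
mkdir -p prometheus grafana loki promtail alertmanager
# Placeholder files; fill each with the configs shown in the sections below
touch prometheus/prometheus.yml prometheus/alert_rules.yml \
      grafana/datasources.yml loki/loki.yml \
      promtail/promtail.yml alertmanager/alertmanager.yml
ls -R .
```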
Prometheus Configuration for Swarm
The key challenge with Prometheus in Swarm is service discovery. Prometheus needs to find all cAdvisor and node-exporter instances automatically as nodes join or leave the cluster.
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert_rules.yml"

scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape cAdvisor on all nodes via DNS
  - job_name: "cadvisor"
    dns_sd_configs:
      - names:
          - "tasks.cadvisor"
        type: "A"
        port: 8080
    metrics_path: /metrics

  # Scrape node-exporter on all nodes via DNS
  - job_name: "node-exporter"
    dns_sd_configs:
      - names:
          - "tasks.node-exporter"
        type: "A"
        port: 9100

  # Scrape Docker daemon metrics (enable in daemon.json)
  - job_name: "docker"
    dns_sd_configs:
      - names:
          - "tasks.node-exporter" # Co-located on every node
        type: "A"
        port: 9323

  # Scrape application services with metrics endpoints
  - job_name: "app-services"
    dns_sd_configs:
      - names:
          - "tasks.api"
        type: "A"
        port: 9090
    metrics_path: /metrics
{"metrics-addr": "0.0.0.0:9323", "experimental": true} to /etc/docker/daemon.json on every Swarm node. This exposes container, image, and network metrics directly from the Docker engine.
Essential Alert Rules
# prometheus/alert_rules.yml
groups:
  - name: swarm_cluster
    rules:
      - alert: SwarmNodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Swarm node is unreachable"
          description: "Node {{ $labels.instance }} has been down for 2 minutes"

      # Note: this counts all reachable nodes, not managers specifically
      - alert: SwarmManagerQuorumAtRisk
        expr: count(up{job="node-exporter"} == 1) < 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Manager quorum at risk"
          description: "Fewer than 2 nodes are reachable; manager Raft quorum may be lost"

  - name: container_alerts
    rules:
      - alert: ContainerHighCPU
        expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"
          description: "Container {{ $labels.name }} CPU usage above 90% for 5 minutes"

      - alert: ContainerHighMemory
        expr: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high memory usage"
          description: "Container {{ $labels.name }} memory at {{ humanizePercentage $value }}"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} OOM killed"
          description: "Container {{ $labels.name }} was OOM killed"

      # Assumes an exporter that publishes docker_swarm_tasks_* series;
      # the engine's built-in metrics do not expose these
      - alert: ServiceReplicasMismatch
        expr: docker_swarm_tasks_running != docker_swarm_tasks_desired
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service has fewer replicas than desired"
          description: "Service {{ $labels.service_name }}: {{ $value }} running vs desired"

  - name: host_alerts
    rules:
      - alert: HostHighDiskUsage
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has less than 15% disk space remaining"

      - alert: HostHighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "{{ $labels.instance }} memory usage above 90%"
Centralized Logging with Loki
Loki collects and indexes logs from all containers across the Swarm. Unlike Elasticsearch, Loki only indexes metadata (labels), not the log content, making it significantly cheaper to operate.
# loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 14d
  max_query_series: 5000

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
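Once deployed, Loki's readiness endpoint (reachable on port 3100, which the stack file publishes) gives a quick health signal, and the labels API confirms logs are actually being indexed:

```shell
#!/bin/sh
# Returns "ready" once Loki has finished starting up
curl -s http://localhost:3100/ready

# Lists the label names Loki has indexed so far
# (expect service, stack, container, node_id from the Promtail config below)
curl -s http://localhost:3100/loki/api/v1/labels
```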
# promtail/promtail.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Extract container name
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"
      # Extract Swarm service name
      - source_labels: ["__meta_docker_container_label_com_docker_swarm_service_name"]
        target_label: "service"
      # Extract Swarm stack name
      - source_labels: ["__meta_docker_container_label_com_docker_stack_namespace"]
        target_label: "stack"
      # Extract node ID
      - source_labels: ["__meta_docker_container_label_com_docker_swarm_node_id"]
        target_label: "node_id"
    pipeline_stages:
      - docker: {}
      - timestamp:
          source: time
          format: RFC3339Nano
Grafana Dashboard Configuration
# grafana/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
Key Dashboard Panels
Build these essential panels in your Swarm monitoring dashboard:
| Panel | PromQL Query | Visualization |
|---|---|---|
| Cluster CPU Usage | sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) / count(node_cpu_seconds_total{mode="idle"}) * 100 | Gauge |
| Cluster Memory | sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes) * 100 | Gauge |
| Container CPU by Service | sum by (container_label_com_docker_swarm_service_name)(rate(container_cpu_usage_seconds_total[5m])) | Time series |
| Container Memory by Service | sum by (container_label_com_docker_swarm_service_name)(container_memory_usage_bytes) | Time series |
| Network I/O | sum by (name)(rate(container_network_receive_bytes_total[5m])) | Time series |
| Container Restarts | increase(container_restart_count[1h]) | Stat |
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack"

receivers:
  - name: "default"
    webhook_configs:
      - url: "http://alertmanager-webhook:8080/webhook"
  - name: "slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        severity: '{{ .GroupLabels.severity }}'
Service-Level Monitoring
Beyond infrastructure metrics, instrument your application services to expose business-level metrics:
// Example: Prometheus metrics in a Go service
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

func main() {
	// A sample instrumented handler: record each request against the counter
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		httpRequestsTotal.WithLabelValues(r.Method, "/ping", "200").Inc()
		w.Write([]byte("pong"))
	})

	// Expose the /metrics endpoint for Prometheus to scrape
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
Deploying the Stack
# Create the Grafana admin password secret
echo "your-secure-password" | docker secret create grafana_password -
# Deploy the monitoring stack
docker stack deploy -c monitoring-stack.yml monitoring
# Verify all services are running
docker stack services monitoring
# Check global services have tasks on all nodes
docker service ps monitoring_cadvisor
docker service ps monitoring_node-exporter
docker service ps monitoring_promtail
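After the services settle, confirm Prometheus actually discovered every scrape target via its HTTP API. A quick sketch, assuming jq is installed on the host:

```shell
#!/bin/sh
set -e
# List each discovered target with its job, instance, and health
# (every row should end in "up" once the cluster is healthy)
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job) \(.labels.instance) \(.health)"'
```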
usulnet for Swarm Visibility
While Prometheus and Grafana provide deep metrics and alerting, they require significant configuration and maintenance. usulnet provides an alternative approach for teams that want immediate visibility into their Swarm cluster without building a monitoring stack from scratch.
usulnet connects to your Docker Swarm nodes and provides real-time dashboards showing container health, resource utilization, service status, and deployment history. It complements rather than replaces Prometheus: use usulnet for operational visibility and quick troubleshooting, and Prometheus for deep metrics analysis and long-term trending.
Log Querying with Loki
# Query logs for a specific service in Grafana (LogQL)
{service="myapp_api"} |= "error"
# Filter by stack and service
{stack="myapp", service="myapp_api"} | json | level="error"
# Rate of error logs per minute
rate({service="myapp_api"} |= "error" [1m])
# Top 10 most common error messages
topk(10, sum by (message)(count_over_time({service="myapp_api"} |= "error" | json | line_format "{{.message}}" [1h])))
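The same LogQL works outside Grafana through Loki's HTTP API, which is handy for scripting. A sketch (the service name is illustrative, and `date -d` is GNU date; adjust on macOS/BSD):

```shell
#!/bin/sh
set -e
# Last 10 error lines for a service over the past hour,
# queried via Loki's query_range API (start is in nanoseconds)
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={service="myapp_api"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode 'limit=10' \
  | jq -r '.data.result[].values[][1]'
```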
Conclusion
A production Swarm cluster needs three layers of observability: metrics (Prometheus + cAdvisor + node-exporter), logs (Loki + Promtail), and alerting (Alertmanager). Deploy all components as Swarm services, use DNS-based service discovery for automatic target registration, and set resource limits on monitoring services to prevent them from competing with your applications.
Start with the stack file in this guide, customize the alert rules for your SLOs, and build Grafana dashboards that answer the questions your team actually asks during incidents. The monitoring stack itself should be the most reliable thing in your cluster.