Monitoring Docker Swarm: Metrics, Logging and Alerting at Scale
You cannot operate what you cannot observe. This principle becomes especially acute in Docker Swarm, where services are distributed across multiple nodes, tasks are rescheduled automatically, and container logs are scattered across the cluster. Without a monitoring stack, you are debugging production by SSH-ing into nodes and running docker logs one container at a time.
This guide builds a complete observability stack for Docker Swarm using Prometheus (metrics), Grafana (visualization), Loki (logging), and Alertmanager (alerting). All components are deployed as Swarm services, dog-fooding the very infrastructure they monitor.
The Monitoring Architecture
| Component | Purpose | Deploy Mode | Scrape Targets |
|---|---|---|---|
| Prometheus | Metrics collection and storage | Replicated (1) | cAdvisor, node-exporter, Docker daemon |
| cAdvisor | Container resource metrics | Global (every node) | N/A (scraped by Prometheus) |
| node-exporter | Host-level metrics | Global (every node) | N/A (scraped by Prometheus) |
| Grafana | Dashboards and visualization | Replicated (1) | N/A (queries Prometheus/Loki) |
| Loki | Log aggregation | Replicated (1) | N/A (receives from Promtail) |
| Promtail | Log collection agent | Global (every node) | N/A (ships to Loki) |
| Alertmanager | Alert routing and deduplication | Replicated (1) | N/A (receives from Prometheus) |
Deploying the Monitoring Stack
Here is the complete stack file for the monitoring infrastructure:
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    configs:
      - source: prometheus_config
        target: /etc/prometheus/prometheus.yml
      - source: alert_rules
        target: /etc/prometheus/alert_rules.yml
    volumes:
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      resources:
        limits:
          cpus: "2.0"
          memory: 4G
        reservations:
          cpus: "0.5"
          memory: 1G

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
        reservations:
          cpus: "0.1"
          memory: 64M

  node-exporter:
    image: prom/node-exporter:v1.7.0
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.25"
          memory: 128M
        reservations:
          cpus: "0.05"
          memory: 32M

  grafana:
    image: grafana/grafana:10.4.0
    environment:
      GF_SECURITY_ADMIN_USER: admin
      # Grafana reads secrets from files via the double-underscore __FILE suffix
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_password
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
    configs:
      - source: grafana_datasources
        target: /etc/grafana/provisioning/datasources/datasources.yml
    secrets:
      - grafana_password
    ports:
      - "3000:3000"
    networks:
      - monitoring
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  loki:
    image: grafana/loki:2.9.5
    command: -config.file=/etc/loki/loki.yml
    configs:
      - source: loki_config
        target: /etc/loki/loki.yml
    volumes:
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - monitoring
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "1.0"
          memory: 1G

  promtail:
    image: grafana/promtail:2.9.5
    command: -config.file=/etc/promtail/promtail.yml
    configs:
      - source: promtail_config
        target: /etc/promtail/promtail.yml
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          cpus: "0.25"
          memory: 128M

  alertmanager:
    image: prom/alertmanager:v0.27.0
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    configs:
      - source: alertmanager_config
        target: /etc/alertmanager/alertmanager.yml
    volumes:
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    networks:
      - monitoring
    deploy:
      replicas: 1

networks:
  monitoring:
    driver: overlay
    attachable: true

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  alertmanager_data:

configs:
  prometheus_config:
    file: ./prometheus/prometheus.yml
  alert_rules:
    file: ./prometheus/alert_rules.yml
  grafana_datasources:
    file: ./grafana/datasources.yml
  loki_config:
    file: ./loki/loki.yml
  promtail_config:
    file: ./promtail/promtail.yml
  alertmanager_config:
    file: ./alertmanager/alertmanager.yml

secrets:
  grafana_password:
    external: true
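The configs: section above references local files, so the stack directory needs a matching layout before you deploy. A sketch of creating it (directory names are only a suggestion; they must match the file: paths you use):

```shell
#!/bin/sh
set -e
# Create the directory layout referenced by the configs: section
mkdir -p prometheus grafana loki promtail alertmanager
# Placeholder files; fill each with the configs shown in the sections below
touch prometheus/prometheus.yml prometheus/alert_rules.yml \
      grafana/datasources.yml loki/loki.yml \
      promtail/promtail.yml alertmanager/alertmanager.yml
ls -R .
```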
Prometheus Configuration for Swarm
The key challenge with Prometheus in Swarm is service discovery. Prometheus needs to find all cAdvisor and node-exporter instances automatically as nodes join or leave the cluster.
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert_rules.yml"

scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape cAdvisor on all nodes via DNS
  - job_name: "cadvisor"
    dns_sd_configs:
      - names:
          - "tasks.cadvisor"
        type: "A"
        port: 8080
    metrics_path: /metrics

  # Scrape node-exporter on all nodes via DNS
  - job_name: "node-exporter"
    dns_sd_configs:
      - names:
          - "tasks.node-exporter"
        type: "A"
        port: 9100

  # Scrape Docker daemon metrics (enable in daemon.json)
  - job_name: "docker"
    dns_sd_configs:
      - names:
          - "tasks.node-exporter" # Co-located on every node
        type: "A"
        port: 9323

  # Scrape application services with metrics endpoints
  - job_name: "app-services"
    dns_sd_configs:
      - names:
          - "tasks.api"
        type: "A"
        port: 9090
    metrics_path: /metrics
{"metrics-addr": "0.0.0.0:9323", "experimental": true} to /etc/docker/daemon.json on every Swarm node. This exposes container, image, and network metrics directly from the Docker engine.
Essential Alert Rules
# prometheus/alert_rules.yml
groups:
  - name: swarm_cluster
    rules:
      - alert: SwarmNodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Swarm node is unreachable"
          description: "Node {{ $labels.instance }} has been down for 2 minutes"

      # Note: this counts all reachable nodes, not managers specifically
      - alert: SwarmManagerQuorumAtRisk
        expr: count(up{job="node-exporter"} == 1) < 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Manager quorum at risk"
          description: "Fewer than 2 nodes are reachable; manager Raft quorum may be lost"

  - name: container_alerts
    rules:
      - alert: ContainerHighCPU
        expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"
          description: "Container {{ $labels.name }} CPU usage above 90% for 5 minutes"

      - alert: ContainerHighMemory
        expr: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high memory usage"
          description: "Container {{ $labels.name }} memory at {{ humanizePercentage $value }}"

      - alert: ContainerOOMKilled
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} OOM killed"
          description: "Container {{ $labels.name }} was OOM killed"

      # Assumes an exporter that publishes docker_swarm_tasks_* series;
      # the engine's built-in metrics do not expose these
      - alert: ServiceReplicasMismatch
        expr: docker_swarm_tasks_running != docker_swarm_tasks_desired
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service has fewer replicas than desired"
          description: "Service {{ $labels.service_name }}: {{ $value }} running vs desired"

  - name: host_alerts
    rules:
      - alert: HostHighDiskUsage
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has less than 15% disk space remaining"

      - alert: HostHighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "{{ $labels.instance }} memory usage above 90%"
Centralized Logging with Loki
Loki collects and indexes logs from all containers across the Swarm. Unlike Elasticsearch, Loki only indexes metadata (labels), not the log content, making it significantly cheaper to operate.
# loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 14d
  max_query_series: 5000

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
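Once deployed, Loki's readiness endpoint (reachable on port 3100, which the stack file publishes) gives a quick health signal, and the labels API confirms logs are actually being indexed:

```shell
#!/bin/sh
# Returns "ready" once Loki has finished starting up
curl -s http://localhost:3100/ready

# Lists the label names Loki has indexed so far
# (expect service, stack, container, node_id from the Promtail config below)
curl -s http://localhost:3100/loki/api/v1/labels
```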
# promtail/promtail.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Extract container name
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"
      # Extract Swarm service name
      - source_labels: ["__meta_docker_container_label_com_docker_swarm_service_name"]
        target_label: "service"
      # Extract Swarm stack name
      - source_labels: ["__meta_docker_container_label_com_docker_stack_namespace"]
        target_label: "stack"
      # Extract node ID
      - source_labels: ["__meta_docker_container_label_com_docker_swarm_node_id"]
        target_label: "node_id"
    pipeline_stages:
      - docker: {}
      - timestamp:
          source: time
          format: RFC3339Nano
Grafana Dashboard Configuration
# grafana/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
Key Dashboard Panels
Build these essential panels in your Swarm monitoring dashboard:
| Panel | PromQL Query | Visualization |
|---|---|---|
| Cluster CPU Usage | sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) / count(node_cpu_seconds_total{mode="idle"}) * 100 | Gauge |
| Cluster Memory | sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes) * 100 | Gauge |
| Container CPU by Service | sum by (container_label_com_docker_swarm_service_name)(rate(container_cpu_usage_seconds_total[5m])) | Time series |
| Container Memory by Service | sum by (container_label_com_docker_swarm_service_name)(container_memory_usage_bytes) | Time series |
| Network I/O | sum by (name)(rate(container_network_receive_bytes_total[5m])) | Time series |
| Container Restarts | increase(container_restart_count[1h]) | Stat |
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack"

receivers:
  - name: "default"
    webhook_configs:
      - url: "http://alertmanager-webhook:8080/webhook"
  - name: "slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        severity: '{{ .GroupLabels.severity }}'
Service-Level Monitoring
Beyond infrastructure metrics, instrument your application services to expose business-level metrics:
// Example: Prometheus metrics in a Go service
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

func main() {
	// A sample instrumented handler: record each request against the counter
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		httpRequestsTotal.WithLabelValues(r.Method, "/ping", "200").Inc()
		w.Write([]byte("pong"))
	})

	// Expose the /metrics endpoint for Prometheus to scrape
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
Deploying the Stack
# Create the Grafana admin password secret
echo "your-secure-password" | docker secret create grafana_password -
# Deploy the monitoring stack
docker stack deploy -c monitoring-stack.yml monitoring
# Verify all services are running
docker stack services monitoring
# Check global services have tasks on all nodes
docker service ps monitoring_cadvisor
docker service ps monitoring_node-exporter
docker service ps monitoring_promtail
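After the services settle, confirm Prometheus actually discovered every scrape target via its HTTP API. A quick sketch, assuming jq is installed on the host:

```shell
#!/bin/sh
set -e
# List each discovered target with its job, instance, and health
# (every row should end in "up" once the cluster is healthy)
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job) \(.labels.instance) \(.health)"'
```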
usulnet for Swarm Visibility
While Prometheus and Grafana provide deep metrics and alerting, they require significant configuration and maintenance. usulnet provides an alternative approach for teams that want immediate visibility into their Swarm cluster without building a monitoring stack from scratch.
usulnet connects to your Docker Swarm nodes and provides real-time dashboards showing container health, resource utilization, service status, and deployment history. It complements rather than replaces Prometheus: use usulnet for operational visibility and quick troubleshooting, and Prometheus for deep metrics analysis and long-term trending.
Log Querying with Loki
# Query logs for a specific service in Grafana (LogQL)
{service="myapp_api"} |= "error"
# Filter by stack and service
{stack="myapp", service="myapp_api"} | json | level="error"
# Rate of error logs per minute
rate({service="myapp_api"} |= "error" [1m])
# Top 10 most common error messages
topk(10, sum by (message)(count_over_time({service="myapp_api"} |= "error" | json | line_format "{{.message}}" [1h])))
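The same LogQL works outside Grafana through Loki's HTTP API, which is handy for scripting. A sketch (the service name is illustrative, and `date -d` is GNU date; adjust on macOS/BSD):

```shell
#!/bin/sh
set -e
# Last 10 error lines for a service over the past hour,
# queried via Loki's query_range API (start is in nanoseconds)
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={service="myapp_api"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode 'limit=10' \
  | jq -r '.data.result[].values[][1]'
```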
Conclusion
A production Swarm cluster needs three layers of observability: metrics (Prometheus + cAdvisor + node-exporter), logs (Loki + Promtail), and alerting (Alertmanager). Deploy all components as Swarm services, use DNS-based service discovery for automatic target registration, and set resource limits on monitoring services to prevent them from competing with your applications.
Start with the stack file in this guide, customize the alert rules for your SLOs, and build Grafana dashboards that answer the questions your team actually asks during incidents. The monitoring stack itself should be the most reliable thing in your cluster.