Docker Swarm High Availability: Building Resilient Container Infrastructure
High availability in Docker Swarm means your services survive node failures, network partitions, and rolling updates without dropping user requests. Swarm provides the building blocks -- Raft consensus for manager HA, automatic task rescheduling, health checks, and restart policies -- but assembling them into a truly resilient architecture requires deliberate design.
This guide covers every layer of HA: the control plane (manager quorum), the data plane (service distribution), application-level resilience (health checks and restart policies), external load balancer integration, and planning for the failure scenarios that will inevitably occur.
Manager High Availability: Raft Consensus
The Swarm control plane is a distributed state machine built on the Raft consensus algorithm. All cluster state (services, tasks, secrets, configs, networks) is stored in the Raft log, which is replicated across all manager nodes.
How Raft Works in Swarm
- One manager is elected as the leader. All write operations go through the leader.
- The leader replicates every write to the follower managers via the Raft log.
- A write is committed only when a majority (quorum) of managers acknowledge it.
- If the leader fails, the remaining managers hold an election. The first to receive votes from a majority becomes the new leader.
- During election, the cluster is briefly read-only (typically under 5 seconds). Running containers are not affected.
| Managers | Quorum | Tolerated Failures | Election Latency |
|---|---|---|---|
| 1 | 1 | 0 | N/A |
| 3 | 2 | 1 | 1-5 seconds |
| 5 | 3 | 2 | 1-5 seconds |
| 7 | 4 | 3 | 2-10 seconds |
Critical rule: Always use an odd number of managers. With an even number, you gain no additional fault tolerance but increase the Raft replication overhead. Three managers tolerate 1 failure; four managers also tolerate only 1 failure (quorum is 3 out of 4). You pay for more overhead with no benefit.
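The quorum arithmetic behind the table is simple enough to sketch. The following is illustrative Python, not any Docker API; it just shows why even manager counts buy nothing:

```python
def quorum(managers: int) -> int:
    """Raft commits a write only after a strict majority acknowledges it."""
    return managers // 2 + 1

def tolerated_failures(managers: int) -> int:
    """How many managers can fail while a majority still remains."""
    return managers - quorum(managers)

# 3 and 4 managers both tolerate exactly one failure:
tolerated_failures(3)  # 1
tolerated_failures(4)  # 1 (quorum is 3 of 4: more replication, same tolerance)
tolerated_failures(5)  # 2
```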
Manager Placement Strategy
Distribute managers across failure domains to maximize resilience:
```bash
# 3-manager deployment across 3 availability zones
docker node update --label-add zone=us-east-1a manager-01
docker node update --label-add zone=us-east-1b manager-02
docker node update --label-add zone=us-east-1c manager-03

# 5-manager deployment across 3 zones (2-2-1 split)
# Zone A: manager-01, manager-04
# Zone B: manager-02, manager-05
# Zone C: manager-03
# Loss of any single zone preserves quorum
```
Multi-Zone Deployment
A production Swarm cluster should distribute both managers and workers across multiple failure domains. Label every node with its zone and use placement preferences to spread replicas:
```bash
# Label all nodes with their zone
docker node update --label-add zone=us-east-1a manager-01
docker node update --label-add zone=us-east-1a worker-01
docker node update --label-add zone=us-east-1a worker-02
docker node update --label-add zone=us-east-1b manager-02
docker node update --label-add zone=us-east-1b worker-03
docker node update --label-add zone=us-east-1b worker-04
docker node update --label-add zone=us-east-1c manager-03
docker node update --label-add zone=us-east-1c worker-05
docker node update --label-add zone=us-east-1c worker-06
```
```yaml
version: "3.8"
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      placement:
        preferences:
          - spread: node.labels.zone
        constraints:
          - node.role == worker
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
        reservations:
          cpus: "0.5"
          memory: 256M
```
With spread: node.labels.zone and 6 replicas across 3 zones, Swarm will place 2 replicas in each zone. If an entire zone fails, 4 replicas remain running, maintaining 66% capacity.
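The capacity math generalizes: with an even spread, losing one zone costs you roughly replicas/zones. A small illustrative sketch (plain arithmetic, not a Docker API):

```python
def surviving_replicas(replicas: int, zones: int, zones_lost: int = 1) -> int:
    """Assumes the spread preference placed replicas evenly across zones."""
    per_zone = replicas // zones
    return replicas - zones_lost * per_zone

surviving_replicas(6, 3)   # 4 replicas remain after losing one of three zones
surviving_replicas(6, 2)   # 3 remain: fewer zones means a bigger hit per zone
```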
Anti-Affinity Constraints
Docker Swarm does not have expressive anti-affinity rules like Kubernetes pod anti-affinity. However, you can approximate anti-affinity using the spread placement preference combined with node labels (and, in compose file version 3.8+, cap density with max_replicas_per_node):
```bash
# Spread across hostnames (one replica per node)
docker service create \
  --name cache \
  --replicas 3 \
  --placement-pref 'spread=node.hostname' \
  redis:7

# Spread across racks
docker node update --label-add rack=rack1 worker-01
docker node update --label-add rack=rack1 worker-02
docker node update --label-add rack=rack2 worker-03
docker node update --label-add rack=rack2 worker-04

docker service create \
  --name api \
  --replicas 4 \
  --placement-pref 'spread=node.labels.rack' \
  myapp/api:latest
```
For strict anti-affinity (exactly one replica per node), use global mode:
```bash
# Exactly one replica on every eligible node
docker service create \
  --name cache \
  --mode global \
  --constraint 'node.labels.cache == true' \
  redis:7
```
Health Checks and Restart Policies
Health checks and restart policies form the self-healing layer of Swarm HA. Without them, a process that deadlocks or enters an error state will appear "running" to Swarm indefinitely.
Health Check Configuration
```yaml
version: "3.8"
services:
  api:
    image: myapp/api:v2.1.0
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    deploy:
      replicas: 6
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
        window: 120s
```
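The wget --spider probe only needs an HTTP 200 from inside the container. A minimal readiness endpoint might look like the following sketch (illustrative Python; the /health/ready path matches this guide's config, but the dependency checks are hypothetical placeholders for your own logic):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"db": True, "cache": True}   # hypothetical dependency states

def ready_status(path: str, deps=READY) -> int:
    """200 only when the probe path matches and every dependency is up."""
    return 200 if path == "/health/ready" and all(deps.values()) else 503

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(ready_status(self.path))
        self.end_headers()

    do_HEAD = do_GET                   # some probes use HEAD instead of GET

    def log_message(self, *args):      # keep probe noise out of the logs
        pass

# To serve: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning 503 when a dependency is down lets Swarm's health check fail the task and trigger the restart policy, rather than leaving a half-broken replica in rotation.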
Restart Policy Parameters
| Parameter | Description | Recommended |
|---|---|---|
| `condition` | `none`, `on-failure`, or `any` | `on-failure` for most services |
| `delay` | Wait time between restart attempts | 5-15s (prevents restart storms) |
| `max_attempts` | Maximum restart attempts within window | 3-5 (avoids infinite restarts) |
| `window` | Time window for counting restart attempts | 120-300s |
When max_attempts is reached within the window, the task is marked as failed and Swarm schedules a new task (potentially on a different node). This is the mechanism that migrates workloads away from unhealthy nodes.
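The max_attempts-within-window behavior can be modeled as a sliding-window counter. This is an illustrative sketch of the policy's semantics, not Swarm's actual implementation:

```python
def should_restart(failure_times, now, max_attempts=5, window=120.0):
    """Restart in place while fewer than max_attempts failures fall inside
    the window; otherwise give up so the task is rescheduled elsewhere."""
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) < max_attempts

crashes = [0, 10, 20, 30, 40]     # five failures in 40 seconds
should_restart(crashes, now=45)   # False: limit hit, task marked failed
should_restart(crashes, now=200)  # True: the old failures aged out of the window
```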
Update Strategies for HA
```yaml
version: "3.8"
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 1
        delay: 30s
        failure_action: rollback
        monitor: 60s
        max_failure_ratio: 0.1
        order: start-first
      rollback_config:
        parallelism: 2
        delay: 5s
        order: stop-first
```
Key HA decisions in update configuration:
- `order: start-first` ensures new tasks are healthy before old ones are stopped, maintaining full capacity throughout the update
- `parallelism: 1` with 6 replicas means only ~17% of capacity is in transition at any time
- `monitor: 60s` gives each new task a full minute to prove stability before proceeding
- `failure_action: rollback` automatically reverts if the update fails, preventing a bad deployment from taking down the service
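The capacity effect of the order setting can be seen with a little arithmetic. An illustrative sketch, assuming each update batch completes before the next begins:

```python
def min_capacity_during_update(replicas: int, parallelism: int, order: str) -> int:
    """Lowest healthy-replica count while an update batch is in transition."""
    if order == "start-first":
        return replicas            # replacement starts before the old task stops
    return replicas - parallelism  # stop-first removes the batch first

min_capacity_during_update(6, 1, "start-first")  # 6: full capacity held
min_capacity_during_update(6, 1, "stop-first")   # 5: one replica briefly missing
```

The trade-off: start-first needs enough spare node resources to run old and new tasks side by side for the duration of the monitor period.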
Load Balancer Integration
The Swarm ingress routing mesh provides basic load balancing, but production HA requires an external load balancer for several reasons:
- TLS termination at the edge
- Health checking of Swarm nodes themselves (not just containers)
- Geographic load balancing across regions
- Rate limiting and DDoS protection
HAProxy External Load Balancer
```
# /etc/haproxy/haproxy.cfg
global
    maxconn 50000
    log /dev/log local0

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httpchk GET /health
    option forwardfor

frontend http-in
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/wildcard.pem
    redirect scheme https code 301 if !{ ssl_fc }
    default_backend swarm_nodes

backend swarm_nodes
    balance roundrobin
    option httpchk GET /health HTTP/1.1\r\nHost:\ api.example.com
    http-check expect status 200
    # All Swarm nodes (routing mesh handles internal distribution)
    server node-01 10.0.1.10:80 check inter 5s fall 3 rise 2
    server node-02 10.0.1.11:80 check inter 5s fall 3 rise 2
    server node-03 10.0.1.12:80 check inter 5s fall 3 rise 2
    server node-04 10.0.2.10:80 check inter 5s fall 3 rise 2
    server node-05 10.0.2.11:80 check inter 5s fall 3 rise 2
    server node-06 10.0.2.12:80 check inter 5s fall 3 rise 2
```
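The check inter 5s fall 3 rise 2 parameters implement hysteresis: a node must fail three consecutive probes (about 15 seconds) to be marked down, and pass two to come back, so a single dropped probe never flaps the server. A small state-machine sketch of that logic (illustrative, not HAProxy source):

```python
def final_state(results, fall=3, rise=2, up=True):
    """Fold a sequence of probe results (True = pass) into the server state."""
    streak = 0
    for ok in results:
        if ok == up:
            streak = 0                           # result matches state: reset
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up, streak = not up, 0           # threshold reached: flip state
    return up

final_state([True, False, False, True, False])   # True: no 3-failure run
final_state([True, False, False, False])         # False: down after 3 straight fails
```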
Cloud Load Balancer (AWS ALB)
```hcl
# Terraform example: ALB targeting all Swarm worker nodes
resource "aws_lb" "swarm" {
  name               = "swarm-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

resource "aws_lb_target_group" "swarm" {
  name     = "swarm-targets"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 10
    timeout             = 5
  }
}

# Register all Swarm worker nodes
resource "aws_lb_target_group_attachment" "workers" {
  for_each = toset(var.worker_instance_ids)

  target_group_arn = aws_lb_target_group.swarm.arn
  target_id        = each.value
  port             = 80
}
```
Failure Scenarios and Recovery
Scenario 1: Single Worker Node Failure
| Phase | Duration | What Happens |
|---|---|---|
| Detection | 5-30 seconds | Manager detects node is unreachable |
| Task rescheduling | 5-10 seconds | Tasks are rescheduled to healthy nodes |
| Container startup | Application-dependent | New containers pull image and start |
| Health check pass | start_period + interval | Service is fully recovered |
Impact: Temporary reduction in capacity. With 6 replicas across 3 zones, losing 1 node drops capacity to ~66%. The external load balancer stops routing to the failed node within 15-30 seconds.
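The phases in the table add up to a worst-case recovery budget you can check your SLOs against. Illustrative arithmetic using the upper bounds from the table and the health-check values used in this guide; container_start is a placeholder for your own image pull and boot time:

```python
def worst_case_recovery(detection=30, reschedule=10, container_start=20,
                        start_period=30, interval=10):
    """Seconds from node loss until a replacement task passes its health check."""
    return detection + reschedule + container_start + start_period + interval

worst_case_recovery()  # 100 seconds with these assumed values
```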
Scenario 2: Single Manager Failure (Non-Leader)
Impact: None. The cluster continues operating with reduced fault tolerance. Running services are unaffected. Restore the manager node when possible to maintain quorum resilience.
Scenario 3: Leader Manager Failure
Impact: 1-5 second pause in cluster operations while a new leader is elected. Running containers are unaffected. New deployments and service updates are briefly delayed.
```bash
# Monitor leader election
docker events --filter type=node | grep "manager"

# Verify new leader was elected
docker node ls --filter role=manager \
  --format '{{.Hostname}} {{.ManagerStatus}}'
```
Scenario 4: Entire Availability Zone Failure
Impact with proper distribution: Loss of ~33% of nodes. Manager quorum preserved (2 out of 3 managers in remaining zones). Services with spread: node.labels.zone lose ~33% of replicas, which are rescheduled to remaining zones.
```bash
# Verify service distribution after zone loss
docker service ps myapp_api \
  --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"

# Scale up temporarily if remaining capacity is insufficient
docker service scale myapp_api=9
```
Scenario 5: Quorum Loss (Majority of Managers Down)
Impact: Cluster operations halt. Running containers continue running but cannot be managed. No new tasks can be scheduled, no updates can be performed.
```bash
# Recovery: force a new cluster from a surviving manager
docker swarm init --force-new-cluster --advertise-addr 10.0.1.10

# Add new managers immediately
docker swarm join-token manager
# Run the printed join command on the new manager nodes
```
HA Architecture Reference
```yaml
# Complete HA stack deployment
version: "3.8"

services:
  web:
    image: myapp/web:v2.1.0
    deploy:
      replicas: 4
      placement:
        constraints:
          - node.role == worker
        preferences:
          - spread: node.labels.zone
      update_config:
        parallelism: 1
        delay: 15s
        failure_action: rollback
        monitor: 30s
        order: start-first
      rollback_config:
        parallelism: 2
        delay: 5s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
        window: 120s
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
    ports:
      - "80:8080"
    networks:
      - frontend
      - backend

  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      placement:
        constraints:
          - node.role == worker
        preferences:
          - spread: node.labels.zone
      update_config:
        parallelism: 2
        delay: 20s
        failure_action: rollback
        monitor: 45s
        order: start-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
        window: 120s
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
        reservations:
          cpus: "0.5"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    networks:
      - backend
      - database
    secrets:
      - db_password
      - api_key

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    driver_opts:
      encrypted: "true"
  database:
    driver: overlay
    driver_opts:
      encrypted: "true"
    internal: true

secrets:
  db_password:
    external: true
  api_key:
    external: true
```
usulnet enhances Swarm HA by providing a management layer that monitors node health, service distribution, and replica counts across all nodes. When a failure occurs, usulnet shows exactly which services are affected, where tasks have been rescheduled, and whether the cluster has recovered to its desired state. This visibility is essential for operating HA infrastructure with confidence.
HA Checklist
- 3 or 5 managers distributed across availability zones
- Workers spread across zones with node labels for placement preferences
- Health checks on every service with appropriate start_period
- Restart policies with max_attempts to prevent restart storms
- Update configuration with start-first order and automatic rollback
- Resource reservations to prevent overcommitment
- External load balancer with health checks for TLS termination and node-level failover
- Monitoring and alerting for quorum status, node availability, and replica count mismatches
- Documented runbooks for every failure scenario listed above
- Regular failure testing: periodically drain nodes and kill managers to verify recovery
Conclusion
Docker Swarm HA is not a single feature you enable. It is the combination of proper manager distribution, service placement across failure domains, health checks that detect real failures, restart policies that recover from transient issues, update strategies that never reduce capacity, and external load balancers that route around node failures. Build each layer, test each failure scenario, and maintain runbooks for your on-call team. The goal is for node failures to be operational non-events that resolve themselves.