Docker Swarm High Availability: Building Resilient Container Infrastructure
High availability in Docker Swarm means your services survive node failures, network partitions, and rolling updates without dropping user requests. Swarm provides the building blocks -- Raft consensus for manager HA, automatic task rescheduling, health checks, and restart policies -- but assembling them into a truly resilient architecture requires deliberate design.
This guide covers every layer of HA: the control plane (manager quorum), the data plane (service distribution), application-level resilience (health checks and restart policies), external load balancer integration, and planning for the failure scenarios that will inevitably occur.
Manager High Availability: Raft Consensus
The Swarm control plane is a distributed state machine built on the Raft consensus algorithm. All cluster state (services, tasks, secrets, configs, networks) is stored in the Raft log, which is replicated across all manager nodes.
How Raft Works in Swarm
- One manager is elected as the leader. All write operations go through the leader.
- The leader replicates every write to the follower managers via the Raft log.
- A write is committed only when a majority (quorum) of managers acknowledge it.
- If the leader fails, the remaining managers hold an election. The first to receive votes from a majority becomes the new leader.
- During election, the cluster is briefly read-only (typically under 5 seconds). Running containers are not affected.
| Managers | Quorum | Tolerated Failures | Election Latency |
|---|---|---|---|
| 1 | 1 | 0 | N/A |
| 3 | 2 | 1 | 1-5 seconds |
| 5 | 3 | 2 | 1-5 seconds |
| 7 | 4 | 3 | 2-10 seconds |
Critical rule: Always use an odd number of managers. With an even number, you gain no additional fault tolerance but increase the Raft replication overhead. Three managers tolerate 1 failure; four managers also tolerate only 1 failure (quorum is 3 out of 4). You pay for more overhead with no benefit.
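The quorum arithmetic behind the table is simple enough to sketch. The following is illustrative Python, not any Docker API; it just shows why even manager counts buy nothing:

```python
def quorum(managers: int) -> int:
    """Raft commits a write only after a strict majority acknowledges it."""
    return managers // 2 + 1

def tolerated_failures(managers: int) -> int:
    """How many managers can fail while a majority still remains."""
    return managers - quorum(managers)

# 3 and 4 managers both tolerate exactly one failure:
tolerated_failures(3)  # 1
tolerated_failures(4)  # 1 (quorum is 3 of 4: more replication, same tolerance)
tolerated_failures(5)  # 2
```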
Manager Placement Strategy
Distribute managers across failure domains to maximize resilience:
```bash
# 3-manager deployment across 3 availability zones
docker node update --label-add zone=us-east-1a manager-01
docker node update --label-add zone=us-east-1b manager-02
docker node update --label-add zone=us-east-1c manager-03

# 5-manager deployment across 3 zones (2-2-1 split)
# Zone A: manager-01, manager-04
# Zone B: manager-02, manager-05
# Zone C: manager-03
# Loss of any single zone preserves quorum
```
Multi-Zone Deployment
A production Swarm cluster should distribute both managers and workers across multiple failure domains. Label every node with its zone and use placement preferences to spread replicas:
```bash
# Label all nodes with their zone
docker node update --label-add zone=us-east-1a manager-01
docker node update --label-add zone=us-east-1a worker-01
docker node update --label-add zone=us-east-1a worker-02
docker node update --label-add zone=us-east-1b manager-02
docker node update --label-add zone=us-east-1b worker-03
docker node update --label-add zone=us-east-1b worker-04
docker node update --label-add zone=us-east-1c manager-03
docker node update --label-add zone=us-east-1c worker-05
docker node update --label-add zone=us-east-1c worker-06
```
```yaml
version: "3.8"
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      placement:
        preferences:
          - spread: node.labels.zone
        constraints:
          - node.role == worker
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
        reservations:
          cpus: "0.5"
          memory: 256M
```
With spread: node.labels.zone and 6 replicas across 3 zones, Swarm will place 2 replicas in each zone. If an entire zone fails, 4 replicas remain running, maintaining 66% capacity.
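The capacity math generalizes: with an even spread, losing one zone costs you roughly replicas/zones. A small illustrative sketch (plain arithmetic, not a Docker API):

```python
def surviving_replicas(replicas: int, zones: int, zones_lost: int = 1) -> int:
    """Assumes the spread preference placed replicas evenly across zones."""
    per_zone = replicas // zones
    return replicas - zones_lost * per_zone

surviving_replicas(6, 3)   # 4 replicas remain after losing one of three zones
surviving_replicas(6, 2)   # 3 remain: fewer zones means a bigger hit per zone
```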
Anti-Affinity Constraints
Docker Swarm does not have expressive anti-affinity rules like Kubernetes pod anti-affinity. However, you can approximate anti-affinity using the spread placement preference combined with node labels (and, in compose file version 3.8+, cap density with max_replicas_per_node):
```bash
# Spread across hostnames (one replica per node)
docker service create \
  --name cache \
  --replicas 3 \
  --placement-pref 'spread=node.hostname' \
  redis:7

# Spread across racks
docker node update --label-add rack=rack1 worker-01
docker node update --label-add rack=rack1 worker-02
docker node update --label-add rack=rack2 worker-03
docker node update --label-add rack=rack2 worker-04

docker service create \
  --name api \
  --replicas 4 \
  --placement-pref 'spread=node.labels.rack' \
  myapp/api:latest
```
For strict anti-affinity (exactly one replica per node), use global mode:
```bash
# Exactly one replica on every eligible node
docker service create \
  --name cache \
  --mode global \
  --constraint 'node.labels.cache == true' \
  redis:7
```
Health Checks and Restart Policies
Health checks and restart policies form the self-healing layer of Swarm HA. Without them, a process that deadlocks or enters an error state will appear "running" to Swarm indefinitely.
Health Check Configuration
```yaml
version: "3.8"
services:
  api:
    image: myapp/api:v2.1.0
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    deploy:
      replicas: 6
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
        window: 120s
```
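The wget --spider probe only needs an HTTP 200 from inside the container. A minimal readiness endpoint might look like the following sketch (illustrative Python; the /health/ready path matches this guide's config, but the dependency checks are hypothetical placeholders for your own logic):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"db": True, "cache": True}   # hypothetical dependency states

def ready_status(path: str, deps=READY) -> int:
    """200 only when the probe path matches and every dependency is up."""
    return 200 if path == "/health/ready" and all(deps.values()) else 503

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(ready_status(self.path))
        self.end_headers()

    do_HEAD = do_GET                   # some probes use HEAD instead of GET

    def log_message(self, *args):      # keep probe noise out of the logs
        pass

# To serve: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning 503 when a dependency is down lets Swarm's health check fail the task and trigger the restart policy, rather than leaving a half-broken replica in rotation.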
Restart Policy Parameters
| Parameter | Description | Recommended |
|---|---|---|
| `condition` | `none`, `on-failure`, or `any` | `on-failure` for most services |
| `delay` | Wait time between restart attempts | 5-15s (prevents restart storms) |
| `max_attempts` | Maximum restart attempts within window | 3-5 (avoids infinite restarts) |
| `window` | Time window for counting restart attempts | 120-300s |
When max_attempts is reached within the window, the task is marked as failed and Swarm schedules a new task (potentially on a different node). This is the mechanism that migrates workloads away from unhealthy nodes.
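The max_attempts-within-window behavior can be modeled as a sliding-window counter. This is an illustrative sketch of the policy's semantics, not Swarm's actual implementation:

```python
def should_restart(failure_times, now, max_attempts=5, window=120.0):
    """Restart in place while fewer than max_attempts failures fall inside
    the window; otherwise give up so the task is rescheduled elsewhere."""
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) < max_attempts

crashes = [0, 10, 20, 30, 40]     # five failures in 40 seconds
should_restart(crashes, now=45)   # False: limit hit, task marked failed
should_restart(crashes, now=200)  # True: the old failures aged out of the window
```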
Update Strategies for HA
```yaml
version: "3.8"
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 1
        delay: 30s
        failure_action: rollback
        monitor: 60s
        max_failure_ratio: 0.1
        order: start-first
      rollback_config:
        parallelism: 2
        delay: 5s
        order: stop-first
```
Key HA decisions in update configuration:
- `order: start-first` ensures new tasks are healthy before old ones are stopped, maintaining full capacity throughout the update
- `parallelism: 1` with 6 replicas means only ~17% of capacity is in transition at any time
- `monitor: 60s` gives each new task a full minute to prove stability before proceeding
- `failure_action: rollback` automatically reverts if the update fails, preventing a bad deployment from taking down the service
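The capacity effect of the order setting can be seen with a little arithmetic. An illustrative sketch, assuming each update batch completes before the next begins:

```python
def min_capacity_during_update(replicas: int, parallelism: int, order: str) -> int:
    """Lowest healthy-replica count while an update batch is in transition."""
    if order == "start-first":
        return replicas            # replacement starts before the old task stops
    return replicas - parallelism  # stop-first removes the batch first

min_capacity_during_update(6, 1, "start-first")  # 6: full capacity held
min_capacity_during_update(6, 1, "stop-first")   # 5: one replica briefly missing
```

The trade-off: start-first needs enough spare node resources to run old and new tasks side by side for the duration of the monitor period.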
Load Balancer Integration
The Swarm ingress routing mesh provides basic load balancing, but production HA requires an external load balancer for several reasons:
- TLS termination at the edge
- Health checking of Swarm nodes themselves (not just containers)
- Geographic load balancing across regions
- Rate limiting and DDoS protection
HAProxy External Load Balancer
```
# /etc/haproxy/haproxy.cfg
global
    maxconn 50000
    log /dev/log local0

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httpchk GET /health
    option forwardfor

frontend http-in
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/wildcard.pem
    redirect scheme https code 301 if !{ ssl_fc }
    default_backend swarm_nodes

backend swarm_nodes
    balance roundrobin
    option httpchk GET /health HTTP/1.1\r\nHost:\ api.example.com
    http-check expect status 200
    # All Swarm nodes (routing mesh handles internal distribution)
    server node-01 10.0.1.10:80 check inter 5s fall 3 rise 2
    server node-02 10.0.1.11:80 check inter 5s fall 3 rise 2
    server node-03 10.0.1.12:80 check inter 5s fall 3 rise 2
    server node-04 10.0.2.10:80 check inter 5s fall 3 rise 2
    server node-05 10.0.2.11:80 check inter 5s fall 3 rise 2
    server node-06 10.0.2.12:80 check inter 5s fall 3 rise 2
```
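The check inter 5s fall 3 rise 2 parameters implement hysteresis: a node must fail three consecutive probes (about 15 seconds) to be marked down, and pass two to come back, so a single dropped probe never flaps the server. A small state-machine sketch of that logic (illustrative, not HAProxy source):

```python
def final_state(results, fall=3, rise=2, up=True):
    """Fold a sequence of probe results (True = pass) into the server state."""
    streak = 0
    for ok in results:
        if ok == up:
            streak = 0                           # result matches state: reset
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up, streak = not up, 0           # threshold reached: flip state
    return up

final_state([True, False, False, True, False])   # True: no 3-failure run
final_state([True, False, False, False])         # False: down after 3 straight fails
```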
Cloud Load Balancer (AWS ALB)
```hcl
# Terraform example: ALB targeting all Swarm worker nodes
resource "aws_lb" "swarm" {
  name               = "swarm-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

resource "aws_lb_target_group" "swarm" {
  name     = "swarm-targets"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 10
    timeout             = 5
  }
}

# Register all Swarm worker nodes
resource "aws_lb_target_group_attachment" "workers" {
  for_each = toset(var.worker_instance_ids)

  target_group_arn = aws_lb_target_group.swarm.arn
  target_id        = each.value
  port             = 80
}
```
Failure Scenarios and Recovery
Scenario 1: Single Worker Node Failure
| Phase | Duration | What Happens |
|---|---|---|
| Detection | 5-30 seconds | Manager detects node is unreachable |
| Task rescheduling | 5-10 seconds | Tasks are rescheduled to healthy nodes |
| Container startup | Application-dependent | New containers pull image and start |
| Health check pass | start_period + interval | Service is fully recovered |
Impact: Temporary reduction in capacity. With 6 replicas across 3 zones, losing 1 node drops capacity to ~66%. The external load balancer stops routing to the failed node within 15-30 seconds.
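The phases in the table add up to a worst-case recovery budget you can check your SLOs against. Illustrative arithmetic using the upper bounds from the table and the health-check values used in this guide; container_start is a placeholder for your own image pull and boot time:

```python
def worst_case_recovery(detection=30, reschedule=10, container_start=20,
                        start_period=30, interval=10):
    """Seconds from node loss until a replacement task passes its health check."""
    return detection + reschedule + container_start + start_period + interval

worst_case_recovery()  # 100 seconds with these assumed values
```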
Scenario 2: Single Manager Failure (Non-Leader)
Impact: None. The cluster continues operating with reduced fault tolerance. Running services are unaffected. Restore the manager node when possible to maintain quorum resilience.
Scenario 3: Leader Manager Failure
Impact: 1-5 second pause in cluster operations while a new leader is elected. Running containers are unaffected. New deployments and service updates are briefly delayed.
```bash
# Monitor leader election
docker events --filter type=node | grep "manager"

# Verify new leader was elected
docker node ls --filter role=manager \
  --format '{{.Hostname}} {{.ManagerStatus}}'
```
Scenario 4: Entire Availability Zone Failure
Impact with proper distribution: Loss of ~33% of nodes. Manager quorum preserved (2 out of 3 managers in remaining zones). Services with spread: node.labels.zone lose ~33% of replicas, which are rescheduled to remaining zones.
```bash
# Verify service distribution after zone loss
docker service ps myapp_api \
  --format "table {{.Name}}\t{{.Node}}\t{{.CurrentState}}"

# Scale up temporarily if remaining capacity is insufficient
docker service scale myapp_api=9
```
Scenario 5: Quorum Loss (Majority of Managers Down)
Impact: Cluster operations halt. Running containers continue running but cannot be managed. No new tasks can be scheduled, no updates can be performed.
```bash
# Recovery: force a new cluster from a surviving manager
docker swarm init --force-new-cluster --advertise-addr 10.0.1.10

# Add new managers immediately
docker swarm join-token manager
# Run the printed join command on the new manager nodes
```
HA Architecture Reference
```yaml
# Complete HA stack deployment
version: "3.8"

services:
  web:
    image: myapp/web:v2.1.0
    deploy:
      replicas: 4
      placement:
        constraints:
          - node.role == worker
        preferences:
          - spread: node.labels.zone
      update_config:
        parallelism: 1
        delay: 15s
        failure_action: rollback
        monitor: 30s
        order: start-first
      rollback_config:
        parallelism: 2
        delay: 5s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
        window: 120s
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
    ports:
      - "80:8080"
    networks:
      - frontend
      - backend

  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      placement:
        constraints:
          - node.role == worker
        preferences:
          - spread: node.labels.zone
      update_config:
        parallelism: 2
        delay: 20s
        failure_action: rollback
        monitor: 45s
        order: start-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
        window: 120s
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
        reservations:
          cpus: "0.5"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    networks:
      - backend
      - database
    secrets:
      - db_password
      - api_key

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    driver_opts:
      encrypted: "true"
  database:
    driver: overlay
    driver_opts:
      encrypted: "true"
    internal: true

secrets:
  db_password:
    external: true
  api_key:
    external: true
```
usulnet enhances Swarm HA by providing a management layer that monitors node health, service distribution, and replica counts across all nodes. When a failure occurs, usulnet shows exactly which services are affected, where tasks have been rescheduled, and whether the cluster has recovered to its desired state. This visibility is essential for operating HA infrastructure with confidence.
HA Checklist
- 3 or 5 managers distributed across availability zones
- Workers spread across zones with node labels for placement preferences
- Health checks on every service with appropriate start_period
- Restart policies with max_attempts to prevent restart storms
- Update configuration with start-first order and automatic rollback
- Resource reservations to prevent overcommitment
- External load balancer with health checks for TLS termination and node-level failover
- Monitoring and alerting for quorum status, node availability, and replica count mismatches
- Documented runbooks for every failure scenario listed above
- Regular failure testing: periodically drain nodes and kill managers to verify recovery
Conclusion
Docker Swarm HA is not a single feature you enable. It is the combination of proper manager distribution, service placement across failure domains, health checks that detect real failures, restart policies that recover from transient issues, update strategies that never reduce capacity, and external load balancers that route around node failures. Build each layer, test each failure scenario, and maintain runbooks for your on-call team. The goal is for node failures to be operational non-events that resolve themselves.