Docker Swarm Rolling Updates: Zero-Downtime Deployments Made Simple
Deploying a new version of an application without dropping a single request is the gold standard of production operations. Docker Swarm makes this achievable out of the box with its rolling update mechanism. Unlike Kubernetes, where achieving true zero-downtime requires configuring readiness probes, PodDisruptionBudgets, and preStop hooks, Swarm's approach is simpler: define your update configuration, set a health check, and let the orchestrator handle the rest.
But "simple" does not mean "automatic." The default update settings are conservative to the point of being slow, and without proper health checks, Swarm will happily roll out a broken image across your entire cluster. This guide covers every knob you can turn, the patterns that work in practice, and the failure modes you need to prepare for.
Understanding Update Configuration
Every Swarm service has two configuration blocks that control deployment behavior: update_config and rollback_config. Here is every parameter and what it does:
| Parameter | Default | Description |
|---|---|---|
| parallelism | 1 | Number of tasks to update simultaneously |
| delay | 0s | Wait time between updating batches |
| failure_action | pause | What to do when an update fails: pause, continue, or rollback |
| monitor | 5s | Duration to monitor a task after update before considering it stable |
| max_failure_ratio | 0 | Fraction of tasks that can fail before triggering failure_action |
| order | stop-first | stop-first (stop old, start new) or start-first (start new, stop old) |
The Critical Parameters
parallelism controls how many tasks are updated at once. For a service with 10 replicas and parallelism of 2, Swarm updates 2 tasks at a time. Higher parallelism means faster deployments but also a larger blast radius if something goes wrong.
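The batch count follows directly from these two numbers. A quick sketch (the replica and parallelism values are illustrative):

```shell
# Number of update batches Swarm will run: ceil(replicas / parallelism).
replicas=10
parallelism=2
batches=$(( (replicas + parallelism - 1) / parallelism ))
echo "$batches"   # prints 5: five batches of 2 tasks each
```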
delay is the pause between batches. Setting this to 10-30 seconds gives your monitoring system time to detect issues before the next batch begins.
failure_action: rollback is the most important setting for production. Without it, a failed update pauses and waits for manual intervention. With rollback, Swarm automatically reverts to the previous version.
order: start-first is essential for zero-downtime. It starts the new task before stopping the old one, ensuring there is always capacity to handle requests. The downside is that you temporarily run more tasks than your replica count, which requires available resources on the node.
```bash
# Production-ready update configuration via CLI
docker service create \
  --name api \
  --replicas 6 \
  --update-parallelism 2 \
  --update-delay 15s \
  --update-failure-action rollback \
  --update-monitor 30s \
  --update-max-failure-ratio 0.25 \
  --update-order start-first \
  --rollback-parallelism 2 \
  --rollback-delay 5s \
  --rollback-monitor 10s \
  --rollback-max-failure-ratio 0 \
  --rollback-order stop-first \
  myapp/api:v1.0.0
```
Stack File Configuration
```yaml
version: "3.8"

services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 2
        delay: 15s
        failure_action: rollback
        monitor: 30s
        max_failure_ratio: 0.25
        order: start-first
      rollback_config:
        parallelism: 2
        delay: 5s
        monitor: 10s
        max_failure_ratio: 0
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
```
Health Check Integration
Health checks are the linchpin of reliable rolling updates. Without them, Swarm considers a container healthy the moment it starts. A container that crashes during initialization, fails to connect to its database, or listens on the wrong port will be counted as "running" and Swarm will proceed to update the next batch.
Writing Effective Health Checks
```yaml
# Simple HTTP health check
healthcheck:
  test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
  interval: 10s
  timeout: 5s
  retries: 3
  start_period: 30s
```

```yaml
# Health check that verifies database connectivity
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health/ready')"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 45s
```

```yaml
# PostgreSQL health check
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

```yaml
# Redis health check
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 10s
  timeout: 3s
  retries: 3
```
The /health/ready endpoint should check database connectivity, cache availability, and any other critical dependencies, and return HTTP 200 only when the service is fully operational.
Health Check Timing Matters
The start_period is the grace period during which health check failures do not count. This is critical for applications with slow startup (JVM warm-up, large dependency injection graphs, database migrations).
The formula for maximum time before Swarm declares a task unhealthy:
```yaml
# Time to declare unhealthy = start_period + (interval * retries)
# Example: 30s + (10s * 3) = 60 seconds maximum startup time

# For a Java application with slow startup:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
  interval: 15s
  timeout: 10s
  retries: 5
  start_period: 120s
# Maximum startup: 120s + (15s * 5) = 195 seconds
```
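The same formula is easy to keep in a deploy script so you can sanity-check timing values before applying them (a small sketch; the argument order is start_period, interval, retries, all in seconds):

```shell
# Worst-case time before Swarm marks a task unhealthy:
# start_period + interval * retries.
max_unhealthy_time() {
  echo $(( $1 + $2 * $3 ))
}

max_unhealthy_time 30 10 3    # prints 60
max_unhealthy_time 120 15 5   # prints 195
```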
Performing a Rolling Update
```bash
# Update the image version
docker service update \
  --image myapp/api:v2.2.0 \
  myapp_api

# Monitor the update in real time
watch docker service ps myapp_api

# Or use --detach=false to wait for completion
docker service update \
  --image myapp/api:v2.2.0 \
  --detach=false \
  myapp_api
```
For stack deployments, simply update the image tag in your Compose file and redeploy:
```bash
# Edit docker-compose.yml: change image to v2.2.0
docker stack deploy -c docker-compose.yml myapp
```
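If you script the redeploy in CI, the tag edit is a one-line sed. A minimal sketch, assuming tags follow a v-prefixed numeric scheme like v2.1.0 (adjust the pattern for your own scheme):

```shell
# Bump the image tag on a compose-file line before redeploying.
line='    image: myapp/api:v2.1.0'
bumped=$(printf '%s\n' "$line" | sed 's|:v[0-9][0-9.]*|:v2.2.0|')
echo "$bumped"   # prints "    image: myapp/api:v2.2.0"
```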
What Happens During an Update
With parallelism: 2, delay: 15s, and order: start-first:
- Swarm starts 2 new tasks with the updated image
- Waits for them to pass health checks
- Once healthy, stops 2 old tasks
- Monitors the new tasks for the monitor duration (30s)
- If stable, waits the delay period (15s)
- Repeats with the next batch of 2
- If any task fails during monitor, triggers failure_action
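The steps above also give you a rough lower bound on how long a rollout takes, ignoring container start and health-check time (values here mirror the example configuration):

```shell
# Lower bound on rollout duration:
# batches * monitor + (batches - 1) * delay, in seconds.
replicas=6; parallelism=2; monitor=30; delay=15
batches=$(( (replicas + parallelism - 1) / parallelism ))
total=$(( batches * monitor + (batches - 1) * delay ))
echo "${total}s"   # prints 120s
```

Real rollouts run longer, since each batch also waits for new tasks to start and pass their health checks.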
Rollback Procedures
Automatic Rollback
With failure_action: rollback, Swarm automatically reverts when an update fails. The rollback uses the rollback_config parameters:
```bash
# Swarm detects update failure and automatically rolls back.
# You see this in service ps output:
docker service ps myapp_api

# ID      NAME         IMAGE       NODE       STATE
# abc123  myapp_api.1  api:v2.1.0  worker-01  Running (rollback from v2.2.0)
# def456  myapp_api.2  api:v2.1.0  worker-02  Running (rollback from v2.2.0)
```
Manual Rollback
```bash
# Roll back to the previous version
docker service rollback myapp_api

# Or specify an image explicitly
docker service update --image myapp/api:v2.1.0 myapp_api
```
Checking Update Status
```bash
# Check if an update is in progress
docker service inspect --pretty myapp_api | grep -A10 "UpdateStatus"

# Output:
# UpdateStatus:
#  State:      completed
#  Started:    2025-04-05T10:15:00.000000000Z
#  Completed:  2025-04-05T10:17:30.000000000Z
#  Message:    update completed

# For failed updates:
# UpdateStatus:
#  State:      rollback_completed
#  Started:    2025-04-05T10:15:00.000000000Z
#  Completed:  2025-04-05T10:16:00.000000000Z
#  Message:    rollback completed
```
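If a deploy script needs to branch on that state, you can extract it without jq. A minimal sketch using sed; the sample string stands in for real `docker service inspect myapp_api` JSON output, which you would pipe in instead:

```shell
# Extract UpdateStatus.State from inspect-style JSON.
sample='{"UpdateStatus":{"State":"rollback_completed","Message":"rollback completed"}}'
state=$(printf '%s\n' "$sample" | sed -n 's/.*"State":"\([^"]*\)".*/\1/p')
echo "$state"   # prints rollback_completed
```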
Blue-Green Deployments with Swarm
While Swarm does not natively support blue-green deployments, you can implement the pattern using two separate services and a reverse proxy:
```yaml
version: "3.8"

services:
  # Blue environment (current production)
  api-blue:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 4
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api-blue.loadbalancer.server.port=8080"
    networks:
      - proxy

  # Green environment (new version)
  api-green:
    image: myapp/api:v2.2.0
    deploy:
      replicas: 4
      labels:
        - "traefik.enable=false" # Not receiving traffic yet
    networks:
      - proxy

networks:
  proxy:
    driver: overlay
```
```bash
# Step 1: Deploy green with traffic disabled
docker stack deploy -c docker-compose.yml myapp

# Step 2: Test the green environment directly
# (run from a container attached to the proxy network; the api-green
# name only resolves via Swarm DNS inside that network)
curl http://api-green:8080/health

# Step 3: Switch traffic by updating labels
docker service update \
  --label-add "traefik.enable=true" \
  --label-add "traefik.http.routers.api.rule=Host(\`api.example.com\`)" \
  myapp_api-green

docker service update \
  --label-add "traefik.enable=false" \
  myapp_api-blue

# Step 4: Monitor, then remove blue
docker service rm myapp_api-blue
```
Canary Deployments
Canary deployments send a small percentage of traffic to the new version before rolling out fully. In Swarm, you can implement the pattern by running the stable and canary versions as two separate services behind a load balancer that splits traffic between them:
```yaml
version: "3.8"

services:
  # Stable: 9 replicas (90% of traffic)
  api-stable:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 9
    networks:
      - backend

  # Canary: 1 replica (10% of traffic)
  api-canary:
    image: myapp/api:v2.2.0
    deploy:
      replicas: 1
    networks:
      - backend

  # Load balancer routes to both services
  nginx:
    image: nginx:latest
    configs:
      - source: nginx_canary
        target: /etc/nginx/conf.d/default.conf
    ports:
      - "80:80"
    networks:
      - backend

networks:
  backend:
    driver: overlay

configs:
  nginx_canary:
    file: ./nginx-canary.conf
```
```nginx
# nginx-canary.conf
upstream api {
    # Each service name resolves via Swarm DNS to a single VIP, so nginx
    # would round-robin 50/50 between the two entries regardless of
    # replica counts. Explicit weights produce the intended 90/10 split;
    # keep them in step with the replica counts when shifting traffic.
    server api-stable:8080 weight=9;
    server api-canary:8080 weight=1;
}

server {
    listen 80;

    location / {
        proxy_pass http://api;
    }
}
```
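Under weighted round robin, each upstream's share of traffic is its weight divided by the sum of all weights. A quick sanity check of the ratio (weights here are illustrative):

```shell
# Canary traffic share = canary_weight / total weight, as a percentage.
stable_weight=9; canary_weight=1
total_weight=$(( stable_weight + canary_weight ))
canary_pct=$(( canary_weight * 100 / total_weight ))
echo "${canary_pct}%"   # prints 10%
```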
Canary Promotion Script
```bash
#!/bin/bash
# canary-promote.sh - Gradually shift traffic to the canary.
# Note: replica counts size capacity; if your load balancer weights
# upstreams (see nginx-canary.conf), adjust those weights in step
# with each phase as well.
set -euo pipefail

STABLE_SERVICE="myapp_api-stable"
CANARY_SERVICE="myapp_api-canary"

# Phase 1: 10% canary
echo "Phase 1: 10% canary"
docker service scale "$STABLE_SERVICE=9" "$CANARY_SERVICE=1"
echo "Waiting 5 minutes for monitoring..."
sleep 300

# Check error rate (integrate with your monitoring);
# default to 0 if the query returns no series
ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total{service="canary"}[5m])' \
  | jq -r '.data.result[0].value[1] // "0"')

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "Error rate too high ($ERROR_RATE). Rolling back."
  docker service scale "$STABLE_SERVICE=10" "$CANARY_SERVICE=0"
  exit 1
fi

# Phase 2: 50% canary
echo "Phase 2: 50% canary"
docker service scale "$STABLE_SERVICE=5" "$CANARY_SERVICE=5"
sleep 300

# Phase 3: Full promotion
echo "Phase 3: Full promotion"
docker service update --image myapp/api:v2.2.0 "$STABLE_SERVICE"
docker service scale "$CANARY_SERVICE=0"
echo "Canary promotion complete."
```
Monitoring Updates
```bash
# Watch update progress in real time
watch -n 2 'docker service ps myapp_api --format "table {{.Name}}\t{{.Image}}\t{{.CurrentState}}\t{{.Error}}" | head -20'

# Check for tasks in a failed state
docker service ps myapp_api --filter "desired-state=shutdown" \
  --format "{{.Name}} {{.Error}}"

# Get update status
docker service inspect myapp_api \
  --format '{{.UpdateStatus.State}} - {{.UpdateStatus.Message}}'

# Event stream for a service
docker events --filter type=service --filter service=myapp_api
```
Use docker service ps --no-trunc myapp_api to see full error messages.
Production Update Strategy Matrix
| Service Type | Parallelism | Delay | Order | Failure Action |
|---|---|---|---|---|
| Stateless API (6+ replicas) | 2 | 15s | start-first | rollback |
| Web frontend | 2 | 10s | start-first | rollback |
| Background worker | 1 | 30s | stop-first | rollback |
| Database (single replica) | 1 | 0s | stop-first | pause |
| Global service (monitoring) | 1 | 30s | stop-first | pause |
Platforms like usulnet give you a visual deployment dashboard where you can watch rolling updates progress across your Swarm cluster in real time, trigger rollbacks with a single click, and see the health status of every task during the update process. This beats watching docker service ps output in a terminal.
Conclusion
Zero-downtime deployments in Docker Swarm require three things: proper update configuration, reliable health checks, and a rollback strategy. Set failure_action: rollback, use order: start-first for stateless services, and always define health checks that verify actual readiness, not just process liveness. With these in place, you can deploy with confidence, knowing that Swarm will automatically revert if anything goes wrong.