Deploying a new version of an application without dropping a single request is the gold standard of production operations. Docker Swarm makes this achievable out of the box with its rolling update mechanism. Unlike Kubernetes, where achieving true zero-downtime requires configuring readiness probes, PodDisruptionBudgets, and preStop hooks, Swarm's approach is simpler: define your update configuration, set a health check, and let the orchestrator handle the rest.

But "simple" does not mean "automatic." The default update settings are conservative to the point of being slow, and without proper health checks, Swarm will happily roll out a broken image across your entire cluster. This guide covers every knob you can turn, the patterns that work in practice, and the failure modes you need to prepare for.

Understanding Update Configuration

Every Swarm service has two configuration blocks that control deployment behavior: update_config and rollback_config. Here is every parameter and what it does:

| Parameter | Default | Description |
| --- | --- | --- |
| parallelism | 1 | Number of tasks to update simultaneously |
| delay | 0s | Wait time between updating batches |
| failure_action | pause | What to do when an update fails: pause, continue, or rollback |
| monitor | 5s | Duration to monitor a task after update before considering it stable |
| max_failure_ratio | 0 | Fraction of tasks that can fail before triggering failure_action |
| order | stop-first | stop-first (stop old, start new) or start-first (start new, stop old) |

The Critical Parameters

parallelism controls how many tasks are updated at once. For a service with 10 replicas and parallelism of 2, Swarm updates 2 tasks at a time. Higher parallelism means faster deployments but also a larger blast radius if something goes wrong.

delay is the pause between batches. Setting this to 10-30 seconds gives your monitoring system time to detect issues before the next batch begins.

failure_action: rollback is the most important setting for production. Without it, a failed update pauses and waits for manual intervention. With rollback, Swarm automatically reverts to the previous version.

order: start-first is essential for zero-downtime. It starts the new task before stopping the old one, ensuring there is always capacity to handle requests. The downside is that you temporarily run more tasks than your replica count, which requires available resources on the node.
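To size that headroom, note that each batch temporarily runs parallelism extra tasks on top of your replica count. A trivial sketch of the arithmetic (function name is mine):

```shell
# With order=start-first, peak concurrent tasks = replicas + parallelism,
# so nodes need CPU/memory headroom for "parallelism" extra tasks per batch.
peak_tasks() {
  echo $(( $1 + $2 ))   # args: replicas, parallelism
}

peak_tasks 6 2   # prints 8
```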

# Production-ready update configuration via CLI
docker service create \
  --name api \
  --replicas 6 \
  --update-parallelism 2 \
  --update-delay 15s \
  --update-failure-action rollback \
  --update-monitor 30s \
  --update-max-failure-ratio 0.25 \
  --update-order start-first \
  --rollback-parallelism 2 \
  --rollback-delay 5s \
  --rollback-monitor 10s \
  --rollback-max-failure-ratio 0 \
  --rollback-order stop-first \
  myapp/api:v1.0.0

Stack File Configuration

version: "3.8"

services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 6
      update_config:
        parallelism: 2
        delay: 15s
        failure_action: rollback
        monitor: 30s
        max_failure_ratio: 0.25
        order: start-first
      rollback_config:
        parallelism: 2
        delay: 5s
        monitor: 10s
        max_failure_ratio: 0
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s

Health Check Integration

Health checks are the linchpin of reliable rolling updates. Without them, Swarm considers a container healthy the moment it starts. A container that crashes during initialization, fails to connect to its database, or listens on the wrong port is counted as "running," and Swarm proceeds to update the next batch.

Writing Effective Health Checks

# Simple HTTP health check
healthcheck:
  test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
  interval: 10s
  timeout: 5s
  retries: 3
  start_period: 30s

# Health check that verifies database connectivity
healthcheck:
  test: ["CMD", "python", "-c",
    "import urllib.request; urllib.request.urlopen('http://localhost:8080/health/ready')"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 45s

# PostgreSQL health check
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s

# Redis health check
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 10s
  timeout: 3s
  retries: 3
Tip: Your health check endpoint should verify that the application is actually ready to serve traffic, not just that the process is running. A good /health/ready endpoint checks database connectivity, cache availability, and any other critical dependencies. Return HTTP 200 only when the service is fully operational.

Health Check Timing Matters

The start_period is the grace period during which health check failures do not count. This is critical for applications with slow startup (JVM warm-up, large dependency injection graphs, database migrations).

The formula for maximum time before Swarm declares a task unhealthy:

# Time to declare unhealthy = start_period + (interval * retries)
# Example: 30s + (10s * 3) = 60 seconds maximum startup time

# For a Java application with slow startup:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
  interval: 15s
  timeout: 10s
  retries: 5
  start_period: 120s
# Maximum startup: 120s + (15s * 5) = 195 seconds
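When tuning these values across many services, the arithmetic above is worth scripting. A minimal sketch (function name is mine; arguments are plain integers in seconds, so strip the "s" suffix from your config):

```shell
# max_unhealthy_seconds: worst-case time before Swarm marks a task unhealthy.
# Formula: start_period + (interval * retries)
max_unhealthy_seconds() {
  local start_period=$1 interval=$2 retries=$3
  echo $(( start_period + interval * retries ))
}

max_unhealthy_seconds 120 15 5   # Java example above: prints 195
max_unhealthy_seconds 30 10 3    # simple HTTP example: prints 60
```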

Performing a Rolling Update

# Update the image version
docker service update \
  --image myapp/api:v2.2.0 \
  myapp_api

# Monitor the update in real time
watch docker service ps myapp_api

# Or use --detach=false to wait for completion
docker service update \
  --image myapp/api:v2.2.0 \
  --detach=false \
  myapp_api

For stack deployments, simply update the image tag in your Compose file and redeploy:

# Edit docker-compose.yml: change image to v2.2.0
docker stack deploy -c docker-compose.yml myapp

What Happens During an Update

With parallelism: 2, delay: 15s, and order: start-first:

  1. Swarm starts 2 new tasks with the updated image
  2. Waits for them to pass health checks
  3. Once healthy, stops 2 old tasks
  4. Monitors the new tasks for the monitor duration (30s)
  5. If stable, waits delay seconds (15s)
  6. Repeats with the next batch of 2
  7. If any task fails during monitor, triggers failure_action
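The sequence above also implies a rough lower bound on total rollout time. A hedged sketch (function name is mine; it ignores image pulls and task startup time):

```shell
# estimate_rollout_seconds: rough lower bound on rolling-update duration.
# batches = ceil(replicas / parallelism); delay applies between batches only.
estimate_rollout_seconds() {
  local replicas=$1 parallelism=$2 monitor=$3 delay=$4
  local batches=$(( (replicas + parallelism - 1) / parallelism ))
  echo $(( batches * monitor + (batches - 1) * delay ))
}

estimate_rollout_seconds 6 2 30 15   # 3 batches: 3*30 + 2*15 = 120 seconds
```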

Rollback Procedures

Automatic Rollback

With failure_action: rollback, Swarm automatically reverts when an update fails. The rollback uses the rollback_config parameters:

# Swarm detects update failure and automatically rolls back
# You see this in service ps output:
docker service ps myapp_api
# ID         NAME              IMAGE           NODE       STATE
# abc123     myapp_api.1       api:v2.1.0      worker-01  Running     (rollback from v2.2.0)
# def456     myapp_api.2       api:v2.1.0      worker-02  Running     (rollback from v2.2.0)

Manual Rollback

# Roll back to the previous version
docker service rollback myapp_api

# Or specify an image explicitly
docker service update --image myapp/api:v2.1.0 myapp_api

Checking Update Status

# Check if an update is in progress
docker service inspect --pretty myapp_api | grep -A10 "UpdateStatus"

# Output:
# UpdateStatus:
#  State:         completed
#  Started:       2025-04-05T10:15:00.000000000Z
#  Completed:     2025-04-05T10:17:30.000000000Z
#  Message:       update completed

# For failed updates:
# UpdateStatus:
#  State:         rollback_completed
#  Started:       2025-04-05T10:15:00.000000000Z
#  Completed:     2025-04-05T10:16:00.000000000Z
#  Message:       rollback completed
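In a CI pipeline, it helps to translate that state string into an exit code so the job fails when a deploy does. A minimal sketch (function name is mine; the states are Swarm's documented UpdateStatus values):

```shell
# interpret_update_state: map Swarm UpdateStatus.State to a CI-friendly exit code.
interpret_update_state() {
  case "$1" in
    completed)                 echo "update succeeded";           return 0 ;;
    rollback_completed)        echo "update failed, rolled back"; return 1 ;;
    paused|rollback_paused)    echo "paused, needs operator";     return 2 ;;
    updating|rollback_started) echo "still in progress";          return 3 ;;
    *)                         echo "unknown state: $1";          return 4 ;;
  esac
}

# Usage on a manager node:
# interpret_update_state "$(docker service inspect myapp_api --format '{{.UpdateStatus.State}}')"
```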

Blue-Green Deployments with Swarm

While Swarm does not natively support blue-green deployments, you can implement the pattern using two separate services and a reverse proxy:

version: "3.8"

services:
  # Blue environment (current production)
  api-blue:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 4
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api-blue.loadbalancer.server.port=8080"
    networks:
      - proxy

  # Green environment (new version)
  api-green:
    image: myapp/api:v2.2.0
    deploy:
      replicas: 4
      labels:
        - "traefik.enable=false"  # Not receiving traffic yet
    networks:
      - proxy

networks:
  proxy:
    driver: overlay
# Step 1: Deploy green with traffic disabled
docker stack deploy -c docker-compose.yml myapp

# Step 2: Test the green environment from inside the overlay network
# (requires the proxy network to be created with "attachable: true")
docker run --rm --network myapp_proxy curlimages/curl -fsS http://api-green:8080/health

# Step 3: Switch traffic by updating labels
docker service update \
  --label-add "traefik.enable=true" \
  --label-add "traefik.http.routers.api.rule=Host(\`api.example.com\`)" \
  --label-add "traefik.http.services.api-green.loadbalancer.server.port=8080" \
  myapp_api-green

docker service update \
  --label-add "traefik.enable=false" \
  myapp_api-blue

# Step 4: Monitor, then remove blue
docker service rm myapp_api-blue

Canary Deployments

Canary deployments send a small percentage of traffic to the new version before rolling out fully. In Swarm, you can achieve this by running two services with different replica counts:

version: "3.8"

services:
  # Stable: 9 replicas (90% of traffic)
  api-stable:
    image: myapp/api:v2.1.0
    deploy:
      replicas: 9
    networks:
      - backend

  # Canary: 1 replica (10% of traffic)
  api-canary:
    image: myapp/api:v2.2.0
    deploy:
      replicas: 1
    networks:
      - backend

  # Load balancer routes to both services
  nginx:
    image: nginx:latest
    configs:
      - source: nginx_canary
        target: /etc/nginx/conf.d/default.conf
    ports:
      - "80:80"
    networks:
      - backend

networks:
  backend:
    driver: overlay

configs:
  nginx_canary:
    file: ./nginx-canary.conf
# nginx-canary.conf
upstream api {
    # Both services resolve via Swarm DNS
    # Weight is controlled by replica count
    server api-stable:8080;
    server api-canary:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://api;
    }
}
Tip: For more sophisticated canary deployments with percentage-based routing, use Traefik with weighted round-robin. Swarm's DNS-based routing distributes traffic roughly proportional to replica count, which gives you coarse-grained canary control without external tools.
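As a sketch of that Traefik approach (the file name and service references are assumptions; check the weighted round-robin section of your Traefik version's documentation), a dynamic configuration file might look like:

```yaml
# traefik-dynamic.yml - hypothetical weighted canary split (Traefik v2+ syntax)
http:
  services:
    api-weighted:
      weighted:
        services:
          - name: api-stable@docker   # ~90% of requests
            weight: 9
          - name: api-canary@docker   # ~10% of requests
            weight: 1
  routers:
    api:
      rule: Host(`api.example.com`)
      service: api-weighted
```

Unlike the replica-count approach, the weights here are independent of how many replicas each service runs.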

Canary Promotion Script

#!/bin/bash
# canary-promote.sh - Gradually shift traffic to the canary
set -euo pipefail

STABLE_SERVICE="myapp_api-stable"
CANARY_SERVICE="myapp_api-canary"
TOTAL_REPLICAS=10

# Phase 1: 10% canary
echo "Phase 1: 10% canary"
docker service scale "$STABLE_SERVICE=9" "$CANARY_SERVICE=1"
echo "Waiting 5 minutes for monitoring..."
sleep 300

# Check error rate (integrate with your monitoring; fall back to 0 when the query returns no data)
ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total{service="canary"}[5m])' | jq -r '.data.result[0].value[1] // "0"')

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "Error rate too high ($ERROR_RATE). Rolling back."
  docker service scale "$STABLE_SERVICE=10" "$CANARY_SERVICE=0"
  exit 1
fi

# Phase 2: 50% canary
echo "Phase 2: 50% canary"
docker service scale "$STABLE_SERVICE=5" "$CANARY_SERVICE=5"
sleep 300
# Re-run the error-rate check here before proceeding to full promotion

# Phase 3: Full promotion
echo "Phase 3: Full promotion"
docker service update --image myapp/api:v2.2.0 "$STABLE_SERVICE"
docker service scale "$CANARY_SERVICE=0"

echo "Canary promotion complete."

Monitoring Updates

# Watch update progress in real time
watch -n 2 'docker service ps myapp_api --format "table {{.Name}}\t{{.Image}}\t{{.CurrentState}}\t{{.Error}}" | head -20'

# Check for tasks in a failed state
docker service ps myapp_api --filter "desired-state=shutdown" \
  --format "{{.Name}} {{.Error}}"

# Get update status
docker service inspect myapp_api \
  --format '{{.UpdateStatus.State}} - {{.UpdateStatus.Message}}'

# Event stream for a service
docker events --filter type=service --filter service=myapp_api
Warning: If a rolling update appears stuck, check whether new tasks are failing health checks. A common cause is the new image version requiring an environment variable or secret that was not added to the updated service definition. Use docker service ps --no-trunc myapp_api to see full error messages.

Production Update Strategy Matrix

| Service Type | Parallelism | Delay | Order | Failure Action |
| --- | --- | --- | --- | --- |
| Stateless API (6+ replicas) | 2 | 15s | start-first | rollback |
| Web frontend | 2 | 10s | start-first | rollback |
| Background worker | 1 | 30s | stop-first | rollback |
| Database (single replica) | 1 | 0s | stop-first | pause |
| Global service (monitoring) | 1 | 30s | stop-first | pause |

Platforms like usulnet give you a visual deployment dashboard where you can watch rolling updates progress across your Swarm cluster in real time, trigger rollbacks with a single click, and see the health status of every task during the update process. This beats watching docker service ps output in a terminal.

Conclusion

Zero-downtime deployments in Docker Swarm require three things: proper update configuration, reliable health checks, and a rollback strategy. Set failure_action: rollback, use order: start-first for stateless services, and always define health checks that verify actual readiness, not just process liveness. With these in place, you can deploy with confidence, knowing that Swarm will automatically revert if anything goes wrong.