Docker Swarm is Docker's native clustering and orchestration solution, built directly into the Docker Engine. Unlike external orchestrators that require separate installation and complex configuration, Swarm mode is activated with a single command and leverages the same Docker CLI you already know. For teams that want container orchestration without the operational overhead of Kubernetes, Swarm provides a compelling path to multi-node deployments.

This tutorial walks through every aspect of Docker Swarm: from initializing your first cluster to deploying production-grade stacks with rolling updates, service discovery, and overlay networking. By the end, you will have a fully functional Swarm cluster capable of running real workloads.

Understanding Swarm Architecture

A Docker Swarm cluster consists of two types of nodes:

  • Manager: maintains cluster state, schedules services, and serves the Swarm API. Recommended count: 3 or 5 (an odd number, for Raft consensus).
  • Worker: executes the containers assigned by managers. Recommended count: as many as your workload requires.

Manager nodes use the Raft consensus algorithm to maintain a consistent cluster state. With three managers, the cluster tolerates one manager failure; with five, it tolerates two. Avoid an even number of managers: it buys no extra fault tolerance (four managers still tolerate only one failure, just like three) while making quorum loss more likely.
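
The quorum arithmetic is easy to sanity-check in a shell (nothing Swarm-specific here, just the Raft majority formula):

```shell
# Raft majority math: a cluster of N managers needs floor(N/2) + 1 votes
# (quorum) and therefore tolerates floor((N - 1) / 2) manager failures.
for n in 3 5 7; do
  echo "$n managers: quorum $(( n / 2 + 1 )), tolerates $(( (n - 1) / 2 )) failure(s)"
done
```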

Important: Manager nodes also run workloads by default. In production, you may want to drain managers so they only handle orchestration duties, especially in larger clusters.

Prerequisites

Before initializing your Swarm, ensure the following on all nodes:

  • Docker Engine 19.03 or later installed (Swarm mode is built in)
  • The following ports open between all nodes:
    • 2377/tcp — Cluster management communications
    • 7946/tcp + 7946/udp — Node-to-node communication
    • 4789/udp — Overlay network traffic (VXLAN)
  • Stable hostnames or static IPs for manager nodes
  • Time synchronized across all nodes (NTP)
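
If your hosts use ufw, the port list above can be opened with a short loop. A dry-run sketch (it only prints the commands; drop the echo to apply them, and adapt for firewalld or raw iptables):

```shell
# Print the ufw rules Swarm needs between cluster nodes (dry run).
for rule in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
  echo ufw allow "$rule"
done
```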

Initializing the Swarm

On your first manager node, initialize the Swarm:

# Initialize on the first manager node
docker swarm init --advertise-addr 192.168.1.10

# Output:
# Swarm initialized: current node (abc123def) is now a manager.
# To add a worker to this swarm, run the following command:
#   docker swarm join --token SWMTKN-1-abc123... 192.168.1.10:2377
# To add a manager to this swarm, run 'docker swarm join-token manager'

The --advertise-addr flag specifies the address other nodes will use to connect. This is critical on multi-homed servers. If your node has only one IP, Docker will auto-detect it.

Adding Worker Nodes

On each worker machine, run the join command provided during initialization:

# On each worker node
docker swarm join --token SWMTKN-1-abc123... 192.168.1.10:2377

# If you lost the token, retrieve it from any manager:
docker swarm join-token worker

Adding Additional Manager Nodes

# Get the manager join token from an existing manager
docker swarm join-token manager

# On the new manager node
docker swarm join --token SWMTKN-1-mgr456... 192.168.1.10:2377

Verifying the Cluster

# List all nodes (run on a manager)
docker node ls

# ID                           HOSTNAME    STATUS  AVAILABILITY  MANAGER STATUS
# abc123def *                  manager-1   Ready   Active        Leader
# def456ghi                    manager-2   Ready   Active        Reachable
# ghi789jkl                    manager-3   Ready   Active        Reachable
# jkl012mno                    worker-1    Ready   Active
# mno345pqr                    worker-2    Ready   Active

Deploying Your First Service

Services are the fundamental deployment unit in Swarm. A service defines which container image to run, how many replicas to maintain, and how to expose the application to the network.

# Create a simple nginx service with 3 replicas
docker service create \
  --name web \
  --replicas 3 \
  --publish published=80,target=80 \
  nginx:alpine

# List running services
docker service ls

# ID            NAME  MODE        REPLICAS  IMAGE
# r5s3k7p2q1    web   replicated  3/3       nginx:alpine

# See where replicas are running
docker service ps web

# ID            NAME    IMAGE          NODE       DESIRED STATE  CURRENT STATE
# a1b2c3d4e5    web.1   nginx:alpine   worker-1   Running        Running 30 seconds ago
# f6g7h8i9j0    web.2   nginx:alpine   worker-2   Running        Running 28 seconds ago
# k1l2m3n4o5    web.3   nginx:alpine   manager-1  Running        Running 29 seconds ago

Scaling Services

# Scale up to 5 replicas
docker service scale web=5

# Scale multiple services at once
docker service scale web=5 api=3 cache=2

# Watch the scaling happen in real-time
docker service ps web --filter "desired-state=running"

Inspecting Services

# Detailed service information
docker service inspect --pretty web

# View service logs (aggregated from all replicas)
docker service logs web
docker service logs web --follow --tail 100

Rolling Updates

Swarm provides built-in rolling updates that replace containers one (or more) at a time, with configurable delays and failure thresholds:

# Update the image with a rolling update
docker service update \
  --image nginx:1.25-alpine \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  web

These parameters mean:

  • --update-parallelism 2 — Update 2 replicas at a time
  • --update-delay 10s — Wait 10 seconds between batches
  • --update-failure-action rollback — Automatically roll back if the update fails
  • --update-max-failure-ratio 0.25 — Tolerate up to 25% failed tasks before the failure action triggers

# Manually roll back to the previous version
docker service rollback web

# Check rollback status
docker service ps web

Tip: Always set --update-failure-action rollback in production, and pair it with a health check. Without a health check, a bad image that still starts "successfully" counts as a successful update and will replace your healthy containers batch by batch until the entire service is down.

Overlay Networking

Overlay networks enable containers on different nodes to communicate as if they were on the same local network. Swarm handles the VXLAN encapsulation transparently.

# Create an overlay network
docker network create \
  --driver overlay \
  --subnet 10.0.10.0/24 \
  --attachable \
  app-network

# Create services on the same overlay network
docker service create \
  --name api \
  --network app-network \
  --replicas 3 \
  myapp/api:latest

docker service create \
  --name postgres \
  --network app-network \
  --replicas 1 \
  --mount type=volume,source=pgdata,target=/var/lib/postgresql/data \
  postgres:16-alpine

The --attachable flag allows standalone containers (not just services) to join the overlay network, which is useful during debugging.
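
Because the network is attachable, a throwaway container can join it for debugging. A sketch, wrapped in a function so it is safe to paste (the alpine image choice is an assumption):

```shell
# Open an interactive shell on the overlay network for debugging.
# Requires a live swarm and the attachable network created above.
overlay_shell() {
  docker run --rm -it --network app-network alpine sh
}

# Inside the container you can resolve and probe services by name:
#   nslookup api
#   wget -qO- http://api:8080/health
```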

Service Discovery

Swarm provides built-in DNS-based service discovery. Every service gets a DNS entry that resolves to the virtual IP (VIP) of the service, which load-balances across all healthy replicas:

# From inside any container on the same network:
# 'postgres' resolves to the VIP of the postgres service
# 'api' resolves to the VIP of the api service

# You can also resolve the individual task IPs behind a service
# via the tasks.<service-name> DNS name
nslookup tasks.api
# Returns one A record per replica

Ingress Routing Mesh

When you publish a port, Swarm creates an ingress routing mesh. Any node in the cluster can accept traffic on that port, even if it is not running a replica of the service. The mesh routes the request to an available replica:

# Published port 80 is accessible on ALL nodes
docker service create \
  --name web \
  --publish published=80,target=80 \
  --replicas 3 \
  nginx:alpine

# Hitting ANY node IP on port 80 reaches the service:
curl http://192.168.1.10  # manager-1
curl http://192.168.1.11  # manager-2
curl http://192.168.1.20  # worker-1 (even if no replica runs here)

To bypass the mesh and bind directly to the host's port, publish with mode=host. This pairs naturally with --mode global (one task per node), since only one container can bind a given host port on each node:

docker service create \
  --name web-direct \
  --publish published=80,target=80,mode=host \
  --mode global \
  nginx:alpine

Stack Deploy with Compose Files

For production deployments, define your entire application in a Compose file and deploy it as a stack. This is the recommended way to manage Swarm services:

# docker-stack.yml
version: "3.8"

services:
  web:
    image: myapp/web:2.1.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      rollback_config:
        parallelism: 1
        delay: 5s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
        reservations:
          cpus: "0.25"
          memory: 128M
      placement:
        constraints:
          - node.role == worker
    ports:
      - "80:8080"
    networks:
      - frontend
      - backend

  api:
    image: myapp/api:2.1.0
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 15s
        failure_action: rollback
      placement:
        constraints:
          - node.role == worker
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/myapp
      REDIS_URL: redis://cache:6379
    networks:
      - backend
    secrets:
      - db_password
      - api_key

  db:
    image: postgres:16-alpine
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.storage == ssd
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    networks:
      - backend
    secrets:
      - db_password

  cache:
    image: redis:7-alpine
    deploy:
      replicas: 1
    networks:
      - backend

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    internal: true

volumes:
  pgdata:

secrets:
  db_password:
    external: true
  api_key:
    external: true

# Create secrets first
echo "supersecretpassword" | docker secret create db_password -
echo "my-api-key-value" | docker secret create api_key -

# Deploy the stack
docker stack deploy -c docker-stack.yml myapp

# List stacks
docker stack ls

# List services in a stack
docker stack services myapp

# View tasks across all services
docker stack ps myapp

# Remove a stack
docker stack rm myapp
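
One wrinkle: docker secret create fails if the secret already exists (secrets are immutable), which breaks naive redeploy scripts. A small guard helper, sketched with the names used above:

```shell
# Create a secret only if it does not already exist, so a deploy
# script can be re-run safely. Requires a live swarm to actually run.
ensure_secret() {
  docker secret inspect "$1" >/dev/null 2>&1 ||
    printf '%s' "$2" | docker secret create "$1" - >/dev/null
}

# Usage:
#   ensure_secret db_password "supersecretpassword"
#   ensure_secret api_key "my-api-key-value"
#   docker stack deploy -c docker-stack.yml myapp
```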

Placement Constraints and Preferences

Control where services run using labels and constraints:

# Label nodes
docker node update --label-add storage=ssd worker-1
docker node update --label-add storage=hdd worker-2
docker node update --label-add region=us-east worker-1
docker node update --label-add region=us-west worker-2

# Constrain service to SSD nodes
docker service create \
  --name db \
  --constraint 'node.labels.storage == ssd' \
  postgres:16

# Spread across regions (soft preference)
docker service create \
  --name web \
  --replicas 4 \
  --placement-pref 'spread=node.labels.region' \
  nginx:alpine

Managing Secrets and Configs

Swarm provides encrypted secret management and configuration objects:

# Create a secret from a file
docker secret create tls_cert ./server.crt
docker secret create tls_key ./server.key

# Create a config object
docker config create nginx_conf ./nginx.conf

# Use in a service
docker service create \
  --name proxy \
  --secret tls_cert \
  --secret tls_key \
  --config source=nginx_conf,target=/etc/nginx/nginx.conf \
  nginx:alpine

# Secrets are mounted at /run/secrets/ inside the container
# They are stored encrypted in the Raft log and only sent to nodes
# that need them

Warning: Secrets are only available to Swarm services, not standalone containers. If you try docker run with --secret, it will fail; use docker service create or docker stack deploy instead.

Health Checks and Self-Healing

Swarm uses health checks to determine whether a container is ready to receive traffic. Unhealthy containers are stopped and replaced automatically:

docker service create \
  --name api \
  --replicas 3 \
  --health-cmd "curl -f http://localhost:8080/health || exit 1" \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 30s \
  myapp/api:latest

The --health-start-period gives the container time to start up before health checks are counted against it. This is critical for applications with slow startup times like Java services.
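
In a stack file, the same flags map onto a healthcheck block. A sketch using the service from above (start_period requires Compose file format 3.4 or later):

```yaml
services:
  api:
    image: myapp/api:latest
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    deploy:
      replicas: 3
```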

Draining Nodes for Maintenance

# Drain a node (existing tasks are moved to other nodes)
docker node update --availability drain worker-1

# Perform maintenance on worker-1...

# Bring it back
docker node update --availability active worker-1

# Pause a node (no new tasks, existing tasks keep running)
docker node update --availability pause worker-2

Monitoring Your Swarm

Maintaining visibility into your Swarm cluster is essential. Use the built-in commands alongside external monitoring tools:

# Cluster-wide view
docker node ls
docker service ls
docker stack ps myapp --filter "desired-state=running"

# Tasks running on a specific node
docker node ps worker-1

# Service-level logs
docker service logs myapp_web --since 1h --follow

# System-wide events
docker events --filter type=service --since 1h

For production clusters, tools like usulnet provide a centralized dashboard where you can monitor all Swarm services, view logs, and manage deployments across multiple nodes without switching between terminal sessions.

Production Hardening Checklist

  1. Use an odd number of managers (3 for most clusters, 5 for large deployments)
  2. Drain manager nodes in clusters with more than 5 total nodes
  3. Enable autolock to encrypt the Raft log at rest:
    docker swarm update --autolock=true
    # Save the unlock key securely!
    docker swarm unlock-key
  4. Rotate join tokens periodically:
    docker swarm join-token --rotate worker
    docker swarm join-token --rotate manager
  5. Set resource limits on all services to prevent noisy neighbors
  6. Use overlay networks with encryption:
    docker network create --driver overlay --opt encrypted secure-net
  7. Implement health checks on every service
  8. Use secrets instead of environment variables for sensitive data
  9. Back up the Swarm state regularly (stop Docker first so the on-disk Raft data is consistent):
    sudo tar czf swarm-backup.tar.gz /var/lib/docker/swarm
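
The reverse procedure, restoring a manager from that archive, follows Docker's documented recovery flow. A sketch, wrapped in a function because it is destructive (assumes systemd and the backup file from step 9; review before running on a real manager):

```shell
# DESTRUCTIVE: replaces the local Swarm state with the backup, then
# re-initializes a single-manager cluster from the restored state.
restore_swarm() {
  systemctl stop docker
  rm -rf /var/lib/docker/swarm
  tar xzf swarm-backup.tar.gz -C /
  systemctl start docker
  docker swarm init --force-new-cluster
}
```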

Common Pitfalls

  • Service stuck at 0/N replicas: usually an image pull failure or a placement constraint no node can satisfy. Run docker service ps --no-trunc <service> to see the error message.
  • Overlay network unreachable: a firewall is blocking 4789/udp. Open the VXLAN port between all nodes.
  • Cluster lost quorum: a majority of managers is down. Recover on a surviving manager with docker swarm init --force-new-cluster.
  • Tasks keep restarting: the application is crashing or being OOM-killed. Check the logs and raise the memory limit.
  • Stack deploy hangs: a referenced external secret or config does not exist. Create them before deploying.

Conclusion

Docker Swarm remains a powerful and underappreciated orchestration platform. Its tight integration with Docker, zero-dependency setup, and intuitive service model make it an excellent choice for teams that need multi-node container orchestration without the complexity of Kubernetes. For small to medium clusters—especially those already invested in the Docker ecosystem—Swarm delivers production-grade orchestration with a fraction of the operational overhead.

Start with a three-node cluster, deploy your first stack, and iterate from there. The Compose-native workflow means you can reuse your existing development Compose files with minimal modifications for Swarm deployment.