Docker Swarm Tutorial: Complete Guide to Container Orchestration
Docker Swarm is Docker's native clustering and orchestration solution, built directly into the Docker Engine. Unlike external orchestrators that require separate installation and complex configuration, Swarm mode is activated with a single command and leverages the same Docker CLI you already know. For teams that want container orchestration without the operational overhead of Kubernetes, Swarm provides a compelling path to multi-node deployments.
This tutorial walks through every aspect of Docker Swarm: from initializing your first cluster to deploying production-grade stacks with rolling updates, service discovery, and overlay networking. By the end, you will have a fully functional Swarm cluster capable of running real workloads.
Understanding Swarm Architecture
A Docker Swarm cluster consists of two types of nodes:
| Node Type | Role | Recommended Count |
|---|---|---|
| Manager | Maintains cluster state, schedules services, serves the Swarm API | 3 or 5 (odd number for Raft consensus) |
| Worker | Executes containers assigned by managers | As many as your workload requires |
Manager nodes use the Raft consensus algorithm to maintain a consistent cluster state. With three managers, the cluster tolerates one manager failure. With five, it tolerates two. Avoid running an even number of managers: an even count adds no fault tolerance (four managers still tolerate only one failure) while increasing the number of nodes whose loss can cost you quorum.
Important: Manager nodes also run workloads by default. In production, you may want to drain managers so they only handle orchestration duties, especially in larger clusters.
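Draining managers takes one command per node; a minimal sketch, assuming hypothetical node names manager-1 through manager-3:

```shell
# Reserve managers for orchestration duties only
docker node update --availability drain manager-1
docker node update --availability drain manager-2
docker node update --availability drain manager-3

# Confirm no tasks remain scheduled on a drained manager
docker node ps manager-1
```

Existing tasks on a drained node are rescheduled onto the remaining Active nodes.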
Prerequisites
Before initializing your Swarm, ensure the following on all nodes:
- Docker Engine 19.03 or later installed (Swarm mode is built in)
- The following ports open between all nodes:
- 2377/tcp — Cluster management communications
- 7946/tcp + 7946/udp — Node-to-node communication
- 4789/udp — Overlay network traffic (VXLAN)
- Stable hostnames or static IPs for manager nodes
- Time synchronized across all nodes (NTP)
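On hosts running ufw, the required ports could be opened as follows; this is a sketch for one common firewall, so adapt it to whatever you use (firewalld, cloud security groups, etc.):

```shell
# Cluster management communications
sudo ufw allow 2377/tcp
# Node-to-node gossip
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
# Overlay network (VXLAN) traffic
sudo ufw allow 4789/udp
```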
Initializing the Swarm
On your first manager node, initialize the Swarm:
# Initialize on the first manager node
docker swarm init --advertise-addr 192.168.1.10
# Output:
# Swarm initialized: current node (abc123def) is now a manager.
# To add a worker to this swarm, run the following command:
# docker swarm join --token SWMTKN-1-abc123... 192.168.1.10:2377
# To add a manager to this swarm, run 'docker swarm join-token manager'
The --advertise-addr flag specifies the address other nodes will use to connect. This is critical on multi-homed servers. If your node has only one IP, Docker will auto-detect it.
Adding Worker Nodes
On each worker machine, run the join command provided during initialization:
# On each worker node
docker swarm join --token SWMTKN-1-abc123... 192.168.1.10:2377
# If you lost the token, retrieve it from any manager:
docker swarm join-token worker
Adding Additional Manager Nodes
# Get the manager join token from an existing manager
docker swarm join-token manager
# On the new manager node
docker swarm join --token SWMTKN-1-mgr456... 192.168.1.10:2377
Verifying the Cluster
# List all nodes (run on a manager)
docker node ls
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
# abc123def * manager-1 Ready Active Leader
# def456ghi manager-2 Ready Active Reachable
# ghi789jkl manager-3 Ready Active Reachable
# jkl012mno worker-1 Ready Active
# mno345pqr worker-2 Ready Active
Deploying Your First Service
Services are the fundamental deployment unit in Swarm. A service defines which container image to run, how many replicas to maintain, and how to expose the application to the network.
# Create a simple nginx service with 3 replicas
docker service create \
--name web \
--replicas 3 \
--publish published=80,target=80 \
nginx:alpine
# List running services
docker service ls
# ID NAME MODE REPLICAS IMAGE
# r5s3k7p2q1 web replicated 3/3 nginx:alpine
# See where replicas are running
docker service ps web
# ID NAME IMAGE NODE DESIRED STATE CURRENT STATE
# a1b2c3d4e5 web.1 nginx:alpine worker-1 Running Running 30 seconds ago
# f6g7h8i9j0 web.2 nginx:alpine worker-2 Running Running 28 seconds ago
# k1l2m3n4o5 web.3 nginx:alpine manager-1 Running Running 29 seconds ago
Scaling Services
# Scale up to 5 replicas
docker service scale web=5
# Scale multiple services at once
docker service scale web=5 api=3 cache=2
# Check progress as new replicas come up
docker service ps web --filter "desired-state=running"
Inspecting Services
# Detailed service information
docker service inspect --pretty web
# View service logs (aggregated from all replicas)
docker service logs web
docker service logs web --follow --tail 100
Rolling Updates
Swarm provides built-in rolling updates that replace containers one (or more) at a time, with configurable delays and failure thresholds:
# Update the image with a rolling update
docker service update \
--image nginx:1.25-alpine \
--update-parallelism 2 \
--update-delay 10s \
--update-failure-action rollback \
--update-max-failure-ratio 0.25 \
web
These parameters mean:
- --update-parallelism 2 — Update 2 replicas at a time
- --update-delay 10s — Wait 10 seconds between batches
- --update-failure-action rollback — Automatically rollback if updates fail
- --update-max-failure-ratio 0.25 — Tolerate up to 25% failures before triggering rollback
# Manually rollback to the previous version
docker service rollback web
# Check rollback status
docker service ps web
Always set --update-failure-action rollback in production. Without it, a bad image will replace all your healthy containers one by one until the entire service is down.
Overlay Networking
Overlay networks enable containers on different nodes to communicate as if they were on the same local network. Swarm handles the VXLAN encapsulation transparently.
# Create an overlay network
docker network create \
--driver overlay \
--subnet 10.0.10.0/24 \
--attachable \
app-network
# Create services on the same overlay network
docker service create \
--name api \
--network app-network \
--replicas 3 \
myapp/api:latest
docker service create \
--name postgres \
--network app-network \
--replicas 1 \
--mount type=volume,source=pgdata,target=/var/lib/postgresql/data \
postgres:16-alpine
The --attachable flag allows standalone containers (not just services) to join the overlay network, which is useful during debugging.
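For example, you could attach a throwaway container for debugging; a sketch assuming the app-network and api service created above:

```shell
# Start an interactive container on the attachable overlay network
docker run --rm -it --network app-network alpine sh

# Inside the container, Swarm DNS resolves service names:
#   ping api
#   nslookup postgres
```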
Service Discovery
Swarm provides built-in DNS-based service discovery. Every service gets a DNS entry that resolves to the virtual IP (VIP) of the service, which load-balances across all healthy replicas:
# From inside any container on the same network:
# 'postgres' resolves to the VIP of the postgres service
# 'api' resolves to the VIP of the api service
# You can also query tasks.<service-name> to resolve individual task IPs
nslookup tasks.api
# Returns one IP per replica
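If a client library does its own load balancing, you can skip the VIP entirely with DNS round-robin endpoint mode; a sketch:

```shell
# With dnsrr, the service name resolves directly to the task IPs
# instead of a single virtual IP
docker service create \
  --name api-dnsrr \
  --replicas 3 \
  --network app-network \
  --endpoint-mode dnsrr \
  myapp/api:latest
```

Note that dnsrr cannot be combined with ingress-mode published ports; use host-mode publishing if such a service must expose a port.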
Ingress Routing Mesh
When you publish a port, Swarm creates an ingress routing mesh. Any node in the cluster can accept traffic on that port, even if it is not running a replica of the service. The mesh routes the request to an available replica:
# Published port 80 is accessible on ALL nodes
docker service create \
--name web \
--publish published=80,target=80 \
--replicas 3 \
nginx:alpine
# Hitting ANY node IP on port 80 reaches the service:
curl http://192.168.1.10 # manager-1
curl http://192.168.1.11 # manager-2
curl http://192.168.1.20 # worker-1 (even if no replica runs here)
To bypass the mesh and bind the published port directly on each host, use mode=host:
docker service create \
--name web-direct \
--publish published=80,target=80,mode=host \
--mode global \
nginx:alpine
Stack Deploy with Compose Files
For production deployments, define your entire application in a Compose file and deploy it as a stack. This is the recommended way to manage Swarm services:
# docker-stack.yml
version: "3.8"
services:
web:
image: myapp/web:2.1.0
deploy:
replicas: 3
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
rollback_config:
parallelism: 1
delay: 5s
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
window: 120s
resources:
limits:
cpus: "0.50"
memory: 256M
reservations:
cpus: "0.25"
memory: 128M
placement:
constraints:
- node.role == worker
ports:
- "80:8080"
networks:
- frontend
- backend
api:
image: myapp/api:2.1.0
deploy:
replicas: 2
update_config:
parallelism: 1
delay: 15s
failure_action: rollback
placement:
constraints:
- node.role == worker
environment:
DATABASE_URL: postgres://app:secret@db:5432/myapp
REDIS_URL: redis://cache:6379
networks:
- backend
secrets:
- db_password
- api_key
db:
image: postgres:16-alpine
deploy:
replicas: 1
placement:
constraints:
- node.labels.storage == ssd
volumes:
- pgdata:/var/lib/postgresql/data
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
networks:
- backend
secrets:
- db_password
cache:
image: redis:7-alpine
deploy:
replicas: 1
networks:
- backend
networks:
frontend:
driver: overlay
backend:
driver: overlay
internal: true
volumes:
pgdata:
secrets:
db_password:
external: true
api_key:
external: true
# Create secrets first
echo "supersecretpassword" | docker secret create db_password -
echo "my-api-key-value" | docker secret create api_key -
# Deploy the stack
docker stack deploy -c docker-stack.yml myapp
# List stacks
docker stack ls
# List services in a stack
docker stack services myapp
# View tasks across all services
docker stack ps myapp
# Remove a stack
docker stack rm myapp
Placement Constraints and Preferences
Control where services run using labels and constraints:
# Label nodes
docker node update --label-add storage=ssd worker-1
docker node update --label-add storage=hdd worker-2
docker node update --label-add region=us-east worker-1
docker node update --label-add region=us-west worker-2
# Constrain service to SSD nodes
docker service create \
--name db \
--constraint 'node.labels.storage == ssd' \
postgres:16
# Spread across regions (soft preference)
docker service create \
--name web \
--replicas 4 \
--placement-pref 'spread=node.labels.region' \
nginx:alpine
Managing Secrets and Configs
Swarm provides encrypted secret management and configuration objects:
# Create a secret from a file
docker secret create tls_cert ./server.crt
docker secret create tls_key ./server.key
# Create a config object
docker config create nginx_conf ./nginx.conf
# Use in a service
docker service create \
--name proxy \
--secret tls_cert \
--secret tls_key \
--config source=nginx_conf,target=/etc/nginx/nginx.conf \
nginx:alpine
# Secrets are mounted at /run/secrets/ inside the container
# They are stored encrypted in the Raft log and only sent to nodes
# that need them
Note: Secrets are only available to Swarm services. If you try docker run with --secret, it will fail. Use docker service create or stack deploy instead.
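Because secrets are immutable, rotating one means creating a new secret and pointing the service at it; a sketch using a hypothetical db_password_v2 and the myapp_db service name from the stack example:

```shell
# Create the new secret version
echo "newsecretpassword" | docker secret create db_password_v2 -

# Swap it in, keeping the same mount path inside the container
docker service update \
  --secret-rm db_password \
  --secret-add source=db_password_v2,target=db_password \
  myapp_db

# Remove the old secret once no service references it
docker secret rm db_password
```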
Health Checks and Self-Healing
Swarm uses health checks to determine whether a container is ready to receive traffic. Unhealthy containers are stopped and replaced automatically:
docker service create \
--name api \
--replicas 3 \
--health-cmd "curl -f http://localhost:8080/health || exit 1" \
--health-interval 10s \
--health-timeout 5s \
--health-retries 3 \
--health-start-period 30s \
myapp/api:latest
The --health-start-period gives the container time to start up before failed health checks are counted against it. This is critical for slow-starting applications such as JVM-based services.
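You can verify that health checks are passing by inspecting a task's container on the node where it runs; a sketch (the container ID placeholder is whatever `docker ps` shows on that node):

```shell
# On the node running the task, show the container's health status
docker inspect --format '{{.State.Health.Status}}' <container-id>

# Replicas that fail their health checks are replaced; only the
# currently healthy tasks remain in the running state
docker service ps api --filter "desired-state=running"
```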
Draining Nodes for Maintenance
# Drain a node (existing tasks are moved to other nodes)
docker node update --availability drain worker-1
# Perform maintenance on worker-1...
# Bring it back
docker node update --availability active worker-1
# Pause a node (no new tasks, existing tasks keep running)
docker node update --availability pause worker-2
Monitoring Your Swarm
Keeping visibility into your Swarm cluster is essential. Use the built-in commands alongside external monitoring tools:
# Cluster-wide view
docker node ls
docker service ls
docker stack ps myapp --filter "desired-state=running"
# Node-level resource usage
docker node ps worker-1
# Service-level logs
docker service logs myapp_web --since 1h --follow
# System-wide events
docker events --filter type=service --since 1h
For production clusters, tools like usulnet provide a centralized dashboard where you can monitor all Swarm services, view logs, and manage deployments across multiple nodes without switching between terminal sessions.
Production Hardening Checklist
- Use an odd number of managers (3 for most clusters, 5 for large deployments)
- Drain manager nodes in clusters with more than 5 total nodes
- Enable autolock to encrypt the Raft log at rest:
docker swarm update --autolock=true
# Save the unlock key securely!
docker swarm unlock-key
- Rotate join tokens periodically:
docker swarm join-token --rotate worker
docker swarm join-token --rotate manager
- Set resource limits on all services to prevent noisy neighbors
- Use overlay networks with encryption:
docker network create --driver overlay --opt encrypted secure-net
- Implement health checks on every service
- Use secrets instead of environment variables for sensitive data
- Back up the Swarm state regularly:
sudo tar czf swarm-backup.tar.gz /var/lib/docker/swarm
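Restoring that backup on a replacement manager follows the documented pattern: stop Docker, restore the swarm directory, then force a new single-manager cluster; a sketch:

```shell
# On the replacement manager node
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm
sudo tar xzf swarm-backup.tar.gz -C /
sudo systemctl start docker

# Re-initialize from the restored state as a one-manager cluster,
# then re-join the remaining managers and workers
docker swarm init --force-new-cluster
```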
Common Pitfalls
| Problem | Cause | Solution |
|---|---|---|
| Service stuck at 0/N replicas | Image pull failure or constraint mismatch | Check docker service ps --no-trunc <service> |
| Overlay network unreachable | Firewall blocking port 4789/udp | Open VXLAN port between all nodes |
| Cluster lost quorum | Majority of managers down | Force new cluster: docker swarm init --force-new-cluster |
| Tasks keep restarting | Application crash or OOM kill | Check logs and increase memory limits |
| Stack deploy hangs | Secret or config not found | Create external secrets/configs before deploying |
Conclusion
Docker Swarm remains a powerful and underappreciated orchestration platform. Its tight integration with Docker, zero-dependency setup, and intuitive service model make it an excellent choice for teams that need multi-node container orchestration without the complexity of Kubernetes. For small to medium clusters—especially those already invested in the Docker ecosystem—Swarm delivers production-grade orchestration with a fraction of the operational overhead.
Start with a three-node cluster, deploy your first stack, and iterate from there. The Compose-native workflow means you can reuse your existing development Compose files with minimal modifications for Swarm deployment.