Docker Swarm Troubleshooting: Diagnosing and Fixing Common Issues
Docker Swarm failures fall into predictable categories: services that won't start, tasks stuck in pending state, network connectivity issues, split-brain scenarios from lost quorum, and nodes that fall out of the cluster unexpectedly. Each has a systematic diagnostic path. This guide provides the decision tree for each failure mode, the exact commands to run, and the fixes that resolve them.
Before diving into specific issues, internalize this workflow: check service status, inspect task errors, examine node availability, verify network connectivity, and review Docker daemon logs. Every Swarm problem is diagnosed through this sequence.
Diagnostic Command Reference
Bookmark these commands. You will use them in every troubleshooting session:
# Cluster overview
docker node ls # Node status and roles
docker service ls # All services with replica counts
docker stack ls # All deployed stacks
# Service diagnostics
docker service ps SERVICE_NAME # Task history and placement
docker service ps --no-trunc SERVICE # Full error messages
docker service inspect --pretty SERVICE # Service configuration
docker service logs SERVICE_NAME # Aggregated logs from all tasks
# Node diagnostics
docker node inspect --pretty NODE_ID # Node details and labels
docker node ps NODE_ID # All tasks on a specific node
# Container diagnostics (run on the node hosting the container)
docker inspect CONTAINER_ID # Full container config
docker logs CONTAINER_ID # Container logs
docker stats CONTAINER_ID # Live resource usage
# System diagnostics
docker system info # Docker and Swarm configuration
docker system df # Disk usage
docker events # Real-time event stream
Problem: Service Won't Start
You deploy a service and the replicas stay at 0/N. This is the most common Swarm issue and has several possible causes.
Step 1: Check Task Errors
# Show task history with full error messages
docker service ps --no-trunc myapp_api
# Look for the error column. Common errors:
# "No such image" - Image doesn't exist or registry auth failed
# "insufficient resources" - Not enough CPU/memory on available nodes
# "no suitable node" - Placement constraints cannot be satisfied
# "task: non-zero exit (1)" - Container crashed on startup
Common Causes and Fixes
| Error | Cause | Fix |
|---|---|---|
| `No such image` | Image not found in registry | Verify image name/tag; check registry auth with `--with-registry-auth` |
| `no suitable node` | Placement constraints don't match any node | Check `docker node ls` and node labels; verify constraints |
| `insufficient resources` | Resource reservations exceed available capacity | Add nodes, reduce reservations, or free resources |
| `non-zero exit` | Container crashes on startup | Check logs: `docker service logs myapp_api` |
| `starting container failed` | Docker daemon issue on the node | Check `journalctl -u docker` on the target node |
| `secret not found` | Referenced secret doesn't exist | Create it with `docker secret create`; verify with `docker secret ls` |
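For the `secret not found` case specifically: a stack service can only reference secrets declared in the stack file. A minimal sketch (the `db_password` name is illustrative), assuming the secret itself was created out of band:

```yaml
# docker-compose.yml fragment: mount a pre-existing secret into the service
services:
  api:
    image: myapp/api:v2.1.0
    secrets:
      - db_password

secrets:
  db_password:
    external: true   # must already exist: docker secret create db_password -
```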
Step 2: Check Image Availability
# Verify the image exists in the registry
docker pull myapp/api:v2.1.0
# If using a private registry, check authentication
docker login myregistry.com
# For stack deployments, ensure --with-registry-auth
docker stack deploy -c docker-compose.yml --with-registry-auth myapp
Step 3: Check Placement Constraints
# List all node labels
docker node ls -q | xargs -I {} sh -c \
'echo "--- $(docker node inspect --format "{{.Description.Hostname}}" {}) ---"; \
docker node inspect --format "{{json .Spec.Labels}}" {} | jq .'
# Check if any node matches the service's constraints
docker service inspect myapp_api \
--format '{{json .Spec.TaskTemplate.Placement.Constraints}}' | jq .
# Common fix: add the missing label
docker node update --label-add tier=backend worker-01
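The service side of that fix, sketched as a stack-file fragment (service name is illustrative): the constraint below schedules tasks only onto nodes carrying the `tier=backend` label added above.

```yaml
# docker-compose.yml fragment: constraint matching the tier=backend label
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      placement:
        constraints:
          - node.labels.tier == backend
```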
Problem: Tasks Stuck in "Pending" State
Tasks in "pending" state have been scheduled but cannot be assigned to a node. This is different from tasks that start and fail.
# Identify pending tasks
docker service ps myapp_api --filter "desired-state=running" \
--format "{{.ID}} {{.Name}} {{.CurrentState}} {{.Error}}"
# Check for resource constraints
docker node ls --format "{{.Hostname}} {{.Status}} {{.Availability}}"
Causes of Pending Tasks
- No available nodes: All nodes are drained, down, or paused
- Resource exhaustion: CPU/memory reservations exceed what any node can provide
- Unsatisfiable constraints: No node has the required labels
- Port conflicts: In host mode, the port is already in use on all eligible nodes
- Volume mount issues: A required bind mount path doesn't exist on any eligible node
# Check node resource availability
for node in $(docker node ls -q); do
hostname=$(docker node inspect --format '{{.Description.Hostname}}' "$node")
cpus=$(docker node inspect --format '{{.Description.Resources.NanoCPUs}}' "$node")
mem=$(docker node inspect --format '{{.Description.Resources.MemoryBytes}}' "$node")
avail=$(docker node inspect --format '{{.Spec.Availability}}' "$node")
echo "$hostname: CPUs=$(echo "$cpus / 1000000000" | bc), Memory=$(echo "$mem / 1048576" | bc)MB, Availability=$avail"
done
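If reservations are the culprit, compare them against the node capacities printed above. A stack-file sketch (values illustrative): the scheduler checks reservations, not limits, so it is the reservation that must fit on some node before a pending task can be placed.

```yaml
# docker-compose.yml fragment: the scheduler places tasks by reservations
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      resources:
        reservations:
          cpus: "0.50"    # a node must have 0.5 CPU unreserved
          memory: 256M    # and 256MB unreserved memory
        limits:
          cpus: "1.0"
          memory: 512M
```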
Problem: Network Connectivity Issues
Services cannot communicate with each other, DNS resolution fails, or external traffic is not reaching published ports.
Service-to-Service Communication Failure
# Verify both services are on the same network
docker service inspect --format '{{json .Spec.TaskTemplate.Networks}}' svc_a | jq .
docker service inspect --format '{{json .Spec.TaskTemplate.Networks}}' svc_b | jq .
# Test DNS resolution from inside a container
docker exec -it $(docker ps -q -f name=svc_a) nslookup svc_b
# Test connectivity
docker exec -it $(docker ps -q -f name=svc_a) wget -qO- http://svc_b:8080/health
# If DNS fails, check the embedded DNS resolver
docker exec -it $(docker ps -q -f name=svc_a) cat /etc/resolv.conf
# Should show: nameserver 127.0.0.11
Ingress Routing Mesh Not Working
# Verify the service publishes ports
docker service inspect --format '{{json .Endpoint.Ports}}' myapp_web | jq .
# Check if the port is listening on all nodes
ss -tlnp | grep :80
# Test from each node
for node_ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "Testing $node_ip..."
curl -s -o /dev/null -w "%{http_code}" http://$node_ip:80
echo ""
done
# Check ingress network health
docker network inspect ingress
Stale Network Endpoints
After node failures or network partitions, overlay networks can accumulate stale endpoints that cause routing issues:
# Inspect the network for stale containers
docker network inspect my-app-network \
--format '{{range .Containers}}{{.Name}} ({{.IPv4Address}}){{println}}{{end}}'
# Remove the service from the network and re-add
docker service update \
--network-rm my-app-network \
--network-add my-app-network \
myapp_api
# Nuclear option: recreate the overlay network
# WARNING: This requires updating all services on this network
docker network rm my-app-network
docker network create --driver overlay my-app-network
Problem: Split-Brain and Quorum Loss
Despite the name, Raft's majority requirement prevents a true split-brain (two active leaders). What you actually encounter is quorum loss: the Swarm's Raft consensus cluster cannot elect a leader because a majority of manager nodes became unreachable at the same time, so all orchestration operations fail until quorum is restored.
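The quorum arithmetic is worth internalizing: N managers need a majority of floor(N/2)+1 reachable to elect a leader, so the cluster tolerates floor((N-1)/2) simultaneous manager failures. A quick sketch:

```shell
# Raft quorum arithmetic for a Swarm manager set of size N
quorum_info() {
  n=$1
  needed=$(( n / 2 + 1 ))        # managers required for quorum
  tolerated=$(( (n - 1) / 2 ))   # failures survivable
  echo "$n managers: need $needed for quorum, tolerate $tolerated failure(s)"
}

for n in 1 3 5 7; do quorum_info "$n"; done
```

This is why even manager counts buy nothing: 4 managers tolerate the same single failure as 3, while adding one more node that can break quorum.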
Symptoms
# On a manager that lost quorum:
docker node ls
# Error response from daemon: rpc error: code = Unknown
# desc = The swarm does not have a leader.
docker service ls
# Error response from daemon: rpc error: code = Unknown
# desc = The swarm does not have a leader.
Recovery: Managers Still Running
If the managers are running but cannot reach each other (network partition):
# Step 1: Identify which managers are reachable
for manager_ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "Testing $manager_ip..."
nc -zv $manager_ip 2377 2>&1
done
# Step 2: Fix network connectivity between managers
# This is environment-specific: check firewalls, VPN, physical network
# Step 3: Once connectivity is restored, Raft will re-elect a leader
# This is automatic; wait 30-60 seconds
# Step 4: Verify
docker node ls
Recovery: Majority of Managers Lost
If a majority of managers have permanently failed (disk failure, terminated instances):
# LAST RESORT: Force a new cluster from a single manager
# This creates a new single-node Raft cluster from the surviving manager
# Run on the remaining manager node:
docker swarm init --force-new-cluster --advertise-addr 10.0.1.10
# Rejoin other managers
docker swarm join-token manager
# Use the token on new manager nodes
# Verify cluster health
docker node ls
Important:
`--force-new-cluster` preserves all services, networks, secrets, and configs from the existing Raft state on this node. It does not destroy data. However, it creates a single-manager cluster with no fault tolerance until you add more managers.
Problem: Node Availability Issues
# Check node status
docker node ls
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
# abc123 manager-01 Ready Active Leader
# def456 worker-01 Ready Active
# ghi789 worker-02 Down Active
# jkl012 worker-03 Ready Drain
# Node shows "Down" - the node cannot reach the manager
# Check on the node itself:
systemctl status docker
journalctl -u docker --since "1 hour ago"
# Common causes:
# - Docker daemon crashed
# - Network partition
# - Node ran out of disk space
# - Time sync issues (TLS cert validation fails)
Recovering a Down Node
# On the down node:
# Check Docker daemon
systemctl status docker
# Check disk space
df -h /var/lib/docker
# Check time sync
timedatectl status
chronyc tracking
# Restart Docker if needed
systemctl restart docker
# If node's Swarm state is corrupted:
docker swarm leave --force
# Then rejoin from a manager:
docker swarm join --token SWMTKN-xxx 10.0.1.10:2377
Problem: Raft Consensus Issues
# Check Raft status
docker info 2>/dev/null | grep -A10 "Swarm"
# Key fields to check:
# Is Manager: true
# Raft:
# Snapshot Interval: 10000
# Number of Old Snapshots to Retain: 0
# Heartbeat Tick: 1
# Election Tick: 10
# Check manager reachability
docker node inspect --format '{{.ManagerStatus.Reachability}}' $(docker node ls -q --filter role=manager)
# Should show "reachable" for all managers
Raft Log Compaction Issues
# If the Raft log grows too large (managers running low on disk)
# Check Raft data size
du -sh /var/lib/docker/swarm/raft/
# If the raft directory is huge, the cause is usually:
# 1. Too many service updates without cleanup
# 2. Large number of secrets/configs
# 3. High task churn
# Clean up services scaled to zero replicas (only if they are truly unused)
docker service ls --format '{{.Replicas}} {{.ID}}' | \
  awk '$1 == "0/0" {print $2}' | xargs -r docker service rm
# Force a Raft snapshot (reduces log size)
# Snapshots happen automatically every "Snapshot Interval" entries; you can
# also trigger one by restarting the Docker daemon on the current leader,
# which forces a re-election:
systemctl restart docker   # run on the leader node
Problem: Service Update Stuck or Failed
# Check update status
docker service inspect --pretty myapp_api | grep -A10 "UpdateStatus"
# If update is paused due to failure:
# Option 1: Fix the issue and retry
docker service update --force myapp_api
# Option 2: Roll back
docker service rollback myapp_api
# Option 3: Roll back to a specific image
docker service update --image myapp/api:v2.0.0 myapp_api
# Check why tasks are failing during update
docker service ps --no-trunc myapp_api | grep -v "Running"
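To keep future updates from appearing stuck, you can make rollback automatic instead of relying on the default pause-on-failure behavior. A stack-file sketch (values illustrative):

```yaml
# docker-compose.yml fragment: roll back automatically on failed updates
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback   # default is "pause", which looks "stuck"
        monitor: 30s               # watch each new task this long for failure
      rollback_config:
        parallelism: 1
```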
Log Analysis
Docker daemon logs are the last line of defense when container-level and service-level diagnostics fail:
# View Docker daemon logs
# On systemd-based systems:
journalctl -u docker --since "30 minutes ago" --no-pager
# Filter for Swarm-related messages
journalctl -u docker | grep -i "swarm\|raft\|cluster"
# Filter for specific error levels
journalctl -u docker -p err --since "1 hour ago"
# Follow logs in real time
journalctl -u docker -f
Common Daemon Log Messages and Their Meaning
| Log Message | Meaning | Action |
|---|---|---|
| `raft: became follower` | This manager lost leadership (normal during rotation) | None if a new leader was elected |
| `raft: election timeout` | Cannot reach other managers for election | Check network between managers |
| `node is not a swarm manager` | A worker node was asked to perform a manager operation | Direct commands to a manager node |
| `failed to allocate gateway` | Network address pool exhausted | Prune unused networks, increase subnet size |
| `transport: dial` | Cannot establish TLS connection to another node | Check certificates, time sync, firewall |
Quick Troubleshooting Flowchart
1. Can you run `docker node ls`? No: quorum issue (see split-brain section). Yes: continue.
2. Does `docker service ls` show 0/N replicas? Yes: check task errors with `docker service ps --no-trunc`. No: continue.
3. Are tasks in "Pending" state? Yes: check constraints, resources, node availability. No: continue.
4. Are tasks starting and crashing? Yes: check `docker service logs` for application errors. No: continue.
5. Is the service running but unreachable? Yes: check network attachment, DNS, published ports. No: continue.
6. Is performance degraded? Yes: check `docker stats`, node resource utilization, network latency.
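The replica check in step 2 is easy to script. A minimal sketch that parses the `NAME x/y` lines emitted by `docker service ls --format '{{.Name}} {{.Replicas}}'` (the service names below are canned illustrative input):

```shell
# Flag services whose running replica count is below the desired count.
# Feed it the output of:
#   docker service ls --format '{{.Name}} {{.Replicas}}'
flag_degraded() {
  while read -r name replicas; do
    running=${replicas%%/*}     # "0/3" -> "0"
    desired=${replicas#*/}      # "0/3" -> "3"
    desired=${desired%% *}      # drop a "(max N per node)" suffix if present
    if [ "$running" -lt "$desired" ]; then
      echo "DEGRADED: $name ($replicas)"
    fi
  done
}

# Canned example; in production, pipe real `docker service ls` output in:
printf '%s\n' 'myapp_api 0/3' 'myapp_web 2/2' | flag_degraded
```

Wired into cron or a monitoring agent, this gives an early warning before users notice missing replicas.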
Conclusion
Swarm troubleshooting is methodical. Start with docker service ps --no-trunc for the error message, check node availability and resources, verify network connectivity, and inspect Docker daemon logs when all else fails. The commands in this guide cover the vast majority of production Swarm issues. Keep them in a runbook accessible to your on-call team.