Docker Swarm Troubleshooting: Diagnosing and Fixing Common Issues
Docker Swarm failures fall into predictable categories: services that won't start, tasks stuck in pending state, network connectivity issues, split-brain scenarios from lost quorum, and nodes that fall out of the cluster unexpectedly. Each has a systematic diagnostic path. This guide provides the decision tree for each failure mode, the exact commands to run, and the fixes that resolve them.
Before diving into specific issues, internalize this workflow: check service status, inspect task errors, examine node availability, verify network connectivity, and review Docker daemon logs. Every Swarm problem is diagnosed through this sequence.
Diagnostic Command Reference
Bookmark these commands. You will use them in every troubleshooting session:
# Cluster overview
docker node ls # Node status and roles
docker service ls # All services with replica counts
docker stack ls # All deployed stacks
# Service diagnostics
docker service ps SERVICE_NAME # Task history and placement
docker service ps --no-trunc SERVICE # Full error messages
docker service inspect --pretty SERVICE # Service configuration
docker service logs SERVICE_NAME # Aggregated logs from all tasks
# Node diagnostics
docker node inspect --pretty NODE_ID # Node details and labels
docker node ps NODE_ID # All tasks on a specific node
# Container diagnostics (run on the node hosting the container)
docker inspect CONTAINER_ID # Full container config
docker logs CONTAINER_ID # Container logs
docker stats CONTAINER_ID # Live resource usage
# System diagnostics
docker system info # Docker and Swarm configuration
docker system df # Disk usage
docker events # Real-time event stream
Problem: Service Won't Start
You deploy a service and the replicas stay at 0/N. This is the most common Swarm issue and has several possible causes.
Step 1: Check Task Errors
# Show task history with full error messages
docker service ps --no-trunc myapp_api
# Look for the error column. Common errors:
# "No such image" - Image doesn't exist or registry auth failed
# "insufficient resources" - Not enough CPU/memory on available nodes
# "no suitable node" - Placement constraints cannot be satisfied
# "task: non-zero exit (1)" - Container crashed on startup
Common Causes and Fixes
| Error | Cause | Fix |
|---|---|---|
| `No such image` | Image not found in registry | Verify image name/tag; check registry auth with `--with-registry-auth` |
| `no suitable node` | Placement constraints don't match any node | Check `docker node ls` and node labels; verify constraints |
| `insufficient resources` | Resource reservations exceed available capacity | Add nodes, reduce reservations, or free resources |
| `non-zero exit` | Container crashes on startup | Check logs: `docker service logs myapp_api` |
| `starting container failed` | Docker daemon issue on the node | Check `journalctl -u docker` on the target node |
| `secret not found` | Referenced secret doesn't exist | Create it with `docker secret create`; verify with `docker secret ls` |
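For the `secret not found` case specifically: a stack service can only reference secrets declared in the stack file. A minimal sketch (the `db_password` name is illustrative), assuming the secret itself was created out of band:

```yaml
# docker-compose.yml fragment: mount a pre-existing secret into the service
services:
  api:
    image: myapp/api:v2.1.0
    secrets:
      - db_password

secrets:
  db_password:
    external: true   # must already exist: docker secret create db_password -
```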
Step 2: Check Image Availability
# Verify the image exists in the registry
docker pull myapp/api:v2.1.0
# If using a private registry, check authentication
docker login myregistry.com
# For stack deployments, ensure --with-registry-auth
docker stack deploy -c docker-compose.yml --with-registry-auth myapp
Step 3: Check Placement Constraints
# List all node labels
docker node ls -q | xargs -I {} sh -c \
'echo "--- $(docker node inspect --format "{{.Description.Hostname}}" {}) ---"; \
docker node inspect --format "{{json .Spec.Labels}}" {} | jq .'
# Check if any node matches the service's constraints
docker service inspect myapp_api \
--format '{{json .Spec.TaskTemplate.Placement.Constraints}}' | jq .
# Common fix: add the missing label
docker node update --label-add tier=backend worker-01
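The service side of that fix, sketched as a stack-file fragment (service name is illustrative): the constraint below schedules tasks only onto nodes carrying the `tier=backend` label added above.

```yaml
# docker-compose.yml fragment: constraint matching the tier=backend label
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      placement:
        constraints:
          - node.labels.tier == backend
```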
Problem: Tasks Stuck in "Pending" State
Tasks in "pending" state have been scheduled but cannot be assigned to a node. This is different from tasks that start and fail.
# Identify pending tasks
docker service ps myapp_api --filter "desired-state=running" \
--format "{{.ID}} {{.Name}} {{.CurrentState}} {{.Error}}"
# Check for resource constraints
docker node ls --format "{{.Hostname}} {{.Status}} {{.Availability}}"
Causes of Pending Tasks
- No available nodes: All nodes are drained, down, or paused
- Resource exhaustion: CPU/memory reservations exceed what any node can provide
- Unsatisfiable constraints: No node has the required labels
- Port conflicts: In host mode, the port is already in use on all eligible nodes
- Volume mount issues: A required bind mount path doesn't exist on any eligible node
# Check node resource availability
for node in $(docker node ls -q); do
hostname=$(docker node inspect --format '{{.Description.Hostname}}' "$node")
cpus=$(docker node inspect --format '{{.Description.Resources.NanoCPUs}}' "$node")
mem=$(docker node inspect --format '{{.Description.Resources.MemoryBytes}}' "$node")
avail=$(docker node inspect --format '{{.Spec.Availability}}' "$node")
echo "$hostname: CPUs=$(echo "$cpus / 1000000000" | bc), Memory=$(echo "$mem / 1048576" | bc)MB, Availability=$avail"
done
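If reservations are the culprit, compare them against the node capacities printed above. A stack-file sketch (values illustrative): the scheduler checks reservations, not limits, so it is the reservation that must fit on some node before a pending task can be placed.

```yaml
# docker-compose.yml fragment: the scheduler places tasks by reservations
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      resources:
        reservations:
          cpus: "0.50"    # a node must have 0.5 CPU unreserved
          memory: 256M    # and 256MB unreserved memory
        limits:
          cpus: "1.0"
          memory: 512M
```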
Problem: Network Connectivity Issues
Services cannot communicate with each other, DNS resolution fails, or external traffic is not reaching published ports.
Service-to-Service Communication Failure
# Verify both services are on the same network
docker service inspect --format '{{json .Spec.TaskTemplate.Networks}}' svc_a | jq .
docker service inspect --format '{{json .Spec.TaskTemplate.Networks}}' svc_b | jq .
# Test DNS resolution from inside a container
docker exec -it $(docker ps -q -f name=svc_a) nslookup svc_b
# Test connectivity
docker exec -it $(docker ps -q -f name=svc_a) wget -qO- http://svc_b:8080/health
# If DNS fails, check the embedded DNS resolver
docker exec -it $(docker ps -q -f name=svc_a) cat /etc/resolv.conf
# Should show: nameserver 127.0.0.11
Ingress Routing Mesh Not Working
# Verify the service publishes ports
docker service inspect --format '{{json .Endpoint.Ports}}' myapp_web | jq .
# Check if the port is listening on all nodes
ss -tlnp | grep :80
# Test from each node
for node_ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "Testing $node_ip..."
curl -s -o /dev/null -w "%{http_code}" http://$node_ip:80
echo ""
done
# Check ingress network health
docker network inspect ingress
Stale Network Endpoints
After node failures or network partitions, overlay networks can accumulate stale endpoints that cause routing issues:
# Inspect the network for stale containers
docker network inspect my-app-network \
--format '{{range .Containers}}{{.Name}} ({{.IPv4Address}}){{println}}{{end}}'
# Remove the service from the network and re-add
docker service update \
--network-rm my-app-network \
--network-add my-app-network \
myapp_api
# Nuclear option: recreate the overlay network
# WARNING: This requires updating all services on this network
docker network rm my-app-network
docker network create --driver overlay my-app-network
Problem: Split-Brain and Quorum Loss
Despite the name, Raft's majority requirement prevents a true split-brain (two active leaders). What you actually encounter is quorum loss: the Swarm's Raft consensus cluster cannot elect a leader because a majority of manager nodes became unreachable at the same time, so all orchestration operations fail until quorum is restored.
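The quorum arithmetic is worth internalizing: N managers need a majority of floor(N/2)+1 reachable to elect a leader, so the cluster tolerates floor((N-1)/2) simultaneous manager failures. A quick sketch:

```shell
# Raft quorum arithmetic for a Swarm manager set of size N
quorum_info() {
  n=$1
  needed=$(( n / 2 + 1 ))        # managers required for quorum
  tolerated=$(( (n - 1) / 2 ))   # failures survivable
  echo "$n managers: need $needed for quorum, tolerate $tolerated failure(s)"
}

for n in 1 3 5 7; do quorum_info "$n"; done
```

This is why even manager counts buy nothing: 4 managers tolerate the same single failure as 3, while adding one more node that can break quorum.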
Symptoms
# On a manager that lost quorum:
docker node ls
# Error response from daemon: rpc error: code = Unknown
# desc = The swarm does not have a leader.
docker service ls
# Error response from daemon: rpc error: code = Unknown
# desc = The swarm does not have a leader.
Recovery: Managers Still Running
If the managers are running but cannot reach each other (network partition):
# Step 1: Identify which managers are reachable
for manager_ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "Testing $manager_ip..."
nc -zv $manager_ip 2377 2>&1
done
# Step 2: Fix network connectivity between managers
# This is environment-specific: check firewalls, VPN, physical network
# Step 3: Once connectivity is restored, Raft will re-elect a leader
# This is automatic; wait 30-60 seconds
# Step 4: Verify
docker node ls
Recovery: Majority of Managers Lost
If a majority of managers have permanently failed (disk failure, terminated instances):
# LAST RESORT: Force a new cluster from a single manager
# This creates a new single-node Raft cluster from the surviving manager
# Run on the remaining manager node:
docker swarm init --force-new-cluster --advertise-addr 10.0.1.10
# Rejoin other managers
docker swarm join-token manager
# Use the token on new manager nodes
# Verify cluster health
docker node ls
Important:
`--force-new-cluster` preserves all services, networks, secrets, and configs from the existing Raft state on this node. It does not destroy data. However, it creates a single-manager cluster with no fault tolerance until you add more managers.
Problem: Node Availability Issues
# Check node status
docker node ls
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
# abc123 manager-01 Ready Active Leader
# def456 worker-01 Ready Active
# ghi789 worker-02 Down Active
# jkl012 worker-03 Ready Drain
# Node shows "Down" - the node cannot reach the manager
# Check on the node itself:
systemctl status docker
journalctl -u docker --since "1 hour ago"
# Common causes:
# - Docker daemon crashed
# - Network partition
# - Node ran out of disk space
# - Time sync issues (TLS cert validation fails)
Recovering a Down Node
# On the down node:
# Check Docker daemon
systemctl status docker
# Check disk space
df -h /var/lib/docker
# Check time sync
timedatectl status
chronyc tracking
# Restart Docker if needed
systemctl restart docker
# If node's Swarm state is corrupted:
docker swarm leave --force
# Then rejoin from a manager:
docker swarm join --token SWMTKN-xxx 10.0.1.10:2377
Problem: Raft Consensus Issues
# Check Raft status
docker info 2>/dev/null | grep -A10 "Swarm"
# Key fields to check:
# Is Manager: true
# Raft:
# Snapshot Interval: 10000
# Number of Old Snapshots to Retain: 0
# Heartbeat Tick: 1
# Election Tick: 10
# Check manager reachability
docker node inspect --format '{{.ManagerStatus.Reachability}}' $(docker node ls -q --filter role=manager)
# Should show "reachable" for all managers
Raft Log Compaction Issues
# If the Raft log grows too large (managers running low on disk)
# Check Raft data size
du -sh /var/lib/docker/swarm/raft/
# If the raft directory is huge, the cause is usually:
# 1. Too many service updates without cleanup
# 2. Large number of secrets/configs
# 3. High task churn
# Clean up services scaled to zero replicas (only if they are truly unused)
docker service ls --format '{{.Replicas}} {{.ID}}' | \
  awk '$1 == "0/0" {print $2}' | xargs -r docker service rm
# Force a Raft snapshot (reduces log size)
# Snapshots happen automatically every "Snapshot Interval" entries; you can
# also trigger one by restarting the Docker daemon on the current leader,
# which forces a re-election:
systemctl restart docker   # run on the leader node
Problem: Service Update Stuck or Failed
# Check update status
docker service inspect --pretty myapp_api | grep -A10 "UpdateStatus"
# If update is paused due to failure:
# Option 1: Fix the issue and retry
docker service update --force myapp_api
# Option 2: Roll back
docker service rollback myapp_api
# Option 3: Roll back to a specific image
docker service update --image myapp/api:v2.0.0 myapp_api
# Check why tasks are failing during update
docker service ps --no-trunc myapp_api | grep -v "Running"
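To keep future updates from appearing stuck, you can make rollback automatic instead of relying on the default pause-on-failure behavior. A stack-file sketch (values illustrative):

```yaml
# docker-compose.yml fragment: roll back automatically on failed updates
services:
  api:
    image: myapp/api:v2.1.0
    deploy:
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback   # default is "pause", which looks "stuck"
        monitor: 30s               # watch each new task this long for failure
      rollback_config:
        parallelism: 1
```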
Log Analysis
Docker daemon logs are the last line of defense when container-level and service-level diagnostics fail:
# View Docker daemon logs
# On systemd-based systems:
journalctl -u docker --since "30 minutes ago" --no-pager
# Filter for Swarm-related messages
journalctl -u docker | grep -i "swarm\|raft\|cluster"
# Filter for specific error levels
journalctl -u docker -p err --since "1 hour ago"
# Follow logs in real time
journalctl -u docker -f
Common Daemon Log Messages and Their Meaning
| Log Message | Meaning | Action |
|---|---|---|
| `raft: became follower` | This manager lost leadership (normal during rotation) | None if a new leader was elected |
| `raft: election timeout` | Cannot reach other managers for election | Check network between managers |
| `node is not a swarm manager` | A worker node was asked to perform a manager operation | Direct commands to a manager node |
| `failed to allocate gateway` | Network address pool exhausted | Prune unused networks, increase subnet size |
| `transport: dial` | Cannot establish TLS connection to another node | Check certificates, time sync, firewall |
Quick Troubleshooting Flowchart
1. Can you run `docker node ls`? No: quorum issue (see split-brain section). Yes: continue.
2. Does `docker service ls` show 0/N replicas? Yes: check task errors with `docker service ps --no-trunc`. No: continue.
3. Are tasks in "Pending" state? Yes: check constraints, resources, node availability. No: continue.
4. Are tasks starting and crashing? Yes: check `docker service logs` for application errors. No: continue.
5. Is the service running but unreachable? Yes: check network attachment, DNS, published ports. No: continue.
6. Is performance degraded? Yes: check `docker stats`, node resource utilization, network latency.
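The replica check in step 2 is easy to script. A minimal sketch that parses the `NAME x/y` lines emitted by `docker service ls --format '{{.Name}} {{.Replicas}}'` (the service names below are canned illustrative input):

```shell
# Flag services whose running replica count is below the desired count.
# Feed it the output of:
#   docker service ls --format '{{.Name}} {{.Replicas}}'
flag_degraded() {
  while read -r name replicas; do
    running=${replicas%%/*}     # "0/3" -> "0"
    desired=${replicas#*/}      # "0/3" -> "3"
    desired=${desired%% *}      # drop a "(max N per node)" suffix if present
    if [ "$running" -lt "$desired" ]; then
      echo "DEGRADED: $name ($replicas)"
    fi
  done
}

# Canned example; in production, pipe real `docker service ls` output in:
printf '%s\n' 'myapp_api 0/3' 'myapp_web 2/2' | flag_degraded
```

Wired into cron or a monitoring agent, this gives an early warning before users notice missing replicas.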
Conclusion
Swarm troubleshooting is methodical. Start with docker service ps --no-trunc for the error message, check node availability and resources, verify network connectivity, and inspect Docker daemon logs when all else fails. The commands in this guide cover the vast majority of production Swarm issues. Keep them in a runbook accessible to your on-call team.