High Availability for Self-Hosted Services: Eliminating Single Points of Failure
Every component in your infrastructure is a potential point of failure: the server itself, its power supply, the network switch, the disk, the operating system, even the Docker daemon. High availability (HA) is the practice of eliminating single points of failure so that the loss of any one component does not take down your services.
For most self-hosted environments, perfect HA is neither necessary nor practical. The goal is to identify which services are critical enough to justify redundancy and then apply the right level of protection. A personal blog can tolerate an hour of downtime. A family's Nextcloud server or a business application probably cannot.
HA Fundamentals
| Availability Target | Allowed Downtime/Year | Typical Use Case |
|---|---|---|
| 99% (two nines) | 3.65 days | Personal projects, internal tools |
| 99.9% (three nines) | 8.76 hours | Small business, homelab critical services |
| 99.99% (four nines) | 52.6 minutes | Production web applications |
| 99.999% (five nines) | 5.26 minutes | Financial systems, telecom |
Key insight: Each additional nine of availability approximately doubles the cost and complexity. For self-hosted infrastructure, 99.9% (three nines) is a realistic and worthwhile target for critical services. It requires two servers and some automation. Four nines requires significant investment in redundancy at every layer.
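The downtime figures in the table come from simple arithmetic on the unavailability fraction; a quick sketch using a 365-day year, as the table does:

```shell
# Allowed downtime per year for a given availability target (365-day year).
downtime_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.1f", (100 - a) / 100 * 365 * 24 * 60 }'
}

for target in 99 99.9 99.99 99.999; do
  echo "${target}% -> $(downtime_minutes "$target") minutes/year"
done
```

For example, 99.9% allows 525.6 minutes per year, the 8.76 hours in the table.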
Load Balancing with HAProxy
HAProxy is the industry standard for TCP/HTTP load balancing. It distributes traffic across multiple backend servers and automatically removes unhealthy backends from the rotation:
# /etc/haproxy/haproxy.cfg
global
    maxconn 4096
    log stdout format raw local0

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httplog
    option forwardfor
    log global

# Stats dashboard
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if TRUE

# HTTP frontend
frontend http_front
    bind *:80
    redirect scheme https code 301

# HTTPS frontend
frontend https_front
    bind *:443 ssl crt /etc/haproxy/certs/
    http-request set-header X-Forwarded-Proto https
    # Route based on hostname
    acl host_grafana hdr(host) -i grafana.example.com
    acl host_app hdr(host) -i app.example.com
    use_backend grafana_backend if host_grafana
    use_backend app_backend if host_app
    default_backend app_backend

# Backend with health checks
backend app_backend
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server app1 192.168.1.101:8080 check inter 5s fall 3 rise 2
    server app2 192.168.1.102:8080 check inter 5s fall 3 rise 2
    server app3 192.168.1.103:8080 check inter 5s fall 3 rise 2 backup  # only used when app1 and app2 are both down

backend grafana_backend
    balance roundrobin
    option httpchk GET /api/health
    cookie SERVERID insert indirect nocache
    server grafana1 192.168.1.101:3000 check cookie s1
    server grafana2 192.168.1.102:3000 check cookie s2
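Round-robin treats every backend as equal. If your servers differ in capacity, or connections are long-lived, per-server weights and the leastconn algorithm may fit better; a sketch using the same backend hosts as above:

```
backend app_backend
    balance leastconn                  # prefer the server with fewest active connections
    option httpchk GET /health
    http-check expect status 200
    server app1 192.168.1.101:8080 check weight 100
    server app2 192.168.1.102:8080 check weight 50   # receives roughly half of app1's share
```

Validate any configuration change with `haproxy -c -f /etc/haproxy/haproxy.cfg` before reloading.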
HAProxy in Docker
# docker-compose.yml
services:
  haproxy:
    image: haproxy:lts-alpine
    container_name: haproxy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
      - "8404:8404"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
      - ./certs:/etc/haproxy/certs:ro
    networks:
      - frontend

networks:
  frontend:
Virtual IP Failover with Keepalived
A load balancer is itself a single point of failure. Keepalived uses the VRRP protocol to share a virtual IP (VIP) between two or more servers. If the primary fails, the secondary takes over the VIP within seconds:
# Install Keepalived
sudo apt install -y keepalived
# /etc/keepalived/keepalived.conf (Primary - MASTER)
vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"   # exit 0 while an haproxy process exists
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_here
    }
    virtual_ipaddress {
        192.168.1.200/24
    }
    track_script {
        chk_haproxy
    }
}
# /etc/keepalived/keepalived.conf (Secondary - BACKUP)
# The vrrp_script must be defined on this node too, or track_script fails
vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_here
    }
    virtual_ipaddress {
        192.168.1.200/24
    }
    track_script {
        chk_haproxy
    }
}
Point your DNS records at 192.168.1.200 (the VIP). Both servers run HAProxy, but only the MASTER holds the VIP. If the MASTER fails, the BACKUP takes over the VIP within ~3 seconds.
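Keepalived can also run hooks on VRRP state transitions, which is useful for alerting or for promoting services when a node takes over. A sketch; the script paths are hypothetical, and the lines go inside the existing vrrp_instance block:

```
vrrp_instance VI_1 {
    # ... existing configuration ...
    notify_master "/etc/keepalived/on-master.sh"   # runs when this node acquires the VIP
    notify_backup "/etc/keepalived/on-backup.sh"   # runs when this node yields the VIP
    notify_fault  "/etc/keepalived/on-fault.sh"    # runs when the tracked script fails
}
```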
Database Replication
Databases are the most critical stateful component in any infrastructure. Running a single database server means a single disk failure can take down every service that depends on it.
PostgreSQL Streaming Replication
# On the PRIMARY server:
# postgresql.conf
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1024   # retain up to 1024 MB of WAL for replicas that fall behind
# pg_hba.conf - allow replication connections
host replication replicator 192.168.1.0/24 scram-sha-256
# Create replication user
sudo -u postgres psql -c \
"CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';"
# On the REPLICA server:
# Stop PostgreSQL, clear data directory
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*
# Take a base backup from the primary
sudo -u postgres pg_basebackup \
-h 192.168.1.110 \
-U replicator \
-D /var/lib/postgresql/16/main \
-Fp -Xs -P -R
# The -R flag creates standby.signal and sets primary_conninfo
# Start PostgreSQL on the replica
sudo systemctl start postgresql
# Verify replication status on the primary
sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"
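pg_stat_replication reports LSNs in hexadecimal hi/lo form; replication lag in bytes is the difference between the primary's current LSN and the replica's replay LSN. A small helper to compute it from two LSN strings (the sample values here are illustrative):

```shell
# Convert a PostgreSQL LSN ("hi/lo" in hex) to an absolute byte position.
lsn_to_bytes() {
  local hi=${1%/*} lo=${1#*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))   # hi * 2^32 + lo
}

primary_lsn="0/3000148"   # e.g. SELECT pg_current_wal_lsn();       on the primary
replica_lsn="0/3000060"   # e.g. SELECT pg_last_wal_replay_lsn();   on the replica
lag=$(( $(lsn_to_bytes "$primary_lsn") - $(lsn_to_bytes "$replica_lsn") ))
echo "replication lag: ${lag} bytes"
```

A lag that grows steadily, rather than hovering near zero, is the signal to investigate network throughput or replica I/O.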
MySQL/MariaDB Replication
# On the PRIMARY:
# /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = myapp
# Create replication user
CREATE USER 'replicator'@'192.168.1.%' IDENTIFIED BY 'secure_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'192.168.1.%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS; # Note the File and Position
# On the REPLICA:
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
CHANGE MASTER TO
MASTER_HOST='192.168.1.110',
MASTER_USER='replicator',
MASTER_PASSWORD='secure_password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=154;
START SLAVE;
SHOW SLAVE STATUS\G
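A healthy replica reports Slave_IO_Running and Slave_SQL_Running both as Yes. A small check that can feed an alert (a sketch; the inline status text stands in for real `SHOW SLAVE STATUS\G` output):

```shell
# Return success if a SHOW SLAVE STATUS\G dump reports both threads running.
replica_healthy() {
  echo "$1" | grep -q "Slave_IO_Running: Yes" &&
  echo "$1" | grep -q "Slave_SQL_Running: Yes"
}

# Illustrative dump; in practice: status=$(mysql -e "SHOW SLAVE STATUS\G")
status="Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0"

replica_healthy "$status" && echo "replica OK" || echo "replica BROKEN"
```

Also watch Seconds_Behind_Master: a replica whose threads run but whose lag keeps growing is failing silently.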
Shared Storage
When multiple servers need access to the same files (application uploads, shared configurations), you need shared storage:
| Solution | Type | Use Case | Complexity |
|---|---|---|---|
| NFS | Network filesystem | Simple shared storage | Low |
| GlusterFS | Distributed filesystem | Replicated storage across nodes | Medium |
| Ceph | Distributed object/block/file | Large-scale production | High |
| MinIO | S3-compatible object storage | Application storage (backups, uploads) | Low |
| Syncthing | File synchronization | Config sync between nodes | Low |
# Quick NFS setup for shared Docker volumes
# On the NFS server:
sudo apt install -y nfs-kernel-server
echo "/srv/nfs/shared 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" | \
sudo tee -a /etc/exports
sudo exportfs -ra
# On the Docker hosts:
sudo apt install -y nfs-common
sudo mount -t nfs 192.168.1.100:/srv/nfs/shared /mnt/shared
# Docker volume with NFS driver
docker volume create \
--driver local \
--opt type=nfs \
--opt o=addr=192.168.1.100,rw \
--opt device=:/srv/nfs/shared \
shared_data
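The same NFS-backed volume can be declared in a Compose file so every host creates it identically; a sketch using the server address and export from above:

```yaml
# docker-compose.yml fragment
volumes:
  shared_data:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.1.100,rw"
      device: ":/srv/nfs/shared"
```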
DNS Failover
Multiple A records for the same domain provide basic failover at the DNS level. Many managed DNS providers offer health-checked failover that automatically removes unhealthy endpoints:
# Multiple A records (DNS round-robin)
app.example.com. 300 IN A 192.168.1.101
app.example.com. 300 IN A 192.168.1.102
# With Cloudflare (health-checked failover):
# Primary: app.example.com -> 203.0.113.10 (active health check)
# Failover: app.example.com -> 203.0.113.20 (activated when primary fails)
# Low TTL (300 seconds) ensures clients pick up changes quickly
Health Checks
# Docker health check in Compose
services:
  app:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
# HAProxy health check configuration
# (the old "option httpchk GET /health HTTP/1.1\r\nHost:\ ..." form is
# deprecated since HAProxy 2.2; use http-check send for headers)
backend app_backend
    option httpchk GET /health
    http-check send hdr Host app.example.com
    http-check expect status 200
    server app1 192.168.1.101:8080 check inter 5s fall 3 rise 2
# Custom health check script
#!/bin/bash
# health-check.sh
services=("http://localhost:3000/api/health" "http://localhost:9090/-/healthy" "http://localhost:8080/health")
for url in "${services[@]}"; do
status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
if [ "$status" != "200" ]; then
echo "UNHEALTHY: $url returned $status"
# Send alert
fi
done
Graceful Degradation
Not every component failure should result in total service outage. Design your services to degrade gracefully:
- Cache layer down: Application falls back to direct database queries (slower but functional).
- Search service down: Disable search but keep the rest of the application running.
- Monitoring down: Services continue running; you lose visibility but not functionality.
- Backup service down: Alert but do not block application operation.
Design principle: Every external dependency should have a timeout, a circuit breaker, and a fallback. If a dependency is unavailable, the system should degrade to a reduced but functional state rather than failing entirely.
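The timeout-plus-fallback half of that principle can be sketched in a few lines of shell; the cache URL is hypothetical, and port 1 stands in for an unreachable cache:

```shell
# Try a dependency with a short timeout; fall back to a default on failure.
fetch_with_fallback() {
  local url=$1 fallback=$2
  curl -sf --max-time 2 "$url" 2>/dev/null || echo "$fallback"
}

# The cache is down (nothing listens on port 1), so the fallback is returned.
greeting=$(fetch_with_fallback "http://127.0.0.1:1/cache/greeting" "default-greeting")
echo "$greeting"
```

The same shape applies inside application code: bound every call, and decide in advance what the degraded answer is.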
Docker Swarm HA
Docker Swarm provides built-in high availability for containers across multiple nodes:
# Initialize Swarm on the first manager
docker swarm init --advertise-addr 192.168.1.101
# Add manager nodes (minimum 3 for HA)
docker swarm join-token manager
# Run the provided join command on server2 and server3
# Add worker nodes
docker swarm join-token worker
# Run the join command on worker nodes
# Deploy a service with replicas
docker service create \
--name myapp \
--replicas 3 \
--publish 8080:8080 \
--update-delay 10s \
--update-parallelism 1 \
--restart-condition any \
--restart-max-attempts 3 \
myapp:latest
# Check service status
docker service ls
docker service ps myapp
# Scale up or down
docker service scale myapp=5
# Rolling update
docker service update \
--image myapp:v2.0 \
--update-parallelism 1 \
--update-delay 30s \
myapp
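The same service is usually kept declaratively and deployed with docker stack deploy; a sketch mirroring the flags above (the file name is arbitrary):

```yaml
# stack.yml — deploy with: docker stack deploy -c stack.yml myapp
services:
  myapp:
    image: myapp:latest
    ports:
      - "8080:8080"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: any
        max_attempts: 3
```

Keeping the stack file in version control means a failed node can be replaced and the services redeployed from a known-good definition.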
Pacemaker and Corosync
For non-containerized services or bare-metal HA, Pacemaker (cluster resource manager) and Corosync (cluster communication) are the traditional Linux HA stack:
# Install on both nodes
sudo apt install -y pacemaker corosync pcs
# Set hacluster user password on both nodes
sudo passwd hacluster
# Authenticate nodes
sudo pcs host auth node1 node2 -u hacluster
# Create the cluster
sudo pcs cluster setup ha-cluster node1 node2
# Start the cluster
sudo pcs cluster start --all
sudo pcs cluster enable --all
# In a two-node lab without fencing hardware, disable STONITH so resources
# can start (configure proper fencing instead in production)
sudo pcs property set stonith-enabled=false
# Configure a virtual IP resource
sudo pcs resource create vip ocf:heartbeat:IPaddr2 \
ip=192.168.1.200 cidr_netmask=24 \
op monitor interval=30s
# Configure an Nginx resource
sudo pcs resource create webserver systemd:nginx \
op monitor interval=30s
# Ensure VIP and webserver run on the same node
sudo pcs constraint colocation add webserver with vip INFINITY
sudo pcs constraint order vip then webserver
# Check cluster status
sudo pcs status
HA Architecture for Self-Hosted
A practical HA architecture for a homelab or small business with two servers:
- Both servers run Docker with identical Compose stacks.
- Keepalived provides a VIP that floats between them.
- HAProxy on both servers load-balances to both backends.
- PostgreSQL with streaming replication (primary on server1, replica on server2).
- Shared storage via NFS or Syncthing for application data.
- Monitoring with Prometheus on both nodes, alerting when either fails.
With usulnet's multi-node architecture, you can manage containers across both servers from a single interface. The agent-based design means that even if the master node goes down, containers on agent nodes continue running uninterrupted. Combined with Keepalived and HAProxy, this provides a robust HA setup for self-hosted Docker infrastructure.