High Availability for Self-Hosted Services: Eliminating Single Points of Failure
Every component in your infrastructure is a potential point of failure: the server itself, its power supply, the network switch, the disk, the operating system, even the Docker daemon. High availability (HA) is the practice of eliminating single points of failure so that the loss of any one component does not take down your services.
For most self-hosted environments, perfect HA is neither necessary nor practical. The goal is to identify which services are critical enough to justify redundancy and then apply the right level of protection. A personal blog can tolerate an hour of downtime. A family's Nextcloud server or a business application probably cannot.
HA Fundamentals
| Availability Target | Allowed Downtime/Year | Typical Use Case |
|---|---|---|
| 99% (two nines) | 3.65 days | Personal projects, internal tools |
| 99.9% (three nines) | 8.76 hours | Small business, homelab critical services |
| 99.99% (four nines) | 52.6 minutes | Production web applications |
| 99.999% (five nines) | 5.26 minutes | Financial systems, telecom |
Key insight: Each additional nine of availability approximately doubles the cost and complexity. For self-hosted infrastructure, 99.9% (three nines) is a realistic and worthwhile target for critical services. It requires two servers and some automation. Four nines requires significant investment in redundancy at every layer.
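The downtime figures in the table come from simple arithmetic on the unavailability fraction; a quick sketch using a 365-day year, as the table does:

```shell
# Allowed downtime per year for a given availability target (365-day year).
downtime_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.1f", (100 - a) / 100 * 365 * 24 * 60 }'
}

for target in 99 99.9 99.99 99.999; do
  echo "${target}% -> $(downtime_minutes "$target") minutes/year"
done
```

For example, 99.9% allows 525.6 minutes per year, the 8.76 hours in the table.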
Load Balancing with HAProxy
HAProxy is the industry standard for TCP/HTTP load balancing. It distributes traffic across multiple backend servers and automatically removes unhealthy backends from the rotation:
# /etc/haproxy/haproxy.cfg
global
    maxconn 4096
    log stdout format raw local0

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httplog
    option forwardfor
    log global

# Stats dashboard
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if TRUE

# HTTP frontend
frontend http_front
    bind *:80
    redirect scheme https code 301

# HTTPS frontend
frontend https_front
    bind *:443 ssl crt /etc/haproxy/certs/
    http-request set-header X-Forwarded-Proto https
    # Route based on hostname
    acl host_grafana hdr(host) -i grafana.example.com
    acl host_app hdr(host) -i app.example.com
    use_backend grafana_backend if host_grafana
    use_backend app_backend if host_app
    default_backend app_backend

# Backend with health checks
backend app_backend
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server app1 192.168.1.101:8080 check inter 5s fall 3 rise 2
    server app2 192.168.1.102:8080 check inter 5s fall 3 rise 2
    server app3 192.168.1.103:8080 check inter 5s fall 3 rise 2 backup  # only used when app1 and app2 are both down

backend grafana_backend
    balance roundrobin
    option httpchk GET /api/health
    cookie SERVERID insert indirect nocache
    server grafana1 192.168.1.101:3000 check cookie s1
    server grafana2 192.168.1.102:3000 check cookie s2
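Round-robin treats every backend as equal. If your servers differ in capacity, or connections are long-lived, per-server weights and the leastconn algorithm may fit better; a sketch using the same backend hosts as above:

```
backend app_backend
    balance leastconn                  # prefer the server with fewest active connections
    option httpchk GET /health
    http-check expect status 200
    server app1 192.168.1.101:8080 check weight 100
    server app2 192.168.1.102:8080 check weight 50   # receives roughly half of app1's share
```

Validate any configuration change with `haproxy -c -f /etc/haproxy/haproxy.cfg` before reloading.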
HAProxy in Docker
# docker-compose.yml
services:
  haproxy:
    image: haproxy:lts-alpine
    container_name: haproxy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
      - "8404:8404"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
      - ./certs:/etc/haproxy/certs:ro
    networks:
      - frontend

networks:
  frontend:
Virtual IP Failover with Keepalived
A load balancer is itself a single point of failure. Keepalived uses the VRRP protocol to share a virtual IP (VIP) between two or more servers. If the primary fails, the secondary takes over the VIP within seconds:
# Install Keepalived
sudo apt install -y keepalived
# /etc/keepalived/keepalived.conf (Primary - MASTER)
vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"   # exit 0 while an haproxy process exists
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_here
    }
    virtual_ipaddress {
        192.168.1.200/24
    }
    track_script {
        chk_haproxy
    }
}
# /etc/keepalived/keepalived.conf (Secondary - BACKUP)
# The vrrp_script must be defined on this node too, or track_script fails
vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass your_secret_here
    }
    virtual_ipaddress {
        192.168.1.200/24
    }
    track_script {
        chk_haproxy
    }
}
Point your DNS records at 192.168.1.200 (the VIP). Both servers run HAProxy, but only the MASTER holds the VIP. If the MASTER fails, the BACKUP takes over the VIP within ~3 seconds.
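Keepalived can also run hooks on VRRP state transitions, which is useful for alerting or for promoting services when a node takes over. A sketch; the script paths are hypothetical, and the lines go inside the existing vrrp_instance block:

```
vrrp_instance VI_1 {
    # ... existing configuration ...
    notify_master "/etc/keepalived/on-master.sh"   # runs when this node acquires the VIP
    notify_backup "/etc/keepalived/on-backup.sh"   # runs when this node yields the VIP
    notify_fault  "/etc/keepalived/on-fault.sh"    # runs when the tracked script fails
}
```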
Database Replication
Databases are the most critical stateful component in any infrastructure. Running a single database server means a single disk failure can take down every service that depends on it.
PostgreSQL Streaming Replication
# On the PRIMARY server:
# postgresql.conf
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1024   # retain up to 1024 MB of WAL for replicas that fall behind
# pg_hba.conf - allow replication connections
host replication replicator 192.168.1.0/24 scram-sha-256
# Create replication user
sudo -u postgres psql -c \
"CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';"
# On the REPLICA server:
# Stop PostgreSQL, clear data directory
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*
# Take a base backup from the primary
sudo -u postgres pg_basebackup \
-h 192.168.1.110 \
-U replicator \
-D /var/lib/postgresql/16/main \
-Fp -Xs -P -R
# The -R flag creates standby.signal and sets primary_conninfo
# Start PostgreSQL on the replica
sudo systemctl start postgresql
# Verify replication status on the primary
sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"
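pg_stat_replication reports LSNs in hexadecimal hi/lo form; replication lag in bytes is the difference between the primary's current LSN and the replica's replay LSN. A small helper to compute it from two LSN strings (the sample values here are illustrative):

```shell
# Convert a PostgreSQL LSN ("hi/lo" in hex) to an absolute byte position.
lsn_to_bytes() {
  local hi=${1%/*} lo=${1#*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))   # hi * 2^32 + lo
}

primary_lsn="0/3000148"   # e.g. SELECT pg_current_wal_lsn();       on the primary
replica_lsn="0/3000060"   # e.g. SELECT pg_last_wal_replay_lsn();   on the replica
lag=$(( $(lsn_to_bytes "$primary_lsn") - $(lsn_to_bytes "$replica_lsn") ))
echo "replication lag: ${lag} bytes"
```

A lag that grows steadily, rather than hovering near zero, is the signal to investigate network throughput or replica I/O.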
MySQL/MariaDB Replication
# On the PRIMARY:
# /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = myapp
# Create replication user
CREATE USER 'replicator'@'192.168.1.%' IDENTIFIED BY 'secure_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'192.168.1.%';
FLUSH PRIVILEGES;
SHOW MASTER STATUS; # Note the File and Position
# On the REPLICA:
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
CHANGE MASTER TO
MASTER_HOST='192.168.1.110',
MASTER_USER='replicator',
MASTER_PASSWORD='secure_password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=154;
START SLAVE;
SHOW SLAVE STATUS\G
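A healthy replica reports Slave_IO_Running and Slave_SQL_Running both as Yes. A small check that can feed an alert (a sketch; the inline status text stands in for real `SHOW SLAVE STATUS\G` output):

```shell
# Return success if a SHOW SLAVE STATUS\G dump reports both threads running.
replica_healthy() {
  echo "$1" | grep -q "Slave_IO_Running: Yes" &&
  echo "$1" | grep -q "Slave_SQL_Running: Yes"
}

# Illustrative dump; in practice: status=$(mysql -e "SHOW SLAVE STATUS\G")
status="Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0"

replica_healthy "$status" && echo "replica OK" || echo "replica BROKEN"
```

Also watch Seconds_Behind_Master: a replica whose threads run but whose lag keeps growing is failing silently.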
Shared Storage
When multiple servers need access to the same files (application uploads, shared configurations), you need shared storage:
| Solution | Type | Use Case | Complexity |
|---|---|---|---|
| NFS | Network filesystem | Simple shared storage | Low |
| GlusterFS | Distributed filesystem | Replicated storage across nodes | Medium |
| Ceph | Distributed object/block/file | Large-scale production | High |
| MinIO | S3-compatible object storage | Application storage (backups, uploads) | Low |
| Syncthing | File synchronization | Config sync between nodes | Low |
# Quick NFS setup for shared Docker volumes
# On the NFS server:
sudo apt install -y nfs-kernel-server
echo "/srv/nfs/shared 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" | \
sudo tee -a /etc/exports
sudo exportfs -ra
# On the Docker hosts:
sudo apt install -y nfs-common
sudo mount -t nfs 192.168.1.100:/srv/nfs/shared /mnt/shared
# Docker volume with NFS driver
docker volume create \
--driver local \
--opt type=nfs \
--opt o=addr=192.168.1.100,rw \
--opt device=:/srv/nfs/shared \
shared_data
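The same NFS-backed volume can be declared in a Compose file so every host creates it identically; a sketch using the server address and export from above:

```yaml
# docker-compose.yml fragment
volumes:
  shared_data:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.1.100,rw"
      device: ":/srv/nfs/shared"
```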
DNS Failover
Multiple A records for the same domain provide basic failover at the DNS level. Many managed DNS providers offer health-checked failover that automatically removes unhealthy endpoints:
# Multiple A records (DNS round-robin)
app.example.com. 300 IN A 192.168.1.101
app.example.com. 300 IN A 192.168.1.102
# With Cloudflare (health-checked failover):
# Primary: app.example.com -> 203.0.113.10 (active health check)
# Failover: app.example.com -> 203.0.113.20 (activated when primary fails)
# Low TTL (300 seconds) ensures clients pick up changes quickly
Health Checks
# Docker health check in Compose
services:
  app:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
# HAProxy health check configuration
# (the old "option httpchk GET /health HTTP/1.1\r\nHost:\ ..." form is
# deprecated since HAProxy 2.2; use http-check send for headers)
backend app_backend
    option httpchk GET /health
    http-check send hdr Host app.example.com
    http-check expect status 200
    server app1 192.168.1.101:8080 check inter 5s fall 3 rise 2
# Custom health check script
#!/bin/bash
# health-check.sh
services=("http://localhost:3000/api/health" "http://localhost:9090/-/healthy" "http://localhost:8080/health")
for url in "${services[@]}"; do
status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
if [ "$status" != "200" ]; then
echo "UNHEALTHY: $url returned $status"
# Send alert
fi
done
Graceful Degradation
Not every component failure should result in total service outage. Design your services to degrade gracefully:
- Cache layer down: Application falls back to direct database queries (slower but functional).
- Search service down: Disable search but keep the rest of the application running.
- Monitoring down: Services continue running; you lose visibility but not functionality.
- Backup service down: Alert but do not block application operation.
Design principle: Every external dependency should have a timeout, a circuit breaker, and a fallback. If a dependency is unavailable, the system should degrade to a reduced but functional state rather than failing entirely.
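The timeout-plus-fallback half of that principle can be sketched in a few lines of shell; the cache URL is hypothetical, and port 1 stands in for an unreachable cache:

```shell
# Try a dependency with a short timeout; fall back to a default on failure.
fetch_with_fallback() {
  local url=$1 fallback=$2
  curl -sf --max-time 2 "$url" 2>/dev/null || echo "$fallback"
}

# The cache is down (nothing listens on port 1), so the fallback is returned.
greeting=$(fetch_with_fallback "http://127.0.0.1:1/cache/greeting" "default-greeting")
echo "$greeting"
```

The same shape applies inside application code: bound every call, and decide in advance what the degraded answer is.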
Docker Swarm HA
Docker Swarm provides built-in high availability for containers across multiple nodes:
# Initialize Swarm on the first manager
docker swarm init --advertise-addr 192.168.1.101
# Add manager nodes (minimum 3 for HA)
docker swarm join-token manager
# Run the provided join command on server2 and server3
# Add worker nodes
docker swarm join-token worker
# Run the join command on worker nodes
# Deploy a service with replicas
docker service create \
--name myapp \
--replicas 3 \
--publish 8080:8080 \
--update-delay 10s \
--update-parallelism 1 \
--restart-condition any \
--restart-max-attempts 3 \
myapp:latest
# Check service status
docker service ls
docker service ps myapp
# Scale up or down
docker service scale myapp=5
# Rolling update
docker service update \
--image myapp:v2.0 \
--update-parallelism 1 \
--update-delay 30s \
myapp
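The same service is usually kept declaratively and deployed with docker stack deploy; a sketch mirroring the flags above (the file name is arbitrary):

```yaml
# stack.yml — deploy with: docker stack deploy -c stack.yml myapp
services:
  myapp:
    image: myapp:latest
    ports:
      - "8080:8080"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: any
        max_attempts: 3
```

Keeping the stack file in version control means a failed node can be replaced and the services redeployed from a known-good definition.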
Pacemaker and Corosync
For non-containerized services or bare-metal HA, Pacemaker (cluster resource manager) and Corosync (cluster communication) are the traditional Linux HA stack:
# Install on both nodes
sudo apt install -y pacemaker corosync pcs
# Set hacluster user password on both nodes
sudo passwd hacluster
# Authenticate nodes
sudo pcs host auth node1 node2 -u hacluster
# Create the cluster
sudo pcs cluster setup ha-cluster node1 node2
# Start the cluster
sudo pcs cluster start --all
sudo pcs cluster enable --all
# In a two-node lab without fencing hardware, disable STONITH so resources
# can start (configure proper fencing instead in production)
sudo pcs property set stonith-enabled=false
# Configure a virtual IP resource
sudo pcs resource create vip ocf:heartbeat:IPaddr2 \
ip=192.168.1.200 cidr_netmask=24 \
op monitor interval=30s
# Configure an Nginx resource
sudo pcs resource create webserver systemd:nginx \
op monitor interval=30s
# Ensure VIP and webserver run on the same node
sudo pcs constraint colocation add webserver with vip INFINITY
sudo pcs constraint order vip then webserver
# Check cluster status
sudo pcs status
HA Architecture for Self-Hosted
A practical HA architecture for a homelab or small business with two servers:
- Both servers run Docker with identical Compose stacks.
- Keepalived provides a VIP that floats between them.
- HAProxy on both servers load-balances to both backends.
- PostgreSQL with streaming replication (primary on server1, replica on server2).
- Shared storage via NFS or Syncthing for application data.
- Monitoring with Prometheus on both nodes, alerting when either fails.
With usulnet's multi-node architecture, you can manage containers across both servers from a single interface. The agent-based design means that even if the master node goes down, containers on agent nodes continue running uninterrupted. Combined with Keepalived and HAProxy, this provides a robust HA setup for self-hosted Docker infrastructure.