Disaster Recovery Planning: Preparing for the Worst in Your Infrastructure
Disaster recovery is not about preventing disasters -- that is what redundancy, monitoring, and good practices are for. Disaster recovery is about what happens after things go wrong despite your best efforts. The server room floods. Ransomware encrypts your drives. A botched migration drops the wrong database. An upstream provider goes bankrupt overnight.
The difference between an inconvenient afternoon and a catastrophic, unrecoverable data loss is whether you planned for these scenarios before they happened. This guide walks you through building a disaster recovery plan that actually works when you need it.
RTO and RPO: Defining Your Requirements
Every DR plan starts with two numbers:
- RTO (Recovery Time Objective): How long can your services be down before the impact becomes unacceptable? This determines how fast you need to recover.
- RPO (Recovery Point Objective): How much data can you afford to lose? If your RPO is 1 hour, your backups must run at least hourly. If your RPO is zero, you need real-time replication.
| Service | RTO | RPO | Recovery Strategy |
|---|---|---|---|
| DNS / Pi-hole | 15 minutes | 24 hours | Spare Pi-hole on standby, config in Git |
| Reverse proxy | 30 minutes | N/A (config in Git) | Rebuild from Compose + config |
| PostgreSQL database | 1 hour | 1 hour | Hourly dumps + streaming replication |
| Nextcloud | 4 hours | 24 hours | Daily backup + volume restore |
| Monitoring stack | 24 hours | 7 days | Rebuild from Compose, import dashboards |
| Media server | 48 hours | N/A (re-downloadable) | Rebuild from scratch |
Start with the critical path. Identify the minimum set of services that must be running for your infrastructure to be useful. For most self-hosted setups, that is DNS, reverse proxy, and the primary database. Everything else can wait.
Disaster Scenarios
A good DR plan considers specific scenarios, not abstract risks. For each scenario, document the detection method, impact, and recovery procedure:
Scenario 1: Disk Failure
- Detection: SMART alerts, I/O errors in dmesg, monitoring alerts
- Impact: Total data loss on affected disk
- Recovery: Replace disk, restore from latest backup, verify data integrity
Scenario 2: Ransomware / Compromise
- Detection: Encrypted files, unusual processes, alerts from security scanning
- Impact: All accessible data encrypted or destroyed
- Recovery: Isolate affected systems, wipe and reinstall from clean media, restore from off-site immutable backups, rotate all credentials
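The "immutable" part is what makes this recovery possible: if ransomware on a backed-up host can reach and delete the backups, they are not a recovery path. One way to get this with restic is a rest-server running in append-only mode (host and paths are illustrative):

```shell
# rest-server in append-only mode: clients can add new snapshots but cannot
# delete or overwrite existing data, so a compromised host cannot destroy
# backup history. Paths and host name are illustrative.
rest-server --path /srv/restic-data --append-only --listen :8000

# Clients back up against it as usual:
# restic -r rest:http://backup-host:8000/ backup /opt/docker
```

Pruning old snapshots then has to happen from the backup server itself, which is exactly the point.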
Scenario 3: Accidental Deletion
- Detection: User reports, monitoring gaps, missing containers
- Impact: Loss of specific service or dataset
- Recovery: Restore specific files or volumes from backup
Scenario 4: Infrastructure Provider Failure
- Detection: Unreachable servers, provider status page
- Impact: Complete unavailability of cloud-hosted services
- Recovery: Spin up replacement infrastructure, restore from off-provider backups
Scenario 5: Physical Disaster (Fire, Flood)
- Detection: Physical damage, total connectivity loss
- Impact: Complete loss of on-site hardware and local backups
- Recovery: Procure new hardware, restore from off-site backups
Documentation: The Recovery Runbook
A runbook is a step-by-step procedure for recovering each critical service. It must be written clearly enough that someone unfamiliar with the system can follow it under stress at 3 AM:
# Runbook: PostgreSQL Database Recovery
# Last tested: 2025-03-15
# Estimated recovery time: 45 minutes
## Prerequisites
- Access to backup storage (B2 credentials in Bitwarden vault)
- Fresh Debian 12 server or existing Docker host
- restic password (in Bitwarden vault under "Backup Encryption")
## Step 1: Locate the Latest Backup
restic snapshots \
    --repo s3:s3.us-west-000.backblazeb2.com/homelab-backups \
    --tag postgres \
    --latest 5
## Step 2: Restore the Backup
mkdir -p /tmp/pg-restore
restic restore latest \
    --repo s3:s3.us-west-000.backblazeb2.com/homelab-backups \
    --tag postgres \
    --target /tmp/pg-restore \
    --include "/opt/docker/db-dumps/"
## Step 3: Start a Fresh PostgreSQL Container
docker compose -f /opt/docker/postgres/docker-compose.yml up -d
# Wait for PostgreSQL to be ready
until docker exec postgres pg_isready; do sleep 2; done
## Step 4: Restore the Database Dump
gunzip -c /tmp/pg-restore/opt/docker/db-dumps/postgres_latest.sql.gz | \
    docker exec -i postgres psql -U postgres
## Step 5: Verify Restoration
docker exec postgres psql -U postgres -c \
    "SELECT schemaname, tablename FROM pg_tables WHERE schemaname='public';"
# Expected output: list of application tables
# Verify row counts for critical tables:
docker exec postgres psql -U postgres -d myapp -c \
    "SELECT 'users' as table_name, count(*) FROM users
     UNION ALL
     SELECT 'orders', count(*) FROM orders;"
## Step 6: Update Application Connection Strings
# If the database server IP changed, update .env files:
# /opt/docker/app/.env -> DATABASE_URL
# Then restart dependent services:
docker compose -f /opt/docker/app/docker-compose.yml restart
## Step 7: Verify Application Functionality
curl -s http://localhost:8080/health | jq .
# Expected: {"status": "ok", "database": "connected"}
Backup Verification
A DR plan built on untested backups is not a plan. It is a gamble. Automate regular verification:
#!/bin/bash
# dr-verification.sh - Monthly disaster recovery verification
set -euo pipefail

# Adjust these to your environment. restic reads RESTIC_PASSWORD or
# RESTIC_PASSWORD_FILE from the environment.
REPO="s3:s3.us-west-000.backblazeb2.com/homelab-backups"
OFFSITE_REPO="sftp:backup@offsite:/srv/restic"   # example value -- set to your off-site repo
HEALTHCHECK_URL="https://hc-ping.com/your-uuid"  # example value -- set to your healthcheck endpoint

LOG="/var/log/dr-verification.log"
ERRORS=0

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG"; }

# Test 1: Verify backup freshness
verify_backup_freshness() {
    log "=== Test 1: Backup Freshness ==="
    local latest today age_days
    latest=$(restic snapshots --repo "$REPO" --latest 1 --json | \
        jq -r '.[0].time' | cut -d'T' -f1)
    today=$(date +%Y-%m-%d)
    age_days=$(( ($(date -d "$today" +%s) - $(date -d "$latest" +%s)) / 86400 ))
    if [ "$age_days" -gt 2 ]; then
        log "FAIL: Latest backup is $age_days days old (expected < 2)"
        ERRORS=$((ERRORS + 1))
    else
        log "PASS: Latest backup is $age_days days old"
    fi
}

# Test 2: Verify backup integrity (reads a random 10% sample of the data)
verify_backup_integrity() {
    log "=== Test 2: Backup Integrity ==="
    if restic check --repo "$REPO" --read-data-subset=10%; then
        log "PASS: Backup integrity verified"
    else
        log "FAIL: Backup integrity check failed"
        ERRORS=$((ERRORS + 1))
    fi
}

# Test 3: Test full restore
verify_restore() {
    log "=== Test 3: Full Restore Test ==="
    local restore_dir="/tmp/dr-test-$(date +%s)"
    mkdir -p "$restore_dir"
    # Guard the restore so a failure is counted instead of aborting under set -e
    if ! restic restore latest --repo "$REPO" --target "$restore_dir" --include "/opt/docker"; then
        log "FAIL: Restore command failed"
        ERRORS=$((ERRORS + 1))
        rm -rf "$restore_dir"
        return
    fi
    # Verify critical files exist
    local critical_files=(
        "opt/docker/postgres/docker-compose.yml"
        "opt/docker/db-dumps"
        "opt/docker/traefik/docker-compose.yml"
    )
    local f
    for f in "${critical_files[@]}"; do
        if [ -e "$restore_dir/$f" ]; then
            log "PASS: Found $f"
        else
            log "FAIL: Missing $f"
            ERRORS=$((ERRORS + 1))
        fi
    done
    # Test database dump integrity (nullglob: skip cleanly if no dumps matched)
    shopt -s nullglob
    local dump
    for dump in "$restore_dir"/opt/docker/db-dumps/*.sql.gz; do
        if gunzip -t "$dump" 2>/dev/null; then
            log "PASS: $(basename "$dump") integrity OK"
        else
            log "FAIL: $(basename "$dump") is corrupted"
            ERRORS=$((ERRORS + 1))
        fi
    done
    shopt -u nullglob
    rm -rf "$restore_dir"
}

# Test 4: Verify off-site backup
verify_offsite() {
    log "=== Test 4: Off-site Backup ==="
    if restic snapshots --repo "$OFFSITE_REPO" --latest 1 > /dev/null 2>&1; then
        log "PASS: Off-site backup accessible"
    else
        log "FAIL: Off-site backup not accessible"
        ERRORS=$((ERRORS + 1))
    fi
}

# Run all tests
verify_backup_freshness
verify_backup_integrity
verify_restore
verify_offsite

log "=== DR Verification Complete: $ERRORS errors ==="
if [ "$ERRORS" -gt 0 ]; then
    # Signal failure to the healthcheck endpoint
    curl -fsS "$HEALTHCHECK_URL/fail" -d "DR verification: $ERRORS errors"
    exit 1
else
    curl -fsS "$HEALTHCHECK_URL"
fi
Failover Procedures
Document exactly how to fail over each critical service. There are three types of failover:
| Type | Automation | Downtime | Example |
|---|---|---|---|
| Automatic | Fully automated | Seconds to minutes | Keepalived VIP failover |
| Semi-automatic | One command to execute | Minutes | Promote DB replica, update DNS |
| Manual | Follow runbook steps | Hours | Restore from backup to new hardware |
#!/bin/bash
# failover-postgres.sh - semi-automatic failover for PostgreSQL
set -euo pipefail
REPLICA_HOST="192.168.1.111"
REPLICA_USER="admin"
echo "=== PostgreSQL Failover ==="
echo "This will promote the replica at $REPLICA_HOST to primary."
echo "The current primary will be DISCONNECTED."
read -p "Continue? (yes/no): " confirm
[ "$confirm" = "yes" ] || exit 1
# Step 1: Promote replica to primary
# (stop or fence the old primary first if it is still running,
#  otherwise you risk split-brain with two writable primaries)
ssh "$REPLICA_USER@$REPLICA_HOST" \
    "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main"
echo "Replica promoted. Waiting for promotion..."
sleep 5
# Step 2: Verify promotion
ssh "$REPLICA_USER@$REPLICA_HOST" \
    "sudo -u postgres psql -c 'SELECT pg_is_in_recovery();'"
# Should return 'f' (false = not in recovery = is primary)
# Step 3: Update application connection strings
# Update .env files or DNS to point to new primary
echo "Update DATABASE_HOST to $REPLICA_HOST in your .env files"
echo "Then restart application containers"
echo "=== Failover Complete ==="
Communication Plan
During a disaster, clear communication is as important as technical recovery. Document:
- Who needs to know: Users, stakeholders, team members.
- How to reach them: Email, Slack, phone. Have backup communication channels.
- What to communicate: Status updates at regular intervals, estimated recovery time, workarounds.
- Status page: A simple static page hosted externally (e.g., GitHub Pages) that you can update during outages.
Testing Your DR Plan
A DR plan that has not been tested is a hypothesis, not a plan. Schedule regular DR tests:
- Monthly: Automated backup verification (the script above).
- Quarterly: Restore a single service from backup to a test environment. Time it.
- Annually: Full DR test. Pretend your primary server is gone. Recover everything from scratch on a fresh machine using only your backups and runbooks.
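The monthly tier is the easy one to make automatic. A crontab sketch, assuming the verification script is installed at /usr/local/bin/dr-verification.sh:

```shell
# Run the automated DR verification at 03:00 on the 1st of each month.
# The quarterly and annual tests stay on the calendar -- they need a human.
0 3 1 * * /usr/local/bin/dr-verification.sh >> /var/log/dr-verification.log 2>&1
```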
Cold, Warm, and Hot Sites
| Type | Definition | Recovery Time | Cost |
|---|---|---|---|
| Cold site | Backup data exists, no standby infrastructure | Hours to days | Low (just backup storage) |
| Warm site | Standby server with recent backups, needs manual activation | 30 minutes to 2 hours | Medium (standby hardware + sync) |
| Hot site | Live replica with automatic failover | Seconds to minutes | High (duplicate infrastructure) |
For most self-hosted setups, a warm site offers the best balance: a second, smaller server that receives daily backups and can be promoted to primary within an hour. Combined with infrastructure-as-code, the recovery process becomes: restore backups, run the Ansible playbook, update DNS.
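The warm-site activation described above reduces to a short, rehearsable sequence. A sketch, where the playbook name and DNS step are illustrative:

```shell
#!/bin/bash
# activate-warm-site.sh - promote the standby server to primary
# The playbook name and DNS step are illustrative; adapt to your setup.
set -euo pipefail

REPO="s3:s3.us-west-000.backblazeb2.com/homelab-backups"

# 1. Pull the latest backups onto the standby
restic restore latest --repo "$REPO" --target / --include "/opt/docker"

# 2. Converge the host with infrastructure-as-code
ansible-playbook -i localhost, site.yml --connection=local

# 3. Point clients at the standby (manually or via your DNS provider's API)
echo "Update DNS records to point at this host, then verify service health."
```

Because every step is scripted or documented, the quarterly restore test can time this exact sequence and compare it against the RTO targets.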
Lessons from Real Outages
- GitLab.com (2017): An engineer accidentally deleted the wrong PostgreSQL data directory during maintenance, and all five of their backup and replication mechanisms turned out to be broken or missing. They recovered from a six-hour-old LVM snapshot that happened to exist. Lesson: test your backups, and have multiple independent backup methods.
- OVH Strasbourg (2021): A fire destroyed an entire datacenter. Customers who stored backups in the same datacenter lost everything. Lesson: off-site means genuinely off-site, not "on a different server in the same building."
- Rackspace Exchange (2022): A ransomware attack on hosted Exchange servers caused weeks of downtime. Some customer data was permanently lost. Lesson: even managed services need independent backups.
The common thread: Every major outage post-mortem reveals the same pattern -- the backup or failover mechanism was assumed to work but had not been tested recently. The organizations that recovered quickly were the ones that had practiced.
DR Plan Checklist
- Define RTO/RPO for every critical service.
- Document specific disaster scenarios and recovery procedures.
- Write runbooks with step-by-step commands for recovering each service.
- Automate backup verification with monthly testing.
- Maintain off-site, immutable backups that survive ransomware.
- Document the recovery order (DNS first, then proxy, then databases, then applications).
- Store credentials for recovery in a separate, accessible location (password manager, encrypted USB).
- Schedule quarterly restore tests.
- Conduct an annual full DR exercise.
- Update the plan after every test and every real incident.
With tools like usulnet providing centralized visibility into your Docker infrastructure across multiple hosts, you gain better awareness of what needs protecting. The built-in backup management features help ensure that every critical volume is backed up and that backup jobs are running on schedule -- reducing the gap between your actual RPO and your target RPO.