Disaster Recovery Planning: Preparing for the Worst in Your Infrastructure
Disaster recovery is not about preventing disasters -- that is what redundancy, monitoring, and good practices are for. Disaster recovery is about what happens after things go wrong despite your best efforts. The server room floods. Ransomware encrypts your drives. A botched migration drops the wrong database. An upstream provider goes bankrupt overnight.
The difference between an inconvenient afternoon and a catastrophic, unrecoverable data loss is whether you planned for these scenarios before they happened. This guide walks you through building a disaster recovery plan that actually works when you need it.
RTO and RPO: Defining Your Requirements
Every DR plan starts with two numbers:
- RTO (Recovery Time Objective): How long can your services be down before the impact becomes unacceptable? This determines how fast you need to recover.
- RPO (Recovery Point Objective): How much data can you afford to lose? If your RPO is 1 hour, your backups must run at least hourly. If your RPO is zero, you need real-time replication.
| Service | RTO | RPO | Recovery Strategy |
|---|---|---|---|
| DNS / Pi-hole | 15 minutes | 24 hours | Spare Pi-hole on standby, config in Git |
| Reverse proxy | 30 minutes | N/A (config in Git) | Rebuild from Compose + config |
| PostgreSQL database | 1 hour | 1 hour | Hourly dumps + streaming replication |
| Nextcloud | 4 hours | 24 hours | Daily backup + volume restore |
| Monitoring stack | 24 hours | 7 days | Rebuild from Compose, import dashboards |
| Media server | 48 hours | N/A (re-downloadable) | Rebuild from scratch |
Start with the critical path. Identify the minimum set of services that must be running for your infrastructure to be useful. For most self-hosted setups, that is DNS, reverse proxy, and the primary database. Everything else can wait.
Disaster Scenarios
A good DR plan considers specific scenarios, not abstract risks. For each scenario, document the detection method, impact, and recovery procedure:
Scenario 1: Disk Failure
- Detection: SMART alerts, I/O errors in dmesg, monitoring alerts
- Impact: Total data loss on affected disk
- Recovery: Replace disk, restore from latest backup, verify data integrity
Scenario 2: Ransomware / Compromise
- Detection: Encrypted files, unusual processes, alerts from security scanning
- Impact: All accessible data encrypted or destroyed
- Recovery: Isolate affected systems, wipe and reinstall from clean media, restore from off-site immutable backups, rotate all credentials
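The "immutable" part is what makes this recovery possible: if ransomware on a backed-up host can reach and delete the backups, they are not a recovery path. One way to get this with restic is a rest-server running in append-only mode (host and paths are illustrative):

```shell
# rest-server in append-only mode: clients can add new snapshots but cannot
# delete or overwrite existing data, so a compromised host cannot destroy
# backup history. Paths and host name are illustrative.
rest-server --path /srv/restic-data --append-only --listen :8000

# Clients back up against it as usual:
# restic -r rest:http://backup-host:8000/ backup /opt/docker
```

Pruning old snapshots then has to happen from the backup server itself, which is exactly the point.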
Scenario 3: Accidental Deletion
- Detection: User reports, monitoring gaps, missing containers
- Impact: Loss of specific service or dataset
- Recovery: Restore specific files or volumes from backup
Scenario 4: Infrastructure Provider Failure
- Detection: Unreachable servers, provider status page
- Impact: Complete unavailability of cloud-hosted services
- Recovery: Spin up replacement infrastructure, restore from off-provider backups
Scenario 5: Physical Disaster (Fire, Flood)
- Detection: Physical damage, total connectivity loss
- Impact: Complete loss of on-site hardware and local backups
- Recovery: Procure new hardware, restore from off-site backups
Documentation: The Recovery Runbook
A runbook is a step-by-step procedure for recovering each critical service. It must be written clearly enough that someone unfamiliar with the system can follow it under stress at 3 AM:
# Runbook: PostgreSQL Database Recovery
# Last tested: 2025-03-15
# Estimated recovery time: 45 minutes
## Prerequisites
- Access to backup storage (B2 credentials in Bitwarden vault)
- Fresh Debian 12 server or existing Docker host
- restic password (in Bitwarden vault under "Backup Encryption")
## Step 1: Locate the Latest Backup
restic snapshots \
    --repo s3:s3.us-west-000.backblazeb2.com/homelab-backups \
    --tag postgres \
    --latest 5
## Step 2: Restore the Backup
mkdir -p /tmp/pg-restore
restic restore latest \
    --repo s3:s3.us-west-000.backblazeb2.com/homelab-backups \
    --tag postgres \
    --target /tmp/pg-restore \
    --include "/opt/docker/db-dumps/"
## Step 3: Start a Fresh PostgreSQL Container
docker compose -f /opt/docker/postgres/docker-compose.yml up -d
# Wait for PostgreSQL to be ready
until docker exec postgres pg_isready; do sleep 2; done
## Step 4: Restore the Database Dump
gunzip -c /tmp/pg-restore/opt/docker/db-dumps/postgres_latest.sql.gz | \
    docker exec -i postgres psql -U postgres
## Step 5: Verify Restoration
docker exec postgres psql -U postgres -c \
    "SELECT schemaname, tablename FROM pg_tables WHERE schemaname='public';"
# Expected output: list of application tables
# Verify row counts for critical tables:
docker exec postgres psql -U postgres -d myapp -c \
    "SELECT 'users' as table_name, count(*) FROM users
     UNION ALL
     SELECT 'orders', count(*) FROM orders;"
## Step 6: Update Application Connection Strings
# If the database server IP changed, update .env files:
# /opt/docker/app/.env -> DATABASE_URL
# Then restart dependent services:
docker compose -f /opt/docker/app/docker-compose.yml restart
## Step 7: Verify Application Functionality
curl -s http://localhost:8080/health | jq .
# Expected: {"status": "ok", "database": "connected"}
Backup Verification
A DR plan built on untested backups is not a plan. It is a gamble. Automate regular verification:
#!/bin/bash
# dr-verification.sh - Monthly disaster recovery verification
set -euo pipefail

# Adjust these to your environment. restic reads RESTIC_PASSWORD or
# RESTIC_PASSWORD_FILE from the environment.
REPO="s3:s3.us-west-000.backblazeb2.com/homelab-backups"
OFFSITE_REPO="sftp:backup@offsite:/srv/restic"   # example value -- set to your off-site repo
HEALTHCHECK_URL="https://hc-ping.com/your-uuid"  # example value -- set to your healthcheck endpoint

LOG="/var/log/dr-verification.log"
ERRORS=0

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG"; }

# Test 1: Verify backup freshness
verify_backup_freshness() {
    log "=== Test 1: Backup Freshness ==="
    local latest today age_days
    latest=$(restic snapshots --repo "$REPO" --latest 1 --json | \
        jq -r '.[0].time' | cut -d'T' -f1)
    today=$(date +%Y-%m-%d)
    age_days=$(( ($(date -d "$today" +%s) - $(date -d "$latest" +%s)) / 86400 ))
    if [ "$age_days" -gt 2 ]; then
        log "FAIL: Latest backup is $age_days days old (expected < 2)"
        ERRORS=$((ERRORS + 1))
    else
        log "PASS: Latest backup is $age_days days old"
    fi
}

# Test 2: Verify backup integrity (reads a random 10% sample of the data)
verify_backup_integrity() {
    log "=== Test 2: Backup Integrity ==="
    if restic check --repo "$REPO" --read-data-subset=10%; then
        log "PASS: Backup integrity verified"
    else
        log "FAIL: Backup integrity check failed"
        ERRORS=$((ERRORS + 1))
    fi
}

# Test 3: Test full restore
verify_restore() {
    log "=== Test 3: Full Restore Test ==="
    local restore_dir="/tmp/dr-test-$(date +%s)"
    mkdir -p "$restore_dir"
    # Guard the restore so a failure is counted instead of aborting under set -e
    if ! restic restore latest --repo "$REPO" --target "$restore_dir" --include "/opt/docker"; then
        log "FAIL: Restore command failed"
        ERRORS=$((ERRORS + 1))
        rm -rf "$restore_dir"
        return
    fi
    # Verify critical files exist
    local critical_files=(
        "opt/docker/postgres/docker-compose.yml"
        "opt/docker/db-dumps"
        "opt/docker/traefik/docker-compose.yml"
    )
    local f
    for f in "${critical_files[@]}"; do
        if [ -e "$restore_dir/$f" ]; then
            log "PASS: Found $f"
        else
            log "FAIL: Missing $f"
            ERRORS=$((ERRORS + 1))
        fi
    done
    # Test database dump integrity (nullglob: skip cleanly if no dumps matched)
    shopt -s nullglob
    local dump
    for dump in "$restore_dir"/opt/docker/db-dumps/*.sql.gz; do
        if gunzip -t "$dump" 2>/dev/null; then
            log "PASS: $(basename "$dump") integrity OK"
        else
            log "FAIL: $(basename "$dump") is corrupted"
            ERRORS=$((ERRORS + 1))
        fi
    done
    shopt -u nullglob
    rm -rf "$restore_dir"
}

# Test 4: Verify off-site backup
verify_offsite() {
    log "=== Test 4: Off-site Backup ==="
    if restic snapshots --repo "$OFFSITE_REPO" --latest 1 > /dev/null 2>&1; then
        log "PASS: Off-site backup accessible"
    else
        log "FAIL: Off-site backup not accessible"
        ERRORS=$((ERRORS + 1))
    fi
}

# Run all tests
verify_backup_freshness
verify_backup_integrity
verify_restore
verify_offsite

log "=== DR Verification Complete: $ERRORS errors ==="
if [ "$ERRORS" -gt 0 ]; then
    # Signal failure to the healthcheck endpoint
    curl -fsS "$HEALTHCHECK_URL/fail" -d "DR verification: $ERRORS errors"
    exit 1
else
    curl -fsS "$HEALTHCHECK_URL"
fi
Failover Procedures
Document exactly how to fail over each critical service. There are three types of failover:
| Type | Automation | Downtime | Example |
|---|---|---|---|
| Automatic | Fully automated | Seconds to minutes | Keepalived VIP failover |
| Semi-automatic | One command to execute | Minutes | Promote DB replica, update DNS |
| Manual | Follow runbook steps | Hours | Restore from backup to new hardware |
#!/bin/bash
# failover-postgres.sh - semi-automatic failover for PostgreSQL
set -euo pipefail
REPLICA_HOST="192.168.1.111"
REPLICA_USER="admin"
echo "=== PostgreSQL Failover ==="
echo "This will promote the replica at $REPLICA_HOST to primary."
echo "The current primary will be DISCONNECTED."
read -p "Continue? (yes/no): " confirm
[ "$confirm" = "yes" ] || exit 1
# Step 1: Promote replica to primary
# (stop or fence the old primary first if it is still running,
#  otherwise you risk split-brain with two writable primaries)
ssh "$REPLICA_USER@$REPLICA_HOST" \
    "sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main"
echo "Replica promoted. Waiting for promotion..."
sleep 5
# Step 2: Verify promotion
ssh "$REPLICA_USER@$REPLICA_HOST" \
    "sudo -u postgres psql -c 'SELECT pg_is_in_recovery();'"
# Should return 'f' (false = not in recovery = is primary)
# Step 3: Update application connection strings
# Update .env files or DNS to point to new primary
echo "Update DATABASE_HOST to $REPLICA_HOST in your .env files"
echo "Then restart application containers"
echo "=== Failover Complete ==="
Communication Plan
During a disaster, clear communication is as important as technical recovery. Document:
- Who needs to know: Users, stakeholders, team members.
- How to reach them: Email, Slack, phone. Have backup communication channels.
- What to communicate: Status updates at regular intervals, estimated recovery time, workarounds.
- Status page: A simple static page hosted externally (e.g., GitHub Pages) that you can update during outages.
Testing Your DR Plan
A DR plan that has not been tested is a hypothesis, not a plan. Schedule regular DR tests:
- Monthly: Automated backup verification (the script above).
- Quarterly: Restore a single service from backup to a test environment. Time it.
- Annually: Full DR test. Pretend your primary server is gone. Recover everything from scratch on a fresh machine using only your backups and runbooks.
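The monthly tier is the easy one to make automatic. A crontab sketch, assuming the verification script is installed at /usr/local/bin/dr-verification.sh:

```shell
# Run the automated DR verification at 03:00 on the 1st of each month.
# The quarterly and annual tests stay on the calendar -- they need a human.
0 3 1 * * /usr/local/bin/dr-verification.sh >> /var/log/dr-verification.log 2>&1
```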
Cold, Warm, and Hot Sites
| Type | Definition | Recovery Time | Cost |
|---|---|---|---|
| Cold site | Backup data exists, no standby infrastructure | Hours to days | Low (just backup storage) |
| Warm site | Standby server with recent backups, needs manual activation | 30 minutes to 2 hours | Medium (standby hardware + sync) |
| Hot site | Live replica with automatic failover | Seconds to minutes | High (duplicate infrastructure) |
For most self-hosted setups, a warm site offers the best balance: a second, smaller server that receives daily backups and can be promoted to primary within an hour. Combined with infrastructure-as-code, the recovery process becomes: restore backups, run the Ansible playbook, update DNS.
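The warm-site activation described above reduces to a short, rehearsable sequence. A sketch, where the playbook name and DNS step are illustrative:

```shell
#!/bin/bash
# activate-warm-site.sh - promote the standby server to primary
# The playbook name and DNS step are illustrative; adapt to your setup.
set -euo pipefail

REPO="s3:s3.us-west-000.backblazeb2.com/homelab-backups"

# 1. Pull the latest backups onto the standby
restic restore latest --repo "$REPO" --target / --include "/opt/docker"

# 2. Converge the host with infrastructure-as-code
ansible-playbook -i localhost, site.yml --connection=local

# 3. Point clients at the standby (manually or via your DNS provider's API)
echo "Update DNS records to point at this host, then verify service health."
```

Because every step is scripted or documented, the quarterly restore test can time this exact sequence and compare it against the RTO targets.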
Lessons from Real Outages
- GitLab.com (2017): An engineer accidentally deleted the wrong PostgreSQL data directory during maintenance, and all five of their backup and replication mechanisms turned out to be broken or missing. They recovered from a six-hour-old LVM snapshot that happened to exist. Lesson: test your backups, and have multiple independent backup methods.
- OVH Strasbourg (2021): A fire destroyed an entire datacenter. Customers who stored backups in the same datacenter lost everything. Lesson: off-site means genuinely off-site, not "on a different server in the same building."
- Rackspace Exchange (2022): A ransomware attack on hosted Exchange servers caused weeks of downtime. Some customer data was permanently lost. Lesson: even managed services need independent backups.
The common thread: Every major outage post-mortem reveals the same pattern -- the backup or failover mechanism was assumed to work but had not been tested recently. The organizations that recovered quickly were the ones that had practiced.
DR Plan Checklist
- Define RTO/RPO for every critical service.
- Document specific disaster scenarios and recovery procedures.
- Write runbooks with step-by-step commands for recovering each service.
- Automate backup verification with monthly testing.
- Maintain off-site, immutable backups that survive ransomware.
- Document the recovery order (DNS first, then proxy, then databases, then applications).
- Store credentials for recovery in a separate, accessible location (password manager, encrypted USB).
- Schedule quarterly restore tests.
- Conduct an annual full DR exercise.
- Update the plan after every test and every real incident.
With tools like usulnet providing centralized visibility into your Docker infrastructure across multiple hosts, you gain better awareness of what needs protecting. The built-in backup management features help ensure that every critical volume is backed up and that backup jobs are running on schedule -- reducing the gap between your actual RPO and your target RPO.