Self-Hosted Monitoring: Building a Complete Observability Stack
Running self-hosted services without monitoring is like driving without a dashboard. You have no idea when something is about to fail until it already has. A proper observability stack gives you metrics (what is happening), logs (why it is happening), and alerts (when something needs attention). This guide builds a complete, production-ready monitoring stack using open-source tools deployed entirely with Docker Compose.
The stack we will build covers every layer of observability you need for self-hosting:
- Metrics: Prometheus for collection, Grafana for visualization, Alertmanager for notifications
- Logs: Loki for aggregation, Promtail for collection
- Availability: Uptime Kuma for endpoint monitoring
- System metrics: Netdata or node_exporter for host-level metrics
Architecture Overview
| Component | Role | RAM Usage | Port |
|---|---|---|---|
| Prometheus | Time-series metrics database | ~200 MB base | 9090 |
| Grafana | Visualization and dashboards | ~150 MB | 3000 |
| Alertmanager | Alert routing and deduplication | ~50 MB | 9093 |
| Loki | Log aggregation | ~200 MB | 3100 |
| Promtail | Log collection agent | ~50 MB | - |
| node_exporter | Host system metrics | ~20 MB | 9100 |
| cAdvisor | Container metrics | ~80 MB | 8080 |
| Uptime Kuma | Endpoint availability monitoring | ~100 MB | 3001 |
Total RAM for the entire stack: approximately 800 MB to 1 GB, which is reasonable for a monitoring infrastructure.
Docker Compose Stack
Here is the complete Docker Compose file for the monitoring stack:
# docker-compose.yml (the top-level "version" key is obsolete in Compose v2 and omitted here)
networks:
monitoring:
driver: bridge
volumes:
prometheus_data:
grafana_data:
loki_data:
uptime_kuma_data:
services:
# --- Metrics Collection ---
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
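      # enable-admin-api permits destructive calls (e.g. deleting series); drop it if unused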
- '--web.enable-admin-api'
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/rules/:/etc/prometheus/rules/:ro
- prometheus_data:/prometheus
ports:
- "9090:9090"
networks:
- monitoring
# --- Visualization ---
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_USER=${GRAFANA_USER:-admin}
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_ROOT_URL=https://grafana.example.com
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
ports:
- "3000:3000"
depends_on:
- prometheus
- loki
networks:
- monitoring
# --- Alerting ---
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
restart: unless-stopped
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
ports:
- "9093:9093"
networks:
- monitoring
# --- Log Aggregation ---
loki:
image: grafana/loki:latest
container_name: loki
restart: unless-stopped
command: -config.file=/etc/loki/loki.yml
volumes:
- ./loki/loki.yml:/etc/loki/loki.yml:ro
- loki_data:/loki
ports:
- "3100:3100"
networks:
- monitoring
promtail:
image: grafana/promtail:latest
container_name: promtail
restart: unless-stopped
command: -config.file=/etc/promtail/promtail.yml
volumes:
- ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
networks:
- monitoring
# --- Host Metrics ---
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
ports:
- "9100:9100"
networks:
- monitoring
# --- Container Metrics ---
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
privileged: true
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
ports:
- "8080:8080"
networks:
- monitoring
# --- Uptime Monitoring ---
uptime-kuma:
image: louislam/uptime-kuma:latest
container_name: uptime-kuma
restart: unless-stopped
volumes:
- uptime_kuma_data:/app/data
ports:
- "3001:3001"
networks:
- monitoring
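The bind mounts above assume a project layout like the following; the file names match the ones referenced in the compose file, and the .env file carries the GRAFANA_PASSWORD variable used by the grafana service:

# Expected project layout
monitoring/
├── docker-compose.yml
├── .env                      # GRAFANA_PASSWORD=...
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── loki/
│   └── loki.yml
├── promtail/
│   └── promtail.yml
└── grafana/
    ├── provisioning/
    │   ├── datasources/datasources.yml
    │   └── dashboards/dashboards.yml
    └── dashboards/           # dashboard JSON files land here

Once the configuration files below are in place, bring everything up with docker compose up -d and verify with docker compose ps.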
Prometheus Configuration
Create the Prometheus configuration that scrapes all exporters:
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_timeout: 10s
rule_files:
- /etc/prometheus/rules/*.yml
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter - host metrics
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
# cAdvisor - container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
  # Docker daemon metrics (requires "metrics-addr" in /etc/docker/daemon.json;
  # on Linux, host.docker.internal also needs an extra_hosts: host-gateway entry)
- job_name: 'docker'
static_configs:
- targets: ['host.docker.internal:9323']
# Application-specific targets
- job_name: 'web-apps'
metrics_path: /metrics
static_configs:
- targets:
- 'app1:8080'
- 'app2:8080'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '(.+):\d+'
replacement: '${1}'
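Because the compose file passes --web.enable-lifecycle, configuration changes can be applied without restarting the container. A quick check-and-reload cycle using promtool, which ships inside the prom/prometheus image:

# Validate the config, then hot-reload Prometheus
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload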
Alerting Rules
Define alerting rules for common failure scenarios:
# prometheus/rules/alerts.yml
groups:
- name: host_alerts
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 85% for 5 minutes (current: {{ $value }}%)"
- alert: HighMemoryUsage
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% (current: {{ $value }}%)"
- alert: DiskSpaceLow
expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk usage on {{ $labels.mountpoint }} is above 85%"
- alert: DiskSpaceCritical
expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space critical on {{ $labels.instance }}"
- name: container_alerts
rules:
      - alert: ContainerDown
        # absent() only fires when *no* matching series exists at all and cannot
        # carry a per-container name label; compare last-seen timestamps instead
        expr: time() - container_last_seen{name!=""} > 60
        for: 0m
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} is down"
- alert: ContainerHighCPU
expr: sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high CPU usage"
- alert: ContainerHighMemory
expr: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} memory usage above 85%"
- alert: ContainerRestarting
expr: increase(container_restart_count{name!=""}[15m]) > 3
for: 0m
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} restarting frequently"
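promtool can also lint rule files before Prometheus loads them, which catches YAML and PromQL errors early:

# Check the rule file for syntax and expression errors
docker exec prometheus promtool check rules /etc/prometheus/rules/alerts.yml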
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.example.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
  # Alertmanager does not expand environment variables in its config file;
  # mount the secret into the container and reference it via a _file option
  smtp_auth_password_file: /etc/alertmanager/smtp_password
smtp_require_tls: true
route:
receiver: 'default'
group_by: ['alertname', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
    - matchers:
        - severity = "critical"
      receiver: 'critical'
      repeat_interval: 1h
    - matchers:
        - severity = "warning"
      receiver: 'default'
      repeat_interval: 4h
receivers:
- name: 'default'
email_configs:
- to: '[email protected]'
send_resolved: true
- name: 'critical'
email_configs:
- to: '[email protected]'
send_resolved: true
    # Slack incoming webhooks expect Slack's payload format, so use
    # slack_configs rather than the generic webhook receiver
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        send_resolved: true
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']
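To confirm the routing tree behaves as intended, amtool (bundled in the prom/alertmanager image) can show which receiver a given label set reaches, and a synthetic alert can be posted to the v2 API; the label values here are just examples:

# Which receiver handles a critical alert?
docker exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml severity=critical
# Fire a synthetic alert to exercise the full notification path
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}}]'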
Loki Configuration
# loki/loki.yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
max_entries_limit_per_query: 5000
analytics:
reporting_enabled: false
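Note that this configuration never deletes old logs; rejecting old samples only affects ingestion. To have Loki enforce retention, a minimal sketch (assuming a recent Loki release, where delete_request_store is required once retention is enabled) adds a compactor block and merges a retention_period into the existing limits_config:

# Append to loki/loki.yml to enable deletion of old logs
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
limits_config:
  retention_period: 744h   # keep 31 days, then delete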
Promtail Configuration
# promtail/promtail.yml
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
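  # /tmp is wiped when the container is recreated, so Promtail re-reads all
  # logs from scratch; mount a small volume here to avoid duplicate ingestion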
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
# Docker container logs
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: 'container'
- source_labels: ['__meta_docker_container_log_stream']
target_label: 'stream'
- source_labels: ['__meta_docker_container_label_com_docker_compose_service']
target_label: 'service'
# System logs
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: syslog
__path__: /var/log/syslog
- targets:
- localhost
labels:
job: authlog
__path__: /var/log/auth.log
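With these labels in place, logs become queryable from Grafana's Explore view using LogQL. A few starting points (the label names match the relabel_configs above):

# All logs from one compose service
{service="grafana"}
# Error lines across every container
{container=~".+"} |= "error"
# Error rate per container over the last 5 minutes
sum by (container) (rate({container=~".+"} |= "error" [5m]))
# Failed SSH logins from the auth log
{job="authlog"} |= "Failed password"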
Grafana Dashboard Provisioning
Auto-provision data sources and dashboards:
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: false
- name: Alertmanager
type: alertmanager
access: proxy
url: http://alertmanager:9093
editable: false
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
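With the file provider in place, any dashboard JSON dropped into ./grafana/dashboards is picked up automatically. For example, the popular "Node Exporter Full" community dashboard (ID 1860) can be pre-loaded with the commonly used grafana.com download URL pattern, which may change over time:

curl -fsSL -o grafana/dashboards/node-exporter-full.json \
  https://grafana.com/api/dashboards/1860/revisions/latest/download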
Useful PromQL Queries
Reference queries for building custom dashboards:
# CPU usage percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk I/O rate (bytes/sec)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic rate
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
# Container CPU usage (per container)
sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100
# Container memory usage
container_memory_usage_bytes{name!=""} / 1024 / 1024
# Container network I/O
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])
# Top 5 containers by CPU
topk(5, sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100)
# Predicted free disk space 24 hours from now (GiB); alert when the trend approaches zero
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) / 1024 / 1024 / 1024
Uptime Kuma Setup
Uptime Kuma provides a clean, user-friendly interface for monitoring service availability. After deployment, access it at http://your-server:3001 and configure monitors for:
- HTTP/HTTPS endpoints: Your web applications, APIs, and admin panels
- TCP ports: Database ports, mail server ports, SSH
- DNS resolution: Verify your DNS records resolve correctly
- Docker containers: Monitor container status directly via the Docker socket (mount /var/run/docker.sock into the uptime-kuma service to enable this)
- Push monitors: Accept heartbeat pings from cron jobs and scripts
Configure notification channels (email, Slack, Discord, Telegram, webhooks) for immediate alerting when services go down.
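Push monitors invert the check: Uptime Kuma alerts when an expected heartbeat stops arriving. A sketch for a nightly backup job, where the push token abc123 is a placeholder that Uptime Kuma generates when you create the monitor:

# crontab entry: report success only if the backup script exits cleanly
30 2 * * * /usr/local/bin/backup.sh && curl -fsS "http://localhost:3001/api/push/abc123?status=up&msg=backup-ok" > /dev/null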
Netdata for Real-Time System Metrics
For real-time, per-second system metrics with zero configuration, Netdata is an excellent complement:
# Add to your monitoring docker-compose.yml (and declare the three
# netdata_* named volumes under the top-level volumes: key as well)
netdata:
image: netdata/netdata:latest
container_name: netdata
restart: unless-stopped
cap_add:
- SYS_PTRACE
- SYS_ADMIN
security_opt:
- apparmor:unconfined
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- netdata_config:/etc/netdata
- netdata_lib:/var/lib/netdata
- netdata_cache:/var/cache/netdata
ports:
- "19999:19999"
networks:
- monitoring
Netdata provides thousands of metrics out of the box with automatic dashboard generation. It excels at real-time debugging but does not replace Prometheus for long-term storage and alerting.
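If you want Netdata's per-second metrics to also land in Prometheus for long-term storage, Netdata exposes a Prometheus-format endpoint that can be scraped like any other target. A sketch to append under scrape_configs in prometheus.yml:

# prometheus/prometheus.yml: additional scrape job for Netdata
  - job_name: 'netdata'
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]
    static_configs:
      - targets: ['netdata:19999']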
Integration with usulnet
If you are using usulnet for Docker container management, the built-in monitoring features complement this stack. usulnet provides container-level health checks, resource usage tracking, and alerting without the overhead of deploying a separate monitoring stack. For teams that need deeper observability (custom metrics, long-term trends, log correlation), the Prometheus/Grafana/Loki stack described here provides the full picture, while usulnet handles the container management layer.
Maintenance and Scaling
Key maintenance tasks for your monitoring stack:
- Prometheus retention: Adjust `--storage.tsdb.retention.time` based on available disk. At a 15s scrape interval, a single-host stack like this one typically grows by roughly 1-2 GB per month.
- Loki log retention: `limits_config.reject_old_samples_max_age` only rejects late-arriving samples at ingestion; to actually delete old logs, enable the compactor retention described in the Loki section.
- Grafana backup: Back up the `grafana_data` volume, which holds dashboards, users, and settings; see the sketch after this list.
- Dashboard review: Regularly review and prune unused dashboards and alerts to reduce cognitive overhead.
- Alert fatigue: Tune thresholds to avoid false positives. An alert that fires constantly gets ignored, which is worse than no alert at all.
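For the Grafana backup task above, a minimal sketch using a throwaway container; note that Compose prefixes volume names with the project name, so adjust monitoring_grafana_data to match yours:

# Archive the Grafana volume to ./backups (stop Grafana first for a consistent copy)
docker run --rm \
  -v monitoring_grafana_data:/data:ro \
  -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/grafana-$(date +%F).tar.gz -C /data .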
Rule of thumb: Every alert should be actionable. If receiving an alert does not lead to you taking a specific action, either tune the threshold, convert it to a dashboard panel, or delete it.