Self-Hosted Monitoring: Building a Complete Observability Stack
Running self-hosted services without monitoring is like driving without a dashboard. You have no idea when something is about to fail until it already has. A proper observability stack gives you metrics (what is happening), logs (why it is happening), and alerts (when something needs attention). This guide builds a complete, production-ready monitoring stack using open-source tools deployed entirely with Docker Compose.
The stack we will build covers every layer of observability you need for self-hosting:
- Metrics: Prometheus for collection, Grafana for visualization, Alertmanager for notifications
- Logs: Loki for aggregation, Promtail for collection
- Availability: Uptime Kuma for endpoint monitoring
- System metrics: Netdata or node_exporter for host-level metrics
Architecture Overview
| Component | Role | RAM Usage | Port |
|---|---|---|---|
| Prometheus | Time-series metrics database | ~200 MB base | 9090 |
| Grafana | Visualization and dashboards | ~150 MB | 3000 |
| Alertmanager | Alert routing and deduplication | ~50 MB | 9093 |
| Loki | Log aggregation | ~200 MB | 3100 |
| Promtail | Log collection agent | ~50 MB | - |
| node_exporter | Host system metrics | ~20 MB | 9100 |
| cAdvisor | Container metrics | ~80 MB | 8080 |
| Uptime Kuma | Endpoint availability monitoring | ~100 MB | 3001 |
Total RAM for the entire stack: approximately 800 MB to 1 GB, which is reasonable for a monitoring infrastructure.
Docker Compose Stack
Here is the complete Docker Compose file for the monitoring stack:
# docker-compose.yml (the top-level "version" key is obsolete in Compose v2 and omitted here)
networks:
monitoring:
driver: bridge
volumes:
prometheus_data:
grafana_data:
loki_data:
uptime_kuma_data:
services:
# --- Metrics Collection ---
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
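      # enable-admin-api permits destructive calls (e.g. deleting series); drop it if unused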
- '--web.enable-admin-api'
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/rules/:/etc/prometheus/rules/:ro
- prometheus_data:/prometheus
ports:
- "9090:9090"
networks:
- monitoring
# --- Visualization ---
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_USER=${GRAFANA_USER:-admin}
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_ROOT_URL=https://grafana.example.com
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
ports:
- "3000:3000"
depends_on:
- prometheus
- loki
networks:
- monitoring
# --- Alerting ---
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
restart: unless-stopped
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
ports:
- "9093:9093"
networks:
- monitoring
# --- Log Aggregation ---
loki:
image: grafana/loki:latest
container_name: loki
restart: unless-stopped
command: -config.file=/etc/loki/loki.yml
volumes:
- ./loki/loki.yml:/etc/loki/loki.yml:ro
- loki_data:/loki
ports:
- "3100:3100"
networks:
- monitoring
promtail:
image: grafana/promtail:latest
container_name: promtail
restart: unless-stopped
command: -config.file=/etc/promtail/promtail.yml
volumes:
- ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
networks:
- monitoring
# --- Host Metrics ---
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
ports:
- "9100:9100"
networks:
- monitoring
# --- Container Metrics ---
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
privileged: true
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
ports:
- "8080:8080"
networks:
- monitoring
# --- Uptime Monitoring ---
uptime-kuma:
image: louislam/uptime-kuma:latest
container_name: uptime-kuma
restart: unless-stopped
volumes:
- uptime_kuma_data:/app/data
ports:
- "3001:3001"
networks:
- monitoring
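The bind mounts above assume a project layout like the following; the file names match the ones referenced in the compose file, and the .env file carries the GRAFANA_PASSWORD variable used by the grafana service:

# Expected project layout
monitoring/
├── docker-compose.yml
├── .env                      # GRAFANA_PASSWORD=...
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── loki/
│   └── loki.yml
├── promtail/
│   └── promtail.yml
└── grafana/
    ├── provisioning/
    │   ├── datasources/datasources.yml
    │   └── dashboards/dashboards.yml
    └── dashboards/           # dashboard JSON files land here

Once the configuration files below are in place, bring everything up with docker compose up -d and verify with docker compose ps.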
Prometheus Configuration
Create the Prometheus configuration that scrapes all exporters:
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_timeout: 10s
rule_files:
- /etc/prometheus/rules/*.yml
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter - host metrics
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
# cAdvisor - container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
  # Docker daemon metrics (requires "metrics-addr" in /etc/docker/daemon.json;
  # on Linux, host.docker.internal also needs an extra_hosts: host-gateway entry)
- job_name: 'docker'
static_configs:
- targets: ['host.docker.internal:9323']
# Application-specific targets
- job_name: 'web-apps'
metrics_path: /metrics
static_configs:
- targets:
- 'app1:8080'
- 'app2:8080'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '(.+):\d+'
replacement: '${1}'
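Because the compose file passes --web.enable-lifecycle, configuration changes can be applied without restarting the container. A quick check-and-reload cycle using promtool, which ships inside the prom/prometheus image:

# Validate the config, then hot-reload Prometheus
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload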
Alerting Rules
Define alerting rules for common failure scenarios:
# prometheus/rules/alerts.yml
groups:
- name: host_alerts
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 85% for 5 minutes (current: {{ $value }}%)"
- alert: HighMemoryUsage
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% (current: {{ $value }}%)"
- alert: DiskSpaceLow
expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk usage on {{ $labels.mountpoint }} is above 85%"
- alert: DiskSpaceCritical
expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space critical on {{ $labels.instance }}"
- name: container_alerts
rules:
      - alert: ContainerDown
        # absent() only fires when *no* matching series exists at all and cannot
        # carry a per-container name label; compare last-seen timestamps instead
        expr: time() - container_last_seen{name!=""} > 60
        for: 0m
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} is down"
- alert: ContainerHighCPU
expr: sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high CPU usage"
- alert: ContainerHighMemory
expr: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} memory usage above 85%"
- alert: ContainerRestarting
expr: increase(container_restart_count{name!=""}[15m]) > 3
for: 0m
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} restarting frequently"
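promtool can also lint rule files before Prometheus loads them, which catches YAML and PromQL errors early:

# Check the rule file for syntax and expression errors
docker exec prometheus promtool check rules /etc/prometheus/rules/alerts.yml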
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.example.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
  # Alertmanager does not expand environment variables in its config file;
  # mount the secret into the container and reference it via a _file option
  smtp_auth_password_file: /etc/alertmanager/smtp_password
smtp_require_tls: true
route:
receiver: 'default'
group_by: ['alertname', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
    - matchers:
        - severity = "critical"
      receiver: 'critical'
      repeat_interval: 1h
    - matchers:
        - severity = "warning"
      receiver: 'default'
      repeat_interval: 4h
receivers:
- name: 'default'
email_configs:
- to: '[email protected]'
send_resolved: true
- name: 'critical'
email_configs:
- to: '[email protected]'
send_resolved: true
    # Slack incoming webhooks expect Slack's payload format, so use
    # slack_configs rather than the generic webhook receiver
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        send_resolved: true
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']
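To confirm the routing tree behaves as intended, amtool (bundled in the prom/alertmanager image) can show which receiver a given label set reaches, and a synthetic alert can be posted to the v2 API; the label values here are just examples:

# Which receiver handles a critical alert?
docker exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml severity=critical
# Fire a synthetic alert to exercise the full notification path
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}}]'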
Loki Configuration
# loki/loki.yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
max_entries_limit_per_query: 5000
analytics:
reporting_enabled: false
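Note that this configuration never deletes old logs; rejecting old samples only affects ingestion. To have Loki enforce retention, a minimal sketch (assuming a recent Loki release, where delete_request_store is required once retention is enabled) adds a compactor block and merges a retention_period into the existing limits_config:

# Append to loki/loki.yml to enable deletion of old logs
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
limits_config:
  retention_period: 744h   # keep 31 days, then delete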
Promtail Configuration
# promtail/promtail.yml
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
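  # /tmp is wiped when the container is recreated, so Promtail re-reads all
  # logs from scratch; mount a small volume here to avoid duplicate ingestion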
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
# Docker container logs
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ['__meta_docker_container_name']
regex: '/(.*)'
target_label: 'container'
- source_labels: ['__meta_docker_container_log_stream']
target_label: 'stream'
- source_labels: ['__meta_docker_container_label_com_docker_compose_service']
target_label: 'service'
# System logs
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: syslog
__path__: /var/log/syslog
- targets:
- localhost
labels:
job: authlog
__path__: /var/log/auth.log
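With these labels in place, logs become queryable from Grafana's Explore view using LogQL. A few starting points (the label names match the relabel_configs above):

# All logs from one compose service
{service="grafana"}
# Error lines across every container
{container=~".+"} |= "error"
# Error rate per container over the last 5 minutes
sum by (container) (rate({container=~".+"} |= "error" [5m]))
# Failed SSH logins from the auth log
{job="authlog"} |= "Failed password"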
Grafana Dashboard Provisioning
Auto-provision data sources and dashboards:
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: false
- name: Alertmanager
type: alertmanager
access: proxy
url: http://alertmanager:9093
editable: false
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
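With the file provider in place, any dashboard JSON dropped into ./grafana/dashboards is picked up automatically. For example, the popular "Node Exporter Full" community dashboard (ID 1860) can be pre-loaded with the commonly used grafana.com download URL pattern, which may change over time:

curl -fsSL -o grafana/dashboards/node-exporter-full.json \
  https://grafana.com/api/dashboards/1860/revisions/latest/download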
Useful PromQL Queries
Reference queries for building custom dashboards:
# CPU usage percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk I/O rate (bytes/sec)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network traffic rate
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
# Container CPU usage (per container)
sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100
# Container memory usage
container_memory_usage_bytes{name!=""} / 1024 / 1024
# Container network I/O
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])
# Top 5 containers by CPU
topk(5, sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100)
# Predicted free disk space 24 hours from now (GiB); alert when the trend approaches zero
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) / 1024 / 1024 / 1024
Uptime Kuma Setup
Uptime Kuma provides a clean, user-friendly interface for monitoring service availability. After deployment, access it at http://your-server:3001 and configure monitors for:
- HTTP/HTTPS endpoints: Your web applications, APIs, and admin panels
- TCP ports: Database ports, mail server ports, SSH
- DNS resolution: Verify your DNS records resolve correctly
- Docker containers: Monitor container status directly via the Docker socket (mount /var/run/docker.sock into the uptime-kuma service to enable this)
- Push monitors: Accept heartbeat pings from cron jobs and scripts
Configure notification channels (email, Slack, Discord, Telegram, webhooks) for immediate alerting when services go down.
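Push monitors invert the check: Uptime Kuma alerts when an expected heartbeat stops arriving. A sketch for a nightly backup job, where the push token abc123 is a placeholder that Uptime Kuma generates when you create the monitor:

# crontab entry: report success only if the backup script exits cleanly
30 2 * * * /usr/local/bin/backup.sh && curl -fsS "http://localhost:3001/api/push/abc123?status=up&msg=backup-ok" > /dev/null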
Netdata for Real-Time System Metrics
For real-time, per-second system metrics with zero configuration, Netdata is an excellent complement:
# Add to your monitoring docker-compose.yml (and declare the three
# netdata_* named volumes under the top-level volumes: key as well)
netdata:
image: netdata/netdata:latest
container_name: netdata
restart: unless-stopped
cap_add:
- SYS_PTRACE
- SYS_ADMIN
security_opt:
- apparmor:unconfined
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- netdata_config:/etc/netdata
- netdata_lib:/var/lib/netdata
- netdata_cache:/var/cache/netdata
ports:
- "19999:19999"
networks:
- monitoring
Netdata provides thousands of metrics out of the box with automatic dashboard generation. It excels at real-time debugging but does not replace Prometheus for long-term storage and alerting.
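If you want Netdata's per-second metrics to also land in Prometheus for long-term storage, Netdata exposes a Prometheus-format endpoint that can be scraped like any other target. A sketch to append under scrape_configs in prometheus.yml:

# prometheus/prometheus.yml: additional scrape job for Netdata
  - job_name: 'netdata'
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]
    static_configs:
      - targets: ['netdata:19999']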
Integration with usulnet
If you are using usulnet for Docker container management, the built-in monitoring features complement this stack. usulnet provides container-level health checks, resource usage tracking, and alerting without the overhead of deploying a separate monitoring stack. For teams that need deeper observability (custom metrics, long-term trends, log correlation), the Prometheus/Grafana/Loki stack described here provides the full picture, while usulnet handles the container management layer.
Maintenance and Scaling
Key maintenance tasks for your monitoring stack:
- Prometheus retention: Adjust `--storage.tsdb.retention.time` based on available disk. At a 15s scrape interval, a single-host stack like this one typically grows by roughly 1-2 GB per month.
- Loki log retention: `limits_config.reject_old_samples_max_age` only rejects late-arriving samples at ingestion; to actually delete old logs, enable the compactor retention described in the Loki section.
- Grafana backup: Back up the `grafana_data` volume, which holds dashboards, users, and settings; see the sketch after this list.
- Dashboard review: Regularly review and prune unused dashboards and alerts to reduce cognitive overhead.
- Alert fatigue: Tune thresholds to avoid false positives. An alert that fires constantly gets ignored, which is worse than no alert at all.
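For the Grafana backup task above, a minimal sketch using a throwaway container; note that Compose prefixes volume names with the project name, so adjust monitoring_grafana_data to match yours:

# Archive the Grafana volume to ./backups (stop Grafana first for a consistent copy)
docker run --rm \
  -v monitoring_grafana_data:/data:ro \
  -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/grafana-$(date +%F).tar.gz -C /data .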
Rule of thumb: Every alert should be actionable. If receiving an alert does not lead to you taking a specific action, either tune the threshold, convert it to a dashboard panel, or delete it.