Running self-hosted services without monitoring is like driving without a dashboard. You have no idea when something is about to fail until it already has. A proper observability stack gives you metrics (what is happening), logs (why it is happening), and alerts (when something needs attention). This guide builds a complete, production-ready monitoring stack using open-source tools deployed entirely with Docker Compose.

The stack we will build covers the three pillars of observability:

  • Metrics: Prometheus for collection, Grafana for visualization, Alertmanager for notifications
  • Logs: Loki for aggregation, Promtail for collection
  • Availability: Uptime Kuma for endpoint monitoring
  • System metrics: Netdata or node_exporter for host-level metrics

Architecture Overview

Component      Role                               RAM Usage     Port
Prometheus     Time-series metrics database       ~200 MB base  9090
Grafana        Visualization and dashboards       ~150 MB       3000
Alertmanager   Alert routing and deduplication    ~50 MB        9093
Loki           Log aggregation                    ~200 MB       3100
Promtail       Log collection agent               ~50 MB        -
node_exporter  Host system metrics                ~20 MB        9100
cAdvisor       Container metrics                  ~80 MB        8080
Uptime Kuma    Endpoint availability monitoring   ~100 MB       3001

Total RAM for the entire stack is approximately 800 MB to 1 GB, a reasonable footprint for complete monitoring infrastructure, even on a modest VPS.
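The bind mounts in the Compose file below expect the configuration files to live beside docker-compose.yml. A quick way to scaffold that layout (the directory names are this guide's convention, not anything the tools require):

```shell
# Create the directory layout assumed by this guide's bind mounts.
# Run in the same directory as docker-compose.yml.
set -eu
mkdir -p prometheus/rules \
         alertmanager \
         loki \
         promtail \
         grafana/provisioning/datasources \
         grafana/provisioning/dashboards \
         grafana/dashboards
```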

Docker Compose Stack

Here is the complete Docker Compose file for the monitoring stack:

version: "3.8"

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  uptime_kuma_data:

services:
  # --- Metrics Collection ---
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      # the admin API allows deleting data over HTTP; drop this flag if
      # Prometheus is reachable beyond localhost
      - '--web.enable-admin-api'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules/:/etc/prometheus/rules/:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring

  # --- Visualization ---
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - loki
    networks:
      - monitoring

  # --- Alerting ---
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    networks:
      - monitoring

  # --- Log Aggregation ---
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    command: -config.file=/etc/loki/loki.yml
    volumes:
      - ./loki/loki.yml:/etc/loki/loki.yml:ro
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    command: -config.file=/etc/promtail/promtail.yml
    volumes:
      - ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - monitoring

  # --- Host Metrics ---
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring

  # --- Container Metrics ---
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  # --- Uptime Monitoring ---
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    restart: unless-stopped
    volumes:
      - uptime_kuma_data:/app/data
    ports:
      - "3001:3001"
    networks:
      - monitoring
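After `docker compose up -d`, a quick smoke test confirms each web endpoint is answering. A tolerant sketch (health-check paths are each project's documented ones; ports match the mappings above):

```shell
# Probe each service's published port; prints one OK/FAIL line per service
# and always exits 0, so it is safe to run while containers are still starting.
checked=""
for svc in prometheus:9090/-/healthy grafana:3000/api/health \
           alertmanager:9093/-/healthy loki:3100/ready uptime-kuma:3001; do
  name=${svc%%:*}
  path=${svc#*:}
  if curl -fsS --max-time 3 "http://localhost:${path}" >/dev/null 2>&1; then
    echo "OK   ${name}"
  else
    echo "FAIL ${name}"
  fi
  checked="${checked} ${name}"
done
```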

Prometheus Configuration

Create the Prometheus configuration that scrapes all exporters:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter - host metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # cAdvisor - container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Docker daemon metrics (requires "metrics-addr" in /etc/docker/daemon.json;
  # on Linux, also add extra_hosts: ["host.docker.internal:host-gateway"] to
  # the Prometheus service so the name resolves)
  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']

  # Application-specific targets
  - job_name: 'web-apps'
    metrics_path: /metrics
    static_configs:
      - targets:
        - 'app1:8080'
        - 'app2:8080'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '(.+):\d+'
        replacement: '${1}'

Alerting Rules

Define alerting rules for common failure scenarios:

# prometheus/rules/alerts.yml
groups:
  - name: host_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for 5 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% (current: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage on {{ $labels.mountpoint }} is above 85%"

      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"

  - name: container_alerts
    rules:
      - alert: ContainerDown
        # absent() only fires when no containers match at all and carries no
        # labels, so {{ $labels.name }} would be empty; compare per-container
        # last-seen timestamps instead
        expr: time() - container_last_seen{name!=""} > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: ContainerHighCPU
        expr: sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU usage"

      - alert: ContainerHighMemory
        # the "> 0" filter drops containers with no memory limit, whose
        # container_spec_memory_limit_bytes is 0 and would divide to +Inf
        expr: container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage above 85%"

      - alert: ContainerRestarting
        expr: increase(container_restart_count{name!=""}[15m]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarting frequently"

Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  # Alertmanager does not expand environment variables in its config file;
  # point at a secret file instead of interpolating ${SMTP_PASSWORD}
  smtp_auth_password_file: '/etc/alertmanager/smtp_password'
  smtp_require_tls: true

route:
  receiver: 'default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'default'
      repeat_interval: 4h

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      # Slack incoming webhooks expect Slack's own payload format, so use
      # slack_configs rather than a generic webhook_configs entry
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Loki Configuration

# loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_entries_limit_per_query: 5000

analytics:
  reporting_enabled: false

Promtail Configuration

# promtail/promtail.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  # persist this file via a volume so Promtail does not re-ship logs on restart
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Docker container logs
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'stream'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'service'

  # System logs
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          __path__: /var/log/syslog
      - targets:
          - localhost
        labels:
          job: authlog
          __path__: /var/log/auth.log
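With those labels in place, typical LogQL queries in Grafana's Explore view look like the following (the `service` value is whatever your Compose service is named; these examples are illustrative):

```logql
# All log lines from one Compose service containing "error"
{service="grafana"} |= "error"

# Log volume per container over the last 5 minutes
sum by (container) (rate({job="docker"}[5m]))

# Failed SSH logins from the auth log
{job="authlog"} |= "Failed password"
```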

Grafana Dashboard Provisioning

Auto-provision data sources and dashboards:

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false

  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    editable: false

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Tip: Import pre-built dashboards from Grafana.com. Dashboard ID 1860 (Node Exporter Full) and 14282 (cAdvisor exporter) are excellent starting points. Import them via the Grafana UI at Dashboards > Import, then customize as needed.

Useful PromQL Queries

Reference queries for building custom dashboards:

# CPU usage percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk I/O rate (bytes/sec)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network traffic rate
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Container CPU usage (per container)
sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100

# Container memory usage
container_memory_usage_bytes{name!=""} / 1024 / 1024

# Container network I/O
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])

# Top 5 containers by CPU
topk(5, sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) * 100)

# Predicted free disk space in 24 hours, in GiB (negative means full before then)
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) / 1024 / 1024 / 1024
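predict_linear fits a linear regression over the range and extrapolates; with only two samples that reduces to the two-point slope, which makes the arithmetic easy to sanity-check. An illustrative calculation with invented values (100 GiB free six hours ago, 90 GiB free now):

```shell
# Linear extrapolation the way predict_linear does it:
# last value + slope * horizon. Sample values are invented.
awk 'BEGIN {
  t1 = 0;     v1 = 100   # GiB free, 6 hours ago
  t2 = 21600; v2 = 90    # GiB free, now (21600 s = 6 h)
  slope = (v2 - v1) / (t2 - t1)          # GiB per second
  printf "%.0f GiB free in 24h\n", v2 + slope * 86400
}'
# prints: 50 GiB free in 24h
```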

Uptime Kuma Setup

Uptime Kuma provides a clean, user-friendly interface for monitoring service availability. After deployment, access it at http://your-server:3001 and configure monitors for:

  • HTTP/HTTPS endpoints: Your web applications, APIs, and admin panels
  • TCP ports: Database ports, mail server ports, SSH
  • DNS resolution: Verify your DNS records resolve correctly
  • Docker containers: Monitor container status directly via Docker socket
  • Push monitors: Accept heartbeat pings from cron jobs and scripts
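A push monitor inverts the direction: instead of Uptime Kuma polling the service, the job reports in. A crontab sketch for a nightly backup (the push URL and token are placeholders generated by your Uptime Kuma instance):

```crontab
# Report a nightly backup's success to an Uptime Kuma push monitor.
# The /api/push/<token> URL is shown when you create the monitor.
30 2 * * * /usr/local/bin/backup.sh && curl -fsS "http://your-server:3001/api/push/YOUR_TOKEN?status=up&msg=backup-ok" >/dev/null
```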

Configure notification channels (email, Slack, Discord, Telegram, webhooks) for immediate alerting when services go down.

Netdata for Real-Time System Metrics

For real-time, per-second system metrics with zero configuration, Netdata is an excellent complement:

# Add to your monitoring docker-compose.yml (also declare the netdata_config,
# netdata_lib, and netdata_cache named volumes at the top level)
  netdata:
    image: netdata/netdata:latest
    container_name: netdata
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - netdata_config:/etc/netdata
      - netdata_lib:/var/lib/netdata
      - netdata_cache:/var/cache/netdata
    ports:
      - "19999:19999"
    networks:
      - monitoring

Netdata provides thousands of metrics out of the box with automatic dashboard generation. It excels at real-time debugging but does not replace Prometheus for long-term storage and alerting.

Warning: Granting Docker socket access to monitoring containers (cAdvisor, Promtail, Netdata) effectively gives them root access to the host. In production, consider using a Docker socket proxy like Tecnativa/docker-socket-proxy to limit the API surface exposed to these containers.

Integration with usulnet

If you are using usulnet for Docker container management, the built-in monitoring features complement this stack. usulnet provides container-level health checks, resource usage tracking, and alerting without the overhead of deploying a separate monitoring stack. For teams that need deeper observability (custom metrics, long-term trends, log correlation), the Prometheus/Grafana/Loki stack described here provides the full picture, while usulnet handles the container management layer.

Maintenance and Scaling

Key maintenance tasks for your monitoring stack:

  1. Prometheus retention: Adjust --storage.tsdb.retention.time based on available disk. At a 15s scrape interval with a few thousand time series (typical for node_exporter plus cAdvisor), expect roughly 1-2 GB per month.
  2. Loki log rotation: Configure limits_config.reject_old_samples_max_age to prevent unbounded growth.
  3. Grafana backup: Back up the grafana_data volume (includes dashboards, users, and settings).
  4. Dashboard review: Regularly review and prune unused dashboards and alerts to reduce cognitive overhead.
  5. Alert fatigue: Tune thresholds to avoid false positives. An alert that fires constantly gets ignored, which is worse than no alert at all.
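The retention estimate in item 1 is just samples per day times compressed bytes per sample (Prometheus typically achieves 1-2 bytes per sample). A back-of-the-envelope check with assumed numbers (5,000 series, which node_exporter plus cAdvisor can easily reach):

```shell
# Rough Prometheus disk sizing: series x samples/day x bytes/sample.
# The series count and bytes/sample figures are illustrative assumptions.
awk 'BEGIN {
  series   = 5000
  interval = 15                 # scrape interval in seconds
  bytes    = 1.5                # compressed bytes per sample (typical)
  per_day  = series * (86400 / interval) * bytes
  printf "%.0f MB/day, %.1f GB per 30 days\n", per_day / 1e6, per_day * 30 / 1e9
}'
# prints: 43 MB/day, 1.3 GB per 30 days
```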

Rule of thumb: Every alert should be actionable. If receiving an alert does not lead to you taking a specific action, either tune the threshold, convert it to a dashboard panel, or delete it.