# Prometheus and Grafana: Complete Monitoring Stack for Your Infrastructure
You cannot manage what you cannot measure. When a container silently runs out of memory, a disk fills up at 3 AM, or an application starts responding slowly, you need to know about it before your users do. Prometheus and Grafana have become the industry standard for infrastructure monitoring, and for good reason: they are open source, extremely capable, and run perfectly in Docker.
This guide builds a complete monitoring stack from scratch, covering Prometheus data collection, Grafana visualization, alerting, and Docker-specific monitoring.
## Prometheus Architecture
Prometheus uses a pull-based model. Instead of applications pushing metrics to a central server, Prometheus scrapes (pulls) metrics from HTTP endpoints at regular intervals. This design has several advantages:
- Service discovery: Prometheus discovers targets automatically via DNS, Docker, Kubernetes, or static configuration.
- No agent required: Exporters expose a `/metrics` endpoint that Prometheus scrapes. Many applications have built-in Prometheus support.
- Failure isolation: If Prometheus goes down, applications continue running unaffected. Data gaps are visible but not destructive.
- Multi-dimensional data model: Every metric has labels (key-value pairs) that enable flexible querying.
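For example, scraping node_exporter's `/metrics` endpoint returns plain text in the Prometheus exposition format (the values below are illustrative):

```
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 4.2e+10
```

The labels in braces are what make the data model multi-dimensional: one metric name, many independently queryable series.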
## Complete Docker Compose Stack

```yaml
# /opt/docker/monitoring/docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    user: "1000:1000"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=90d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    user: "1000:1000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
      GF_SERVER_ROOT_URL: https://grafana.example.com
    ports:
      - "3000:3000"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    pid: host
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    ports:
      - "9115:9115"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
```
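The compose file reads `${GRAFANA_PASSWORD}` from the environment; the usual pattern is a `.env` file next to the compose file. The variable name matches the reference above; the value here is a placeholder:

```
# /opt/docker/monitoring/.env
GRAFANA_PASSWORD=change-me-to-something-strong
```

Keep this file out of version control, since it holds the Grafana admin credential.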
## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - host metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'homelab-server'

  # cAdvisor - container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Blackbox Exporter - endpoint probing
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://grafana.example.com
          - https://nextcloud.example.com
          - https://usulnet.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # Docker daemon metrics (requires daemon.json config)
  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']
```
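The last job scrapes the Docker daemon itself, which only exposes metrics after you opt in via `/etc/docker/daemon.json` and restart the daemon (older Docker releases also required `"experimental": true` alongside this setting). A minimal fragment, assuming you accept binding on all interfaces so the Prometheus container can reach it; restrict the address on exposed hosts:

```json
{
  "metrics-addr": "0.0.0.0:9323"
}
```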
## Key Exporters
| Exporter | Purpose | Key Metrics |
|---|---|---|
| Node Exporter | Host system metrics | CPU, memory, disk, network, load |
| cAdvisor | Container metrics | Per-container CPU, memory, network, I/O |
| Blackbox Exporter | Endpoint probing | HTTP status, latency, SSL expiry |
| Postgres Exporter | PostgreSQL metrics | Connections, queries, replication lag |
| MySQL Exporter | MySQL/MariaDB metrics | Connections, queries, InnoDB stats |
| Redis Exporter | Redis metrics | Memory, commands, keyspace |
| Nginx Exporter | Nginx metrics | Requests, connections, response times |
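Most of these follow the same pattern: run the exporter next to the service and add a scrape job. As one sketch, the Postgres Exporter from the prometheus-community project (the image name, `DATA_SOURCE_NAME` variable, and default port 9187 follow that exporter's conventions; the `monitor` user, `${PG_PASSWORD}`, and `postgres` hostname are assumptions to adapt):

```yaml
# docker-compose.yml (addition under services:)
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor:${PG_PASSWORD}@postgres:5432/postgres?sslmode=disable"
    networks:
      - monitoring

# prometheus.yml (addition under scrape_configs:)
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```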
## PromQL Basics
PromQL (Prometheus Query Language) is how you query time-series data. Learning these patterns covers 90% of monitoring use cases:
```promql
# Current CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Container CPU usage (per container)
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100

# Container memory usage (in MB)
container_memory_usage_bytes{name!=""} / 1024 / 1024

# HTTP request rate (per second)
rate(http_requests_total[5m])

# HTTP error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Disk I/O operations per second
rate(node_disk_io_time_seconds_total[5m])
```
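To demystify `histogram_quantile()`: Prometheus histograms are cumulative bucket counters, and the quantile is estimated by linear interpolation inside the first bucket whose cumulative count reaches the target rank. A rough Python sketch with made-up buckets (the real function also handles the `le="+Inf"` bucket and aggregation across series):

```python
# Hypothetical cumulative buckets: (upper bound in seconds, cumulative count)
buckets = [(0.1, 40), (0.5, 90), (1.0, 100)]

def quantile(q, buckets):
    rank = q * buckets[-1][1]          # target rank among all observations
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

print(quantile(0.95, buckets))  # 0.75 -> estimated p95 latency of 750ms
```

This is also why quantile accuracy depends on bucket layout: the estimate can never be more precise than the bucket boundaries you chose.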
The `rate()` function is the most important in PromQL. It calculates the per-second rate of increase of a counter, compensating for counter resets. Always use `rate()` with counters (metrics that only go up, like `_total` metrics). Use `irate()` for the instant rate (based on the last two samples) when you need more sensitive, spikier graphs.
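The arithmetic behind `rate()` can be sketched in a few lines of Python with hypothetical sample values (the real function also extrapolates to the window boundaries and handles resets):

```python
# Two hypothetical samples of a counter metric: (value, unix timestamp)
older = (1000.0, 1_700_000_000)
newer = (1150.0, 1_700_000_060)

# Per-second rate of increase over the 60s between the samples
per_second = (newer[0] - older[0]) / (newer[1] - older[1])
print(per_second)  # 2.5
```

So a counter that grew by 150 requests over a 60-second window graphs as 2.5 requests/second, regardless of the counter's absolute value.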
## Alert Rules

```yaml
# alert-rules.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ printf \"%.1f\" $value }}% for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ printf \"%.1f\" $value }}%."

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage is {{ printf \"%.1f\" $value }}%."

      - alert: ContainerDown
        # container_last_seen keeps the per-container name label, so the
        # alert can report which container stopped reporting.
        expr: time() - container_last_seen{name!=""} > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: ContainerHighMemory
        expr: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage is high"
          description: "Container is using {{ printf \"%.1f\" $value }}% of its memory limit."

      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}."

      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"
```
## Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        send_resolved: true
  - name: 'critical'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        send_resolved: true
    # Optional: also send email for critical alerts
    # email_configs:
    #   - to: '[email protected]'
    #     from: '[email protected]'
    #     smarthost: 'smtp.example.com:587'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
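If you point `webhook_configs` at your own service instead of Slack, Alertmanager POSTs a JSON document (webhook format version 4). A small Python sketch of parsing one — the payload below is illustrative, but the `status`/`alerts`/`labels`/`annotations` fields match the documented webhook format:

```python
import json

# Illustrative Alertmanager webhook payload (format version 4)
payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighCPUUsage", "severity": "warning", "instance": "homelab-server"},
      "annotations": {"summary": "High CPU usage on homelab-server"}
    }
  ]
}
""")

def format_alerts(payload):
    """Render each alert in the payload as a one-line notification string."""
    return [
        f'[{a["labels"]["severity"].upper()}] {a["annotations"]["summary"]} ({a["status"]})'
        for a in payload["alerts"]
    ]

print(format_alerts(payload))  # ['[WARNING] High CPU usage on homelab-server (firing)']
```

Because `send_resolved: true` is set above, the same endpoint also receives payloads with `"status": "resolved"`, so a receiver should handle both cases.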
## Grafana Dashboard Provisioning

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true
```
Popular pre-built dashboard IDs for Grafana (import via Dashboard > Import):
- 1860: Node Exporter Full (comprehensive host metrics)
- 14282: cAdvisor (container metrics)
- 13659: Blackbox Exporter (endpoint monitoring)
- 9628: PostgreSQL Exporter
- 7362: MySQL Overview
## Service Discovery for Docker

```yaml
# prometheus.yml - Auto-discover Docker containers with labels
# (Prometheus needs /var/run/docker.sock mounted read-only for this)
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      # Only scrape containers with prometheus.scrape=true label
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: 'true'
        action: keep
      # Combine the container's IP with its prometheus.port label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Honour an optional prometheus.path label for the metrics path
      - source_labels: [__meta_docker_container_label_prometheus_path]
        regex: '(.+)'
        target_label: __metrics_path__
      # Use container name as instance label
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        target_label: instance
```

Then label your containers:

```yaml
# docker-compose.yml
services:
  myapp:
    image: myapp:latest
    labels:
      prometheus.scrape: "true"
      prometheus.port: "8080"
      prometheus.path: "/metrics"
```
## Storage and Retention

```yaml
# Prometheus storage configuration
# In the command section of your docker-compose.yml:
command:
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=90d'   # Keep data for 90 days
  - '--storage.tsdb.retention.size=10GB'  # Or until 10GB is used
  - '--storage.tsdb.min-block-duration=2h'
  - '--storage.tsdb.max-block-duration=36h'
```

```shell
# Check storage usage
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool

# Estimate storage needs:
#   ~1-2 bytes per sample (after TSDB compression)
#   15s scrape interval = 4 samples/minute per metric
#   500 metrics * 4 samples/min * 60 min * 24 hr * 30 days
#   = ~86.4M samples/month = ~90-170MB/month (compressed)
```
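The back-of-envelope estimate above, as a quick script. The ~2 bytes/sample figure is a rough rule of thumb for Prometheus's compressed TSDB, so treat the result as an order-of-magnitude guide rather than a promise:

```python
metrics = 500                 # active time series
scrape_interval_s = 15        # -> 4 samples/minute per series
days = 30
bytes_per_sample = 2          # rough upper bound after TSDB compression

samples_per_month = metrics * (60 // scrape_interval_s) * 60 * 24 * days
mb_per_month = samples_per_month * bytes_per_sample / 1024 / 1024

print(f"{samples_per_month:,} samples/month")  # 86,400,000 samples/month
print(f"~{mb_per_month:.0f} MB/month")         # ~165 MB/month
```

Plug in your own series count (the `prometheus_tsdb_head_series` metric reports it) to size the retention flags above.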
## Best Practices
- Monitor the monitoring. Set up an external health check (e.g., Healthchecks.io) that alerts you if Prometheus itself goes down.
- Use recording rules for frequently queried or expensive PromQL expressions.
- Limit label cardinality. Every unique label set creates a new time series.
- Set meaningful alert thresholds. Avoid alert fatigue by only alerting on actionable conditions.
- Use Grafana variables to create reusable dashboards that work across multiple hosts.
- Back up Grafana dashboards by exporting JSON and storing in version control.
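On the recording-rules point: a recording rule precomputes an expensive expression at every `evaluation_interval` and stores the result under a new metric name, so dashboards query the cheap precomputed series. A sketch reusing the CPU expression from earlier (the rule name follows the common `level:metric:operation` naming convention; put it in any file listed under `rule_files`):

```yaml
groups:
  - name: recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

A dashboard panel can then query `instance:node_cpu_utilisation:rate5m` directly instead of re-evaluating the full expression on every refresh.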
While Prometheus and Grafana provide deep monitoring capabilities, platforms like usulnet integrate monitoring directly into the container management experience, with built-in dashboards for container health, resource usage, and logs. For complex monitoring needs, usulnet complements a dedicated Prometheus/Grafana stack by providing container-specific visibility without additional configuration.