# Prometheus and Grafana: Complete Monitoring Stack for Your Infrastructure
You cannot manage what you cannot measure. When a container silently runs out of memory, a disk fills up at 3 AM, or an application starts responding slowly, you need to know about it before your users do. Prometheus and Grafana have become the industry standard for infrastructure monitoring, and for good reason: they are open source, extremely capable, and run perfectly in Docker.
This guide builds a complete monitoring stack from scratch, covering Prometheus data collection, Grafana visualization, alerting, and Docker-specific monitoring.
## Prometheus Architecture
Prometheus uses a pull-based model. Instead of applications pushing metrics to a central server, Prometheus scrapes (pulls) metrics from HTTP endpoints at regular intervals. This design has several advantages:
- Service discovery: Prometheus discovers targets automatically via DNS, Docker, Kubernetes, or static configuration.
- No agent required: Exporters expose a `/metrics` endpoint that Prometheus scrapes. Many applications have built-in Prometheus support.
- Failure isolation: If Prometheus goes down, applications continue running unaffected. Data gaps are visible but not destructive.
- Multi-dimensional data model: Every metric has labels (key-value pairs) that enable flexible querying.
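For example, scraping node_exporter's `/metrics` endpoint returns plain text in the Prometheus exposition format (the values below are illustrative):

```
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 4.2e+10
```

The labels in braces are what make the data model multi-dimensional: one metric name, many independently queryable series.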
## Complete Docker Compose Stack

```yaml
# /opt/docker/monitoring/docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    user: "1000:1000"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=90d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    user: "1000:1000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-piechart-panel
      GF_SERVER_ROOT_URL: https://grafana.example.com
    ports:
      - "3000:3000"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    pid: host
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    ports:
      - "9115:9115"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
```
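The compose file reads `${GRAFANA_PASSWORD}` from the environment; the usual pattern is a `.env` file next to the compose file. The variable name matches the reference above; the value here is a placeholder:

```
# /opt/docker/monitoring/.env
GRAFANA_PASSWORD=change-me-to-something-strong
```

Keep this file out of version control, since it holds the Grafana admin credential.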
## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - host metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'homelab-server'

  # cAdvisor - container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Blackbox Exporter - endpoint probing
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://grafana.example.com
          - https://nextcloud.example.com
          - https://usulnet.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # Docker daemon metrics (requires daemon.json config)
  - job_name: 'docker'
    static_configs:
      - targets: ['host.docker.internal:9323']
```
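The last job scrapes the Docker daemon itself, which only exposes metrics after you opt in via `/etc/docker/daemon.json` and restart the daemon (older Docker releases also required `"experimental": true` alongside this setting). A minimal fragment, assuming you accept binding on all interfaces so the Prometheus container can reach it; restrict the address on exposed hosts:

```json
{
  "metrics-addr": "0.0.0.0:9323"
}
```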
## Key Exporters
| Exporter | Purpose | Key Metrics |
|---|---|---|
| Node Exporter | Host system metrics | CPU, memory, disk, network, load |
| cAdvisor | Container metrics | Per-container CPU, memory, network, I/O |
| Blackbox Exporter | Endpoint probing | HTTP status, latency, SSL expiry |
| Postgres Exporter | PostgreSQL metrics | Connections, queries, replication lag |
| MySQL Exporter | MySQL/MariaDB metrics | Connections, queries, InnoDB stats |
| Redis Exporter | Redis metrics | Memory, commands, keyspace |
| Nginx Exporter | Nginx metrics | Requests, connections, response times |
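Most of these follow the same pattern: run the exporter next to the service and add a scrape job. As one sketch, the Postgres Exporter from the prometheus-community project (the image name, `DATA_SOURCE_NAME` variable, and default port 9187 follow that exporter's conventions; the `monitor` user, `${PG_PASSWORD}`, and `postgres` hostname are assumptions to adapt):

```yaml
# docker-compose.yml (addition under services:)
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor:${PG_PASSWORD}@postgres:5432/postgres?sslmode=disable"
    networks:
      - monitoring

# prometheus.yml (addition under scrape_configs:)
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```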
## PromQL Basics
PromQL (Prometheus Query Language) is how you query time-series data. Learning these patterns covers 90% of monitoring use cases:
```promql
# Current CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Container CPU usage (per container)
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100

# Container memory usage (in MB)
container_memory_usage_bytes{name!=""} / 1024 / 1024

# HTTP request rate (per second)
rate(http_requests_total[5m])

# HTTP error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Disk I/O operations per second
rate(node_disk_io_time_seconds_total[5m])
```
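To demystify `histogram_quantile()`: Prometheus histograms are cumulative bucket counters, and the quantile is estimated by linear interpolation inside the first bucket whose cumulative count reaches the target rank. A rough Python sketch with made-up buckets (the real function also handles the `le="+Inf"` bucket and aggregation across series):

```python
# Hypothetical cumulative buckets: (upper bound in seconds, cumulative count)
buckets = [(0.1, 40), (0.5, 90), (1.0, 100)]

def quantile(q, buckets):
    rank = q * buckets[-1][1]          # target rank among all observations
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

print(quantile(0.95, buckets))  # 0.75 -> estimated p95 latency of 750ms
```

This is also why quantile accuracy depends on bucket layout: the estimate can never be more precise than the bucket boundaries you chose.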
The `rate()` function is the most important in PromQL. It calculates the per-second rate of increase of a counter, compensating for counter resets. Always use `rate()` with counters (metrics that only go up, like `_total` metrics). Use `irate()` for the instant rate (based on the last two samples) when you need more sensitive, spikier graphs.
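The arithmetic behind `rate()` can be sketched in a few lines of Python with hypothetical sample values (the real function also extrapolates to the window boundaries and handles resets):

```python
# Two hypothetical samples of a counter metric: (value, unix timestamp)
older = (1000.0, 1_700_000_000)
newer = (1150.0, 1_700_000_060)

# Per-second rate of increase over the 60s between the samples
per_second = (newer[0] - older[0]) / (newer[1] - older[1])
print(per_second)  # 2.5
```

So a counter that grew by 150 requests over a 60-second window graphs as 2.5 requests/second, regardless of the counter's absolute value.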
## Alert Rules

```yaml
# alert-rules.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ printf \"%.1f\" $value }}% for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ printf \"%.1f\" $value }}%."

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage is {{ printf \"%.1f\" $value }}%."

      - alert: ContainerDown
        # container_last_seen keeps the per-container name label, so the
        # alert can report which container stopped reporting.
        expr: time() - container_last_seen{name!=""} > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: ContainerHighMemory
        expr: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage is high"
          description: "Container is using {{ printf \"%.1f\" $value }}% of its memory limit."

      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}."

      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"
```
## Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      repeat_interval: 1h

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        send_resolved: true
  - name: 'critical'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        send_resolved: true
    # Optional: also send email for critical alerts
    # email_configs:
    #   - to: '[email protected]'
    #     from: '[email protected]'
    #     smarthost: 'smtp.example.com:587'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
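If you point `webhook_configs` at your own service instead of Slack, Alertmanager POSTs a JSON document (webhook format version 4). A small Python sketch of parsing one — the payload below is illustrative, but the `status`/`alerts`/`labels`/`annotations` fields match the documented webhook format:

```python
import json

# Illustrative Alertmanager webhook payload (format version 4)
payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighCPUUsage", "severity": "warning", "instance": "homelab-server"},
      "annotations": {"summary": "High CPU usage on homelab-server"}
    }
  ]
}
""")

def format_alerts(payload):
    """Render each alert in the payload as a one-line notification string."""
    return [
        f'[{a["labels"]["severity"].upper()}] {a["annotations"]["summary"]} ({a["status"]})'
        for a in payload["alerts"]
    ]

print(format_alerts(payload))  # ['[WARNING] High CPU usage on homelab-server (firing)']
```

Because `send_resolved: true` is set above, the same endpoint also receives payloads with `"status": "resolved"`, so a receiver should handle both cases.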
## Grafana Dashboard Provisioning

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true
```
Popular pre-built dashboard IDs for Grafana (import via Dashboard > Import):
- 1860: Node Exporter Full (comprehensive host metrics)
- 14282: cAdvisor (container metrics)
- 13659: Blackbox Exporter (endpoint monitoring)
- 9628: PostgreSQL Exporter
- 7362: MySQL Overview
## Service Discovery for Docker

```yaml
# prometheus.yml - Auto-discover Docker containers with labels
# (Prometheus needs /var/run/docker.sock mounted read-only for this)
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      # Only scrape containers with prometheus.scrape=true label
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: 'true'
        action: keep
      # Combine the container's IP with its prometheus.port label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Honour an optional prometheus.path label for the metrics path
      - source_labels: [__meta_docker_container_label_prometheus_path]
        regex: '(.+)'
        target_label: __metrics_path__
      # Use container name as instance label
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        target_label: instance
```

Then label your containers:

```yaml
# docker-compose.yml
services:
  myapp:
    image: myapp:latest
    labels:
      prometheus.scrape: "true"
      prometheus.port: "8080"
      prometheus.path: "/metrics"
```
## Storage and Retention

```yaml
# Prometheus storage configuration
# In the command section of your docker-compose.yml:
command:
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=90d'   # Keep data for 90 days
  - '--storage.tsdb.retention.size=10GB'  # Or until 10GB is used
  - '--storage.tsdb.min-block-duration=2h'
  - '--storage.tsdb.max-block-duration=36h'
```

```shell
# Check storage usage
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool

# Estimate storage needs:
#   ~1-2 bytes per sample (after TSDB compression)
#   15s scrape interval = 4 samples/minute per metric
#   500 metrics * 4 samples/min * 60 min * 24 hr * 30 days
#   = ~86.4M samples/month = ~90-170MB/month (compressed)
```
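The back-of-envelope estimate above, as a quick script. The ~2 bytes/sample figure is a rough rule of thumb for Prometheus's compressed TSDB, so treat the result as an order-of-magnitude guide rather than a promise:

```python
metrics = 500                 # active time series
scrape_interval_s = 15        # -> 4 samples/minute per series
days = 30
bytes_per_sample = 2          # rough upper bound after TSDB compression

samples_per_month = metrics * (60 // scrape_interval_s) * 60 * 24 * days
mb_per_month = samples_per_month * bytes_per_sample / 1024 / 1024

print(f"{samples_per_month:,} samples/month")  # 86,400,000 samples/month
print(f"~{mb_per_month:.0f} MB/month")         # ~165 MB/month
```

Plug in your own series count (the `prometheus_tsdb_head_series` metric reports it) to size the retention flags above.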
## Best Practices
- Monitor the monitoring. Set up an external health check (e.g., Healthchecks.io) that alerts you if Prometheus itself goes down.
- Use recording rules for frequently queried or expensive PromQL expressions.
- Limit label cardinality. Every unique label set creates a new time series.
- Set meaningful alert thresholds. Avoid alert fatigue by only alerting on actionable conditions.
- Use Grafana variables to create reusable dashboards that work across multiple hosts.
- Back up Grafana dashboards by exporting JSON and storing in version control.
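On the recording-rules point: a recording rule precomputes an expensive expression at every `evaluation_interval` and stores the result under a new metric name, so dashboards query the cheap precomputed series. A sketch reusing the CPU expression from earlier (the rule name follows the common `level:metric:operation` naming convention; put it in any file listed under `rule_files`):

```yaml
groups:
  - name: recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

A dashboard panel can then query `instance:node_cpu_utilisation:rate5m` directly instead of re-evaluating the full expression on every refresh.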
While Prometheus and Grafana provide deep monitoring capabilities, platforms like usulnet integrate monitoring directly into the container management experience, with built-in dashboards for container health, resource usage, and logs. For complex monitoring needs, usulnet complements a dedicated Prometheus/Grafana stack by providing container-specific visibility without additional configuration.