Observability Vertical Part 1 - Collecting Metrics with Docker Compose
Series: Building a self-hosted observability stack from scratch
You just shipped something. It's running in production — a VPS, a cheap cloud box, maybe a few containers on a Hetzner instance. And you have no idea what it's actually doing.
Is memory climbing? Is that container silently restarting? Is the disk going to fill up at 3am?
This post walks you through assembling a complete metrics collection stack on a single machine using Docker Compose. No cloud vendor lock-in, no per-seat pricing, no managed service. Just four containers, a config file, and full visibility into your infrastructure.
By the end you'll have:
- Grafana Alloy collecting metrics from your host and containers
- Prometheus storing them
- Node Exporter exposing host metrics (CPU, memory, disk, network)
- cAdvisor exposing container metrics
- Grafana visualising everything with dashboards
Why this stack
There are hosted options: Grafana Cloud, Datadog, New Relic. They're good products. They also bill per host, with per-metric pricing that can surprise you at invoice time.
For a solo founder or small team, this stack runs on a $6/month VPS and you own all the data. The tradeoff is that you manage it. That's what this series is for.
Grafana Alloy is the piece worth explaining. It replaced Grafana Agent in 2024 as the unified collector for the Grafana ecosystem. One binary that can scrape Prometheus targets, collect logs (Loki), and receive traces (Tempo). It's the connective tissue of the entire observability stack — in Part 1 we use it for metrics, but the same Alloy instance will handle logs and traces in later parts without adding new infrastructure.
Prerequisites
- Docker and Docker Compose installed
- A Linux host (bare metal or VPS)
- systemd available (Node Exporter runs as a host service, not in Docker)
- Ports 3000, 8080, 9090, 9100, and 12345 available locally
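A quick pre-flight check can confirm those ports are free before anything is started. The sketch below assumes the `ss` utility from iproute2 is installed (common on modern Linux, but an assumption here). `ports_in_use` is a hypothetical helper name, and the port list mirrors the compose files in this post, including 12345, which the Alloy compose file publishes for its UI.

```shell
#!/bin/sh
# Pre-flight check: report which of the required ports already have a listener.
# Assumption: `ss` (iproute2) is available; `ports_in_use` is a hypothetical helper.

REQUIRED_PORTS="3000 8080 9090 9100 12345"

# Reads `ss -ltnH` style output on stdin and prints each required port
# that shows up as a local listening port, one per line, in input order.
ports_in_use() {
  awk -v wanted="$REQUIRED_PORTS" '
    BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
    {
      # Field 4 is the local address, e.g. 0.0.0.0:9090 or [::]:9090
      port = $4
      sub(/.*:/, "", port)
      if (port in want && !(port in seen)) { seen[port] = 1; print port }
    }
  '
}

# Usage on a real host (no output means all required ports are free):
#   ss -ltnH | ports_in_use
```

Parsing stdin rather than calling `ss` inside the function keeps the helper easy to try against canned input.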
Project structure
Each service gets its own directory and docker-compose.yml. They communicate over a shared Docker bridge network called monitoring — created by the Alloy compose project and joined as external by everything else. Node Exporter is the exception: it runs as a systemd binary on the host, and Alloy reaches it via host.docker.internal.
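The layout shown in the tree that follows can be scaffolded in one step. This is only a sketch: `scaffold_stack` is a hypothetical helper name, and it creates empty placeholder files whose contents are filled in over the rest of this post.

```shell
#!/bin/sh
# Scaffold the per-service directory layout used in this post.
# `scaffold_stack` is a hypothetical helper; pass the base directory as $1.
scaffold_stack() {
  base="${1:-$HOME/.iac-toolbox}"
  mkdir -p \
    "$base/grafana-alloy" \
    "$base/prometheus" \
    "$base/cadvisor" \
    "$base/grafana/provisioning/datasources" || return 1
  # Create empty placeholders without truncating files that already exist.
  for f in \
    grafana-alloy/docker-compose.yml grafana-alloy/config.alloy \
    prometheus/docker-compose.yml prometheus/prometheus.yml prometheus/recording_rules.yml \
    cadvisor/docker-compose.yml \
    grafana/docker-compose.yml grafana/provisioning/datasources/prometheus.yml
  do
    [ -e "$base/$f" ] || : > "$base/$f"
  done
}

# Usage:
#   scaffold_stack            # defaults to ~/.iac-toolbox
#   scaffold_stack /srv/obs   # or any other base directory
```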
~/.iac-toolbox/
├── grafana-alloy/
│   ├── docker-compose.yml
│   └── config.alloy
├── prometheus/
│   ├── docker-compose.yml
│   ├── prometheus.yml
│   └── recording_rules.yml
├── cadvisor/
│   └── docker-compose.yml
└── grafana/
    ├── docker-compose.yml
    └── provisioning/
        └── datasources/
            └── prometheus.yml
Node Exporter: on the host, not in Docker
Before the Docker services, install Node Exporter as a systemd binary. Running it inside a container — even with pid: host and host path mounts — gives subtly wrong readings because process metrics are still scoped to the container namespace. On the host you get accurate numbers.
Download and install the binary, auto-detecting architecture for ARM64 (Raspberry Pi), AMD64, or ARMv7:
ARCH=$(uname -m)
case "$ARCH" in
  aarch64) NE_ARCH="arm64" ;;
  x86_64) NE_ARCH="amd64" ;;
  armv7l) NE_ARCH="armv7" ;;
  # Stop instead of guessing: an unknown machine type would fetch the wrong binary
  *) echo "Unsupported architecture: $ARCH" >&2; exit 1 ;;
esac
VERSION="1.8.1"
# -f makes curl fail on an HTTP error instead of saving the error page
curl -fL "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-${NE_ARCH}.tar.gz" \
  -o /tmp/node_exporter.tar.gz
tar -xzf /tmp/node_exporter.tar.gz -C /tmp
sudo cp "/tmp/node_exporter-${VERSION}.linux-${NE_ARCH}/node_exporter" /usr/local/bin/
sudo chmod 755 /usr/local/bin/node_exporter
Create the systemd unit at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
# Replace pi with an unprivileged user that exists on your host
User=pi
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Verify it's up
curl -s http://localhost:9100/metrics | head -5
Node Exporter is now running on port 9100 on the host. Alloy will reach it from inside Docker via host.docker.internal:9100.
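One step the install snippet above skips is verifying the download before it lands in /usr/local/bin. The sketch below assumes the release page hosts a `sha256sums.txt` file next to the tarballs (worth confirming for the version you pin) and that `sha256sum` from coreutils is installed. `verify_sha256` is a hypothetical helper name.

```shell
#!/bin/sh
# Compare a file's SHA-256 digest with an expected value.
# Returns 0 on match, 1 on mismatch or when no expected value is given.
# `verify_sha256` is a hypothetical helper.
verify_sha256() {
  file="$1"
  expected="$2"
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
    return 0
  fi
  echo "checksum mismatch for $file" >&2
  return 1
}

# Usage with the variables from the install snippet (assumes sha256sums.txt
# is published alongside the release tarballs):
#   TARBALL="node_exporter-${VERSION}.linux-${NE_ARCH}.tar.gz"
#   EXPECTED="$(curl -sL "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/sha256sums.txt" \
#     | awk -v f="$TARBALL" '$2 == f {print $1}')"
#   verify_sha256 /tmp/node_exporter.tar.gz "$EXPECTED" || exit 1
```

Treating an empty expected value as a failure means a missing checksum file cannot pass as a successful verification.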
The Docker Compose stack
Each service has its own compose file. They all join a shared monitoring Docker network — Alloy creates it, the others reference it as external.
Grafana Alloy (~/.iac-toolbox/grafana-alloy/docker-compose.yml):
services:
  grafana-alloy:
    image: grafana/alloy:v1.2.1
    container_name: grafana-alloy
    restart: always
    ports:
      - "12345:12345" # Alloy UI: pipeline graph and component health
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy
    networks:
      - monitoring
    command:
      - run
      - "--server.http.listen-addr=0.0.0.0:12345"
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"
    extra_hosts:
      - "host.docker.internal:host-gateway" # lets Alloy reach Node Exporter on the host
networks:
  monitoring:
    name: monitoring
    driver: bridge # Alloy owns this network
Prometheus (~/.iac-toolbox/prometheus/docker-compose.yml):
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./recording_rules.yml:/etc/prometheus/recording_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-remote-write-receiver' # accepts push from Alloy
    networks:
      - monitoring
volumes:
  prometheus_data:
networks:
  monitoring:
    name: monitoring
    external: true # lifecycle owned by grafana-alloy compose project
cAdvisor (~/.iac-toolbox/cadvisor/docker-compose.yml):
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: always
    privileged: true # required for cgroup access
    devices:
      - /dev/kmsg # kernel message buffer, needed for OOM detection
    ports:
      - "127.0.0.1:8080:8080" # localhost only; Alloy reaches it via the network
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - monitoring
networks:
  monitoring:
    external: true
Grafana (~/.iac-toolbox/grafana/docker-compose.yml):
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning:ro # auto-provisions datasources
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme # change this
      - GF_SERVER_ROOT_URL=http://localhost:3000
    networks:
      - monitoring
volumes:
  grafana_data:
networks:
  monitoring:
    name: monitoring
    external: true # created by the grafana-alloy compose project
A few things worth noting:
extra_hosts: host.docker.internal:host-gateway on Alloy is what lets a container resolve the host machine's IP. Without it, host.docker.internal won't resolve inside Linux containers (unlike Docker Desktop on Mac, where it's built in). privileged: true on cAdvisor is required to access cgroup data for container metrics, and /dev/kmsg is the kernel message buffer — cAdvisor needs it for accurate OOM event detection, and you'll see log warnings without it.
The key Prometheus flag is --web.enable-remote-write-receiver. It enables the /api/v1/write endpoint that accepts incoming pushes from Alloy. Without it, Prometheus rejects Alloy's metric payloads and you get an empty TSDB; the failures are easy to miss because they surface on the Alloy side rather than in Prometheus.
Configuring Grafana Alloy
Alloy uses its own HCL-inspired configuration syntax (called River back in Grafana Agent Flow). It reads cleaner than YAML for pipeline definitions. Create ~/.iac-toolbox/grafana-alloy/config.alloy:
// ── Scrape Node Exporter (running on the host as systemd) ────────────────
prometheus.scrape "node_exporter" {
targets = [{
__address__ = "host.docker.internal:9100", // host service, not a container
instance = "my-server", // shown as host selector in dashboards
job = "node_exporter",
}]
scrape_interval = "15s"
forward_to = [prometheus.relabel.node_exporter_compat.receiver]
}
// ── Scrape cAdvisor (on the monitoring Docker network) ───────────────────
prometheus.scrape "cadvisor" {
targets = [{
__address__ = "cadvisor:8080", // reachable by container name on the network
instance = "my-server",
job = "cadvisor",
}]
scrape_interval = "15s"
forward_to = [prometheus.remote_write.platform.receiver]
}
// ── Relabel pass-through (add macOS metric renames here if needed) ────────
// On Linux this block is a no-op — metrics pass through unchanged.
// On macOS, Node Exporter uses different memory metric names than Linux.
// Add rename rules here to map them to the Linux names that dashboard 1860 expects.
prometheus.relabel "node_exporter_compat" {
forward_to = [prometheus.remote_write.platform.receiver]
}
// ── Push to Prometheus via remote_write ──────────────────────────────────
prometheus.remote_write "platform" {
endpoint {
url = "http://prometheus:9090/api/v1/write"
}
}
This is Alloy's pipeline model: scrape targets → optional relabel → remote_write to Prometheus. The instance label on each scrape target is what ties metrics to a specific host in Grafana dashboards; it becomes the host dropdown in Node Exporter Full (dashboard 1860). Set it to your hostname or a meaningful identifier.
In later parts, you'll add loki.write and otelcol.receiver.* blocks to the same file for logs and traces without touching anything else. The Alloy UI at http://localhost:12345 shows the live pipeline graph with each component's health, which is useful for debugging why a target isn't being scraped.
Configuring Prometheus
Prometheus needs to know where to store data and how long to keep it. The scraping of app services is handled by Alloy (push model), so this config is minimal — Prometheus only scrapes itself. Create ~/.iac-toolbox/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/recording_rules.yml # recording rules for dashboard compatibility
# Metrics from the host and app services are pushed by Grafana Alloy via remote_write,
# so they need no static_configs here.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
The rule_files entry points to a recording rules file. Create ~/.iac-toolbox/prometheus/recording_rules.yml:
groups:
  - name: compat_rules
    interval: 15s
    rules:
      # Node Exporter on macOS exposes swap_used_bytes but not swap_free_bytes.
      # The Node Exporter Full dashboard needs SwapFree to compute SWAP Used %.
      # This recording rule derives it from the two available metrics.
      # On Linux, where node_memory_swap_used_bytes is not exposed, the expression
      # matches nothing and the rule is a no-op; harmless to keep.
      - record: node_memory_SwapFree_bytes
        expr: node_memory_SwapTotal_bytes - node_memory_swap_used_bytes
In Part 2, Terraform generates alert rule files and mounts them here alongside this file. The path is already wired so Part 2 requires no changes to this config.
Starting the stack
Order matters because of the shared monitoring network — Alloy creates it, so it goes first:
# 1. Start Alloy (creates the monitoring network)
cd ~/.iac-toolbox/grafana-alloy && docker compose up -d
# 2. Start Prometheus (joins the network)
cd ~/.iac-toolbox/prometheus && docker compose up -d
# 3. Start cAdvisor (joins the network)
cd ~/.iac-toolbox/cadvisor && docker compose up -d
# 4. Start Grafana (joins the network)
cd ~/.iac-toolbox/grafana && docker compose up -d
# Check all containers are running
docker ps --format "table {{.Names}}\t{{.Status}}"
# Tail logs if something looks wrong
docker logs -f grafana-alloy
docker logs -f prometheus
Verify metrics are flowing end-to-end:
# Prometheus health check
curl http://localhost:9090/-/healthy
# Node Exporter on the host
curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total | head -3
# Query Prometheus — should return data if Alloy's remote_write is working
curl -s 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total' \
  | jq '.data.result | length'
If that last query returns 0, open the Alloy UI at http://localhost:12345. It shows a pipeline graph with the health of each component. A red prometheus.remote_write block means Prometheus is unreachable; a red prometheus.scrape block means Alloy can't reach that target.
To verify Prometheus is receiving remote_write pushes specifically:
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_remote_storage_samples_in_total' \
  | jq '.data.result[0].value[1]'
A non-zero value confirms Alloy is successfully pushing metrics.
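Right after `docker compose up -d`, these checks can come back empty simply because the first scrape-and-push cycle has not completed yet (the scrape interval is 15s). A small retry loop avoids a false alarm. This is a sketch: `retry` and `node_series_present` are hypothetical helper names, and the attempt count and delay in the usage line are arbitrary.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds once Prometheus returns at least one node_cpu_seconds_total series
# (same query as above; requires curl and jq).
node_series_present() {
  n="$(curl -s 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total' \
    | jq '.data.result | length')"
  [ "${n:-0}" -gt 0 ] 2>/dev/null
}

# Usage: wait up to roughly a minute for the first samples to arrive.
#   retry 12 5 node_series_present && echo "metrics are flowing"
```

If the loop still fails after a minute, the Alloy UI is the next place to look.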
Setting up Grafana
Open http://localhost:3000 and log in with admin / changeme.
The Prometheus datasource is already wired up. Grafana reads provisioning/datasources/prometheus.yml at startup and creates it automatically. The provisioning file at ~/.iac-toolbox/grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090 # container name: Grafana is on the monitoring network
    isDefault: true
    access: proxy
    editable: false
No need to click through Settings → Data sources. If you visit that page you'll see Prometheus is already there with a green "Data source is working" status.
Import community dashboards:
Rather than building dashboards from scratch, import the standard community ones:
| Dashboard | ID | What it shows |
|---|---|---|
| Node Exporter Full | 1860 | CPU, memory, disk, network per host |
| Docker Container Metrics | 193 | Per-container CPU, memory, restarts |
| Prometheus Stats | 3662 | Prometheus itself — useful for monitoring your monitoring |
Dashboards → Import → enter the ID → select your Prometheus data source → Import.
You can also import via the API to automate this step:
# Fetch the dashboard JSON from Grafana.com and import it
DASHBOARD_JSON=$(curl -s https://grafana.com/api/dashboards/1860 | jq '.json')
curl -s -X POST http://localhost:3000/api/dashboards/import \
-u admin:changeme \
-H 'Content-Type: application/json' \
-d "{
\"dashboard\": $DASHBOARD_JSON,
\"overwrite\": true,
\"inputs\": [{
\"name\": \"DS_PROMETHEUS\",
\"type\": \"datasource\",
\"pluginId\": \"prometheus\",
\"value\": \"Prometheus\"
}]
}"
What you can see now
Once the dashboards are imported, you have visibility into:
Host level (Node Exporter)
- CPU usage per core and load average
- Memory usage, available memory, swap
- Disk usage and I/O per mount point and device
- Network traffic, errors, and drops per interface
- System load and running processes
Container level (cAdvisor)
- CPU usage per container
- Memory usage vs limit per container
- Container restart count
- Network I/O per container
- Disk I/O per container
This is your baseline. Everything has a number. When something breaks, you'll be able to look back and see exactly when metrics started deviating.
A note on data retention
The compose file sets --storage.tsdb.retention.time=15d, which keeps 15 days of metrics. If you want a longer window, raise it to 30d: for a single host, expect roughly 1-2GB of storage for 30 days depending on scrape frequency and metric cardinality. For history beyond that, consider Thanos or Grafana Mimir, but that's well beyond the scope of this series.
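For a back-of-the-envelope figure, the Prometheus storage documentation gives a rough sizing formula: needed disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with samples typically compressing to about 1-2 bytes each. The sketch below applies it. `tsdb_bytes` is a hypothetical helper, and the series count in the usage line is a made-up example, not a measurement from this stack.

```shell
#!/bin/sh
# Rough TSDB size estimate:
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
# where samples_per_second = active_series / scrape_interval_seconds.
# Usage: tsdb_bytes <active_series> <scrape_interval_s> <retention_days> <bytes_per_sample>
tsdb_bytes() {
  awk -v series="$1" -v interval="$2" -v days="$3" -v bps="$4" \
    'BEGIN { printf "%.0f\n", (series / interval) * bps * days * 86400 }'
}

# Example with assumed numbers: 3000 active series, 15s interval,
# 30 days of retention, 2 bytes per sample.
#   tsdb_bytes 3000 15 30 2    # prints 1036800000 (about 1 GB)
```

The live series count for your own host is available from the prometheus_tsdb_head_series metric, which is a better input than a guess.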
What's next
You now have full visibility into your infrastructure. The next question is: who finds out when something goes wrong?
Part 2 covers building the alerting layer on top of these metrics — Prometheus alert rules, Alertmanager routing, and wiring it all to on-call notification. The entire alerting configuration is managed with Terraform, including configurable thresholds per environment.
The rule_files path you saw in prometheus.yml is already waiting for it.
The full series
| Part | Topic | Status |
|---|---|---|
| 1 | Collecting metrics — Alloy, Prometheus, Node Exporter, cAdvisor, Grafana | ✅ This post |
| 2 | Alerting layer — Prometheus rules, Alertmanager, PagerDuty via Terraform | Coming soon |
| 3 | Logs — Loki + Alloy | Planned |
| 4 | Traces — Tempo + OpenTelemetry via Alloy | Planned |
| 5 | SLOs — infrastructure SLOs with Sloth, burn rate alerts | Planned |
All configs from this post are available at github.com/iac-toolbox.