Observability Vertical Part 1.5 - Replacing Prometheus with Grafana Mimir
Series: Building a self-hosted observability stack from scratch
In Part 1 we set up Prometheus as the metrics store. It works well and the operational overhead is low. But it has two limitations that matter once you're running something real:
Alerting is coupled to Grafana. If Grafana goes down, alert evaluation stops. Your on-call doesn't get paged. For a homelab this is acceptable — for anything you care about at 2am, it isn't.
Storage is local disk only. Prometheus writes to a local TSDB. If the disk fills up, if the machine dies, or if you want more than 30 days of history, you're either running out of road or bolting on Thanos.
This post replaces Prometheus with Grafana Mimir in monolithic mode — a single-container drop-in that addresses both problems. Alloy's config barely changes. Grafana's config barely changes. But you gain a built-in Ruler that evaluates alert rules independently of Grafana, a built-in Alertmanager that routes to PagerDuty, and object storage for as much history as you want to pay S3 for.
Mimir is what Grafana Cloud runs under the hood. The monolithic mode is the same binary, just running all components in one process. It's genuinely not much harder to run than Prometheus — and it's entirely free and open source.
One thing to be honest about upfront: this post doesn't eliminate the single point of failure — it moves it. With Grafana-managed alerts, Grafana is the SPOF. After this post, Mimir is the SPOF. If Mimir goes down, the Ruler stops evaluating and the Alertmanager stops routing, same as before.
What you do gain is a meaningfully narrower failure surface: Mimir has no UI, no plugins, no user sessions, none of the operational complexity that makes Grafana crash in practice. Critically, you also gain the path to actual HA. Going from one Mimir instance to three is mostly a matter of running the extra instances and setting replication_factor: 3. The Ruler shards rule groups across instances through its hash ring, so only one instance evaluates each rule group at a time, and if that instance dies the group is reassigned to a surviving one. That upgrade is the subject of a future post. This post sets the foundation for it.
What changes and what doesn't
| Component | Part 1 | After this post |
|---|---|---|
| Alloy | Scrapes, remote_writes to Prometheus | Same — remote_writes to Mimir instead |
| Prometheus | Stores metrics, evaluates nothing | Replaced by Mimir |
| Grafana | Queries Prometheus, evaluates alert rules | Queries Mimir, alert rules move to Mimir Ruler |
| Alertmanager | Not present | Built into Mimir, routes to PagerDuty |
| Node Exporter | Runs on host as systemd | Unchanged |
| cAdvisor | Container metrics | Unchanged |
The alert rules you wrote in Part 2 stay the same — the PromQL expressions don't change. The difference is where they're evaluated: Mimir's Ruler instead of Grafana's built-in engine. Grafana becomes a pure visualisation layer with no role in the alerting path.
Why not just Thanos?
Thanos is the other common answer to "Prometheus but with HA and long-term storage". It works by adding sidecar processes to your existing Prometheus instances and layering a query component on top. It's a good system but it keeps Prometheus in the picture — you're operating Prometheus and Thanos, and the Thanos Ruler is a separate component you have to run and wire up.
Mimir replaces Prometheus entirely with a single binary that already includes the equivalent of Thanos Sidecar, Thanos Store, Thanos Query, Thanos Ruler, and a clustered Alertmanager. For a fresh setup or a clean migration, Mimir is simpler.
Prerequisites
Everything from Part 1, plus:
- An S3-compatible object storage bucket — AWS S3, Cloudflare R2, Hetzner Object Storage, or MinIO running locally
- Bucket credentials available as environment variables
If you want to run fully local with no cloud dependency, MinIO is a drop-in S3-compatible store you can run as another Docker container. There's a MinIO setup at the end of this post for that case.
How Mimir monolithic mode works
In distributed mode, Mimir splits into ~10 separate components (distributor, ingester, querier, ruler, alertmanager, etc.) that scale independently. In monolithic mode, all of those components run in a single process behind --target=all.
The replication and storage architecture is identical — data still flows through ingesters to object storage, the ruler still evaluates independently, the alertmanager is still clustered-capable. You're just not operating them as separate processes yet. If you later need to scale, you split the components out without changing anything upstream.
For a single host or small fleet, monolithic mode is the right starting point. The operational overhead is comparable to running Prometheus.
Project structure
~/.iac-toolbox/
├── grafana-alloy/
│   ├── docker-compose.yml
│   └── config.alloy              # one line changes: remote_write URL
├── mimir/
│   ├── docker-compose.yml        # new
│   └── mimir.yaml                # new
├── grafana/
│   ├── docker-compose.yml
│   └── provisioning/
│       ├── datasources/
│       │   └── mimir.yml         # updated: points to Mimir
│       └── alerting/
│           └── alertmanager.yml  # new: tells Grafana to use Mimir's Alertmanager
└── cadvisor/
    └── docker-compose.yml        # unchanged
Prometheus is removed entirely. Everything else gets a targeted update.
Mimir configuration
Create ~/.iac-toolbox/mimir/mimir.yaml:
# Monolithic mode: all components in one process
target: all
# Multi-tenancy disabled for single-user self-hosted setup
# All metrics go under the tenant ID "anonymous"
multitenancy_enabled: false
# ── Object storage ────────────────────────────────────────────────────────
# Mimir uses object storage for three things:
#   blocks       : compacted TSDB blocks (long-term metrics)
#   ruler        : alert rule files
#   alertmanager : alertmanager config and state
#
# All three can use the same bucket with different prefixes, or separate buckets.
# Using one bucket with prefixes is simpler for a self-hosted setup.
# The ${VAR} references are only expanded when Mimir runs with -config.expand-env=true.
common:
  storage:
    backend: s3
    s3:
      bucket_name: ${MIMIR_S3_BUCKET}
      endpoint: ${MIMIR_S3_ENDPOINT}   # e.g. s3.amazonaws.com or <accountid>.r2.cloudflarestorage.com
      region: ${MIMIR_S3_REGION}
      access_key_id: ${MIMIR_S3_ACCESS_KEY}
      secret_access_key: ${MIMIR_S3_SECRET_KEY}
blocks_storage:
  s3:
    bucket_name: ${MIMIR_S3_BUCKET}
  tsdb:
    dir: /data/tsdb          # local TSDB head and WAL before blocks are shipped to S3
    retention_period: 13h    # how long blocks stay on local disk
# storage_prefix sits next to the backend block, not inside s3,
# and may only contain letters and digits (no trailing slash).
ruler_storage:
  storage_prefix: ruler
  s3:
    bucket_name: ${MIMIR_S3_BUCKET}
alertmanager_storage:
  storage_prefix: alertmanager
  s3:
    bucket_name: ${MIMIR_S3_BUCKET}
# ── Ingester ──────────────────────────────────────────────────────────────
# replication_factor: 1 for single-node (no replication)
# Raise to 3 when running multiple Mimir instances for HA
ingester:
  ring:
    replication_factor: 1
# ── Compactor ─────────────────────────────────────────────────────────────
# Merges small TSDB blocks in object storage into larger ones over time.
# Keeps query performance stable as data ages.
compactor:
  data_dir: /data/compactor
# ── Limits ────────────────────────────────────────────────────────────────
# Defaults are set for large multi-tenant deployments.
# For a homelab you can relax ingestion limits to avoid 429s.
limits:
  ingestion_rate: 10000              # samples/sec, default 10000, fine for small setups
  max_global_series_per_user: 0      # 0 = unlimited
  out_of_order_time_window: 5m       # accept slightly out-of-order samples (useful for Alloy batching)
# ── Ruler ─────────────────────────────────────────────────────────────────
# The Ruler evaluates PromQL alert rules on a schedule, independently of Grafana.
# Alert notifications go to the Alertmanager component (also built into this binary).
ruler:
  alertmanager_url: http://localhost:9009/alertmanager   # internal: same process
# ── Alertmanager ──────────────────────────────────────────────────────────
# Mimir's built-in Alertmanager receives firing alerts from the Ruler
# and routes them to PagerDuty, email, or any other receiver.
# Config is stored in object storage (alertmanager prefix) and can be
# pushed via the API; see "Configuring the Alertmanager" section below.
alertmanager:
  external_url: http://localhost:9009/alertmanager
  enable_api: true   # enables the API to push alertmanager config
# ── Server ────────────────────────────────────────────────────────────────
server:
  http_listen_port: 9009
  log_level: warn   # reduce noise; set to info for debugging
A few things worth understanding here:
multitenancy_enabled: false means all data is stored under the tenant anonymous. Mimir was designed for multi-tenant SaaS deployments where each customer's data is isolated. For a self-hosted single-user setup this is overhead you don't need. Disabling it means you don't have to pass an X-Scope-OrgID header on every request.
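To make the header point concrete, here is a minimal sketch of a query helper that only sends X-Scope-OrgID when a tenant is supplied. The function name mimir_query and the MIMIR_URL variable are hypothetical, chosen for illustration; the endpoint path is the Prometheus-compatible query API used elsewhere in this post.

```shell
# Sketch: run an instant query against Mimir's Prometheus-compatible API.
# $1 = PromQL expression
# $2 = optional tenant ID (only needed when multitenancy_enabled is true)
# MIMIR_URL is a hypothetical override for the base URL.
mimir_query() {
  query="$1"
  tenant="${2:-}"
  url="${MIMIR_URL:-http://localhost:9009}/prometheus/api/v1/query"
  if [ -n "$tenant" ]; then
    # Multi-tenant setups must identify the tenant on every request
    curl -s -G "$url" -H "X-Scope-OrgID: $tenant" --data-urlencode "query=$query"
  else
    # With multitenancy disabled, no header is required
    curl -s -G "$url" --data-urlencode "query=$query"
  fi
}
```

With the config from this post, mimir_query 'up' is enough. If you later enable multitenancy, the same call becomes mimir_query 'up' my-tenant.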
out_of_order_time_window: 5m is worth setting. Alloy batches samples before remote_write, and retries or network jitter can mean samples arrive slightly out of chronological order. Without this setting, Mimir rejects any sample that is older than the newest one it has already ingested for that series. Five minutes covers any realistic batching delay.
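The window can be pictured as a simple timestamp comparison. The sketch below is a simplified model, not Mimir's actual ingester logic: it assumes the only inputs are the newest timestamp already ingested, the incoming sample's timestamp, and the window length, and the function name accepts_sample is made up for illustration.

```shell
# Simplified model of the out-of-order window (not Mimir's actual code).
# $1 = newest timestamp already ingested (unix seconds)
# $2 = timestamp of the incoming sample (unix seconds)
# $3 = window in seconds (300 for out_of_order_time_window: 5m)
# Exit status 0 means the sample would be accepted.
accepts_sample() {
  newest_ts="$1"
  sample_ts="$2"
  window_s="$3"
  [ "$sample_ts" -ge $((newest_ts - window_s)) ]
}

accepts_sample 1000 880 300 && echo "120s late: accepted"
accepts_sample 1000 600 300 || echo "400s late: rejected"
```

With a window of 0 (the behaviour without the setting), only samples at or after the newest timestamp pass the check.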
The ruler.alertmanager_url pointing to localhost:9009/alertmanager is the Ruler talking to the Alertmanager component running in the same process. In distributed mode these would be separate hostnames.
Mimir docker-compose
~/.iac-toolbox/mimir/docker-compose.yml:
services:
  mimir:
    image: grafana/mimir:latest
    container_name: mimir
    restart: unless-stopped
    ports:
      - "9009:9009"
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml:ro
      - mimir_data:/data
    command:
      - --config.file=/etc/mimir/mimir.yaml
      # Required so Mimir substitutes the MIMIR_S3_* variables referenced in mimir.yaml
      - --config.expand-env=true
    environment:
      - MIMIR_S3_BUCKET=${MIMIR_S3_BUCKET}
      - MIMIR_S3_ENDPOINT=${MIMIR_S3_ENDPOINT}
      - MIMIR_S3_REGION=${MIMIR_S3_REGION}
      - MIMIR_S3_ACCESS_KEY=${MIMIR_S3_ACCESS_KEY}
      - MIMIR_S3_SECRET_KEY=${MIMIR_S3_SECRET_KEY}
    networks:
      - monitoring
volumes:
  mimir_data:
networks:
  monitoring:
    name: monitoring
    external: true
Secrets are passed as environment variables sourced from a .env file. Never hardcode credentials in mimir.yaml, since it ends up in version control.
Create ~/.iac-toolbox/mimir/.env (gitignored):
MIMIR_S3_BUCKET=my-mimir-metrics
MIMIR_S3_ENDPOINT=s3.amazonaws.com
MIMIR_S3_REGION=eu-west-1
MIMIR_S3_ACCESS_KEY=AKIA...
MIMIR_S3_SECRET_KEY=...
Update Alloy: one line change
The only change to Alloy's config is the remote_write URL. Everything else — scrape targets, relabeling, scrape intervals — stays identical.
In ~/.iac-toolbox/grafana-alloy/config.alloy, update:
prometheus.remote_write "platform" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
    // No headers needed: multitenancy is disabled
  }
}
Previously this pointed to http://prometheus:9090/api/v1/write. Same protocol, different host and path. Alloy doesn't know or care that it's now talking to Mimir.
Update Grafana datasource
Update ~/.iac-toolbox/grafana/provisioning/datasources/mimir.yml:
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    url: http://mimir:9009/prometheus
    isDefault: true
    access: proxy
    editable: false
    jsonData:
      httpMethod: POST
      # Tells Grafana this is a Mimir/Cortex datasource, which enables Ruler API integration
      prometheusType: Mimir
The /prometheus path is Mimir's Prometheus-compatible query API. All existing Grafana dashboards (Node Exporter Full, cAdvisor, etc.) continue to work without modification because the PromQL dialect is identical.
prometheusType: Mimir enables Grafana to talk to Mimir's Ruler API directly from the Alerting UI, so you can see rule evaluation status and firing state in Grafana without it being the evaluator.
Configuring the Alertmanager
Mimir's Alertmanager config is stored in object storage and pushed via API — not a mounted config file. This is actually cleaner for Terraform management: the config is an API resource, not a file on disk.
The routing config itself uses the same format as standalone Alertmanager, but Mimir's configuration API expects it wrapped under an alertmanager_config key. Push it via the Mimir API:
cat <<'EOF' > /tmp/alertmanager.yaml
alertmanager_config: |
  route:
    group_by: ['alertname', 'instance']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    receiver: 'default'
    routes:
      - matchers:
          - severity = critical
        receiver: pagerduty
        group_wait: 0s
        repeat_interval: 1h
      - matchers:
          - severity = warning
        receiver: pagerduty
        group_wait: 1m
        repeat_interval: 8h
  receivers:
    - name: 'default'
      email_configs:
        - to: 'you@example.com'
          from: 'alerts@example.com'
          smarthost: 'smtp.example.com:587'
    - name: 'pagerduty'
      pagerduty_configs:
        - routing_key: '<your-pagerduty-integration-key>'
          severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
          description: '{{ len .Alerts.Firing }} alert(s): {{ range .Alerts.Firing }}{{ .Labels.alertname }} {{ end }}'
          details:
            instance: '{{ .CommonLabels.instance }}'
            environment: 'production'
EOF
# Push to Mimir's Alertmanager configuration API
curl -s -X POST http://localhost:9009/api/v1/alerts \
  -H "Content-Type: application/yaml" \
  --data-binary @/tmp/alertmanager.yaml
Verify it was accepted:
# Returns the stored configuration as YAML
curl -s http://localhost:9009/api/v1/alerts
This is also what Terraform manages. See the Terraform section below.
Loading alert rules into the Ruler
Mimir's Ruler also accepts rules via API, in the same Prometheus rule YAML format. You push rule groups to a namespace (a logical grouping, like a folder):
# One file per rule group: the Ruler API accepts a single group per request
cat <<'EOF' > /tmp/node-alerts.yaml
name: node_alerts
interval: 1m
rules:
  - alert: NodeDown
    expr: up{job="node"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} is unreachable"
      description: "No scrape data for more than 2 minutes"
  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory on {{ $labels.instance }}"
      description: "Memory usage is {{ $value | printf \"%.1f\" }}%"
  - alert: SwapInUse
    expr: node_memory_SwapUsed_bytes > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Swap in use on {{ $labels.instance }}"
  - alert: LowDiskSpace
    expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay",mountpoint="/"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay",mountpoint="/"})) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low disk on {{ $labels.instance }}"
      description: "Disk usage is {{ $value | printf \"%.1f\" }}%"
EOF
cat <<'EOF' > /tmp/container-alerts.yaml
name: container_alerts
interval: 1m
rules:
  - alert: ContainerRestarting
    # container_start_time_seconds is a gauge, so count value changes rather than using increase()
    expr: changes(container_start_time_seconds{name!=""}[5m]) > 2
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} is restarting"
  - alert: ContainerOOMKill
    expr: increase(container_oom_events_total{name!=""}[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} was OOM killed"
  - alert: ContainerCPUThrottled
    expr: rate(container_cpu_cfs_throttled_seconds_total{name!=""}[5m]) / rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100 > 25
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.name }} is CPU throttled"
      description: "{{ $value | printf \"%.1f\" }}% of CPU time is throttled"
EOF
# Push both groups to Mimir under namespace "infrastructure"
for f in /tmp/node-alerts.yaml /tmp/container-alerts.yaml; do
  curl -s -X POST \
    "http://localhost:9009/prometheus/config/v1/rules/infrastructure" \
    -H "Content-Type: application/yaml" \
    --data-binary @"$f"
done
Verify rules loaded:
# Lists the stored rule groups as YAML, keyed by namespace
curl -s http://localhost:9009/prometheus/config/v1/rules
# Check evaluation status
curl -s http://localhost:9009/prometheus/api/v1/rules | jq '.data.groups[].name'
Managing everything with Terraform
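In Part 2 we used Terraform to manage grafana_rule_group resources, which are rules evaluated by Grafana's engine. With Mimir, rules live in the Ruler and Alertmanager config lives behind Mimir's configuration API. The snippets below assume a Grafana provider release that accepts a Mimir endpoint for these resources. grafana_rule_group is documented against Grafana's own alerting provisioning API, so confirm Mimir support in the provider documentation for the version you pin before relying on it.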
Update main.tf to point alert rules at Mimir's Ruler instead of Grafana's engine:
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
    pagerduty = {
      source  = "PagerDuty/pagerduty"
      version = "~> 3.0"
    }
  }
}
provider "grafana" {
  url  = var.grafana_url
  auth = "${var.grafana_admin_user}:${var.grafana_admin_password}"
}
# Second provider alias pointing directly at Mimir for Ruler/Alertmanager resources
provider "grafana" {
  alias = "mimir"
  url   = var.mimir_url # http://localhost:9009
  auth  = ""            # no auth on self-hosted Mimir
}
Rule groups use the mimir provider alias, so they go to Mimir's Ruler, not Grafana:
resource "grafana_rule_group" "infrastructure" {
  provider         = grafana.mimir
  name             = "Infrastructure Alerts"
  folder_uid       = "infrastructure" # namespace in Mimir's Ruler
  interval_seconds = 60
  rule {
    name      = "NodeDown"
    condition = "C"
    data {
      ref_id = "A"
      # HCL does not allow ";" as an argument separator, so one argument per line
      relative_time_range {
        from = 300
        to   = 0
      }
      datasource_uid = "__mimir__"
      model = jsonencode({
        expr  = "up{job=\"node\"} == 0"
        refId = "A"
      })
    }
    # ... reduce and threshold stages same as Part 2
    no_data_state = "Alerting"
    for           = "2m"
    labels        = { severity = "critical" }
    annotations = {
      summary = "Node {{ $labels.instance }} is unreachable"
    }
  }
  # ... remaining rules identical to Part 2
}
The Alertmanager config as a Terraform resource:
resource "grafana_mimir_alertmanager_config" "main" {
  provider = grafana.mimir
  # Same routing structure as Part 2's grafana_notification_policy
  # but expressed as raw Alertmanager YAML
  config_yaml = templatefile("${path.module}/templates/alertmanager.yaml.tftpl", {
    pagerduty_key = local.pagerduty_enabled ? module.pagerduty[0].integration_key : ""
    alert_email   = var.alert_email
  })
}
This means terraform apply now configures the Ruler rules and Alertmanager routing in Mimir directly. Grafana is only configured for dashboards and datasource provisioning.
Starting the updated stack
Stop Prometheus first, then start Mimir. Alloy and Grafana get a rolling restart to pick up the config changes:
# Stop Prometheus — Mimir replaces it
cd ~/.iac-toolbox/prometheus && docker compose down
# Start Mimir (joins the existing monitoring network)
cd ~/.iac-toolbox/mimir && docker compose up -d
# Restart Alloy so it picks up the new remote_write URL
cd ~/.iac-toolbox/grafana-alloy && docker compose restart
# Restart Grafana so it picks up the new datasource provisioning
cd ~/.iac-toolbox/grafana && docker compose restart
Verify Mimir is healthy:
# Readiness check: returns "ready" when all components have started
curl http://localhost:9009/ready
# Reachability check: send a deliberately malformed push request
curl -s -X POST http://localhost:9009/api/v1/push \
  -H "Content-Type: application/x-protobuf" \
  --data-binary "$(printf '\x00')"
# You'll get a 400 (malformed), which confirms the endpoint is reachable
# Better ingestion check: query for recent data after a minute
curl -s 'http://localhost:9009/prometheus/api/v1/query?query=node_cpu_seconds_total' \
  | jq '.data.result | length'
# Should return > 0 once Alloy has pushed a few scrapes
Check the Ruler is loaded and evaluating:
curl -s http://localhost:9009/prometheus/api/v1/rules \
  | jq '.data.groups[] | {name: .name, rules: [.rules[].name]}'
Check Alertmanager is running (this uses the Alertmanager v2 API and only responds once a config has been pushed):
curl -s http://localhost:9009/alertmanager/api/v2/status \
  | jq '.uptime'
Local object storage with MinIO
If you'd rather not use a cloud S3 bucket — or want the setup to work offline — MinIO is a self-hosted S3-compatible store that runs as a single Docker container.
Add to ~/.iac-toolbox/minio/docker-compose.yml:
services:
  minio:
    image: minio/minio:latest
    container_name: minio
    restart: unless-stopped
    ports:
      - "9000:9000" # S3 API
      - "9001:9001" # MinIO console UI
    volumes:
      - minio_data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    networks:
      - monitoring
volumes:
  minio_data:
networks:
  monitoring:
    name: monitoring
    external: true
Create the bucket before starting Mimir:
cd ~/.iac-toolbox/minio && docker compose up -d
# Install mc (MinIO client) or use the console at http://localhost:9001
docker exec minio mc alias set local http://localhost:9000 minioadmin minioadmin
docker exec minio mc mb local/mimir-metrics
Then update Mimir's .env to point at MinIO:
MIMIR_S3_BUCKET=mimir-metrics
MIMIR_S3_ENDPOINT=minio:9000 # container name on the monitoring network
MIMIR_S3_REGION=us-east-1 # MinIO ignores region but Mimir requires it
MIMIR_S3_ACCESS_KEY=minioadmin
MIMIR_S3_SECRET_KEY=minioadmin
Then add insecure: true and bucket_lookup_type: path to the S3 config in mimir.yaml. This MinIO instance serves plain HTTP and uses path-style URLs, whereas AWS uses virtual-hosted-style:
common:
  storage:
    backend: s3
    s3:
      bucket_name: ${MIMIR_S3_BUCKET}
      endpoint: ${MIMIR_S3_ENDPOINT}
      region: ${MIMIR_S3_REGION}
      access_key_id: ${MIMIR_S3_ACCESS_KEY}
      secret_access_key: ${MIMIR_S3_SECRET_KEY}
      insecure: true             # no TLS on local MinIO
      bucket_lookup_type: path   # path-style addressing for MinIO
The full local stack is now completely free and self-contained: MinIO on :9000, Mimir on :9009, Grafana on :3000, all networked via Docker. No cloud accounts, no per-GB fees, no data leaving your machine.
What you've gained
Alerting is decoupled from Grafana. The Ruler evaluates rules on schedule and sends firing alerts to the Alertmanager — Grafana doesn't touch this path. If Grafana goes down, PagerDuty still gets paged.
Long-term storage. Metrics older than Prometheus's local retention window now live in object storage indefinitely. S3 standard storage costs roughly $0.023/GB/month — a year of homelab metrics is a few dollars.
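As a rough back-of-the-envelope check on that cost claim, the sketch below estimates a year of stored samples and the resulting monthly bill. The inputs are assumptions, not measurements: 10,000 active series, a 15s scrape interval, and about 2 bytes per compressed sample (index and metadata overhead are ignored). The function names are made up for illustration.

```shell
# Estimate decimal GB stored after one year of scraping.
# $1 = active series, $2 = scrape interval in seconds,
# $3 = assumed bytes per compressed sample
estimate_year_gb() {
  awk -v s="$1" -v i="$2" -v b="$3" \
    'BEGIN { printf "%.2f\n", s * (31536000 / i) * b / 1000000000 }'
}

# Monthly storage cost for a given volume.
# $1 = stored GB, $2 = price per GB per month in USD
monthly_cost_usd() {
  awk -v gb="$1" -v p="$2" 'BEGIN { printf "%.2f\n", gb * p }'
}

gb="$(estimate_year_gb 10000 15 2)"
echo "stored after one year: ${gb} GB"
echo "monthly cost at \$0.023/GB: \$$(monthly_cost_usd "$gb" 0.023)"
```

Under these assumptions the bucket holds roughly 42 GB after a year, which works out to about a dollar per month. Request and egress charges are not included.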
A migration path to HA. When you're ready, you can run three Mimir instances, set replication_factor: 3, and have genuinely HA ingestion and rule evaluation without changing your scrape config, your dashboards, or your alert rules. That's the subject of a future post.
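The reason three is the interesting number comes down to quorum arithmetic: a write succeeds once a majority of the replicas have accepted it. The sketch below is plain arithmetic on the replication factor, with hypothetical function names, and is not taken from Mimir's code.

```shell
# Number of replicas that must accept a write for it to succeed.
# $1 = replication factor
write_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of replicas that can be down while writes still succeed.
# $1 = replication factor
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for rf in 1 3; do
  echo "replication_factor=$rf quorum=$(write_quorum "$rf") tolerates=$(tolerated_failures "$rf")"
done
```

With replication_factor: 1 there is no slack at all, which matches the single point of failure described at the top of this post. With 3, one instance can be lost without interrupting writes.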
What's next
| Part | Topic | Status |
|---|---|---|
| 1 | Collecting metrics — Alloy, Prometheus, Node Exporter, cAdvisor, Grafana | ✅ Published |
| 1.5 | Replacing Prometheus with Mimir — long-term storage, decoupled alerting | ✅ This post |
| 2 | Alerting layer — Grafana alert rules, Mimir Ruler, PagerDuty via Terraform | ✅ Published |
| 3 | Logs — Loki + Alloy | Planned |
| 4 | Traces — Tempo + OpenTelemetry via Alloy | Planned |
| 5 | SLOs — Sloth, burn rate alerts | Planned |
All configs from this post are available at github.com/iac-toolbox.