
Observability Vertical Part 3 - Making Alerts Actionable - Wiring the Alerting Platform with PagerDuty

Viktor Vasylkovskyi

Series: Building a self-hosted observability stack from scratch


In Observability Vertical Part 2 - Service Alert Provisioning we defined alert rules and provisioned them against Grafana. Here, we are going to move beyond that and ensure that firing alerts actually reach someone who can act on them.

That is the gap this post closes.

A firing alert on its own is just a state change in a database. Without routing infrastructure, it sits there until someone happens to open Grafana. This post wires the exit: a PagerDuty service, Grafana contact points, and a notification policy. Together they are the plumbing that turns a firing alert into a page to the right person on call.

The naive approach is to use Grafana's built-in email notifications, but they quickly become limiting. Email doesn't wake anyone up at 3am, and there is no incident lifecycle: it fires and forgets. PagerDuty adds phone calls and push notifications that break through do-not-disturb, plus acknowledgement tracking that follows an incident through its lifecycle. There are also other free features, such as escalation policies that page the next person automatically if nobody responds. For a solo founder the rotation and escalation features are overkill, but the phone notification and incident lifecycle tracking are valuable from day one. And the free tier covers everything in this post.

This layer is owned by whoever runs the observability platform. It is provisioned once and rarely touched. Conversely, what we built in Observability Vertical Part 2 - Service Alert Provisioning is owned by service teams, who write alert rules labelled with a severity; the platform handles the rest.


What we are Building

In this step we are going to add a PagerDuty integration to Grafana. PagerDuty is a popular incident management platform that routes alerts to the right on-call engineer. To keep it simple, my project has only one team and one person on call (me), so whenever a critical alert fires, it is routed as a notification to myself.

Note that this is the first integration we add that is not open source: it is a hosted SaaS with a free tier, which is what we demo here.

For this to work, we need to connect Grafana alerts with the PagerDuty on-call policy, which translates into:

  • Enable Grafana to PagerDuty Integration via API key
  • Configure Grafana notification policy to use PagerDuty contact point
  • Configure PagerDuty service and escalation policy (who gets paged and in what order)

Enable Grafana to PagerDuty Integration via API key

Before running any Terraform, we need a PagerDuty account and an API token.

Let's head to pagerduty.com and create an account. Then navigate to:

Integrations → API Access Keys → Create API Key to get your pagerduty_token.

Pagerduty API Key


Providers

Now that we have the API key, we can start provisioning with Terraform. First things first, the providers:

# main.tf

terraform {
  required_version = ">= 1.0"
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
    pagerduty = {
      source  = "PagerDuty/pagerduty"
      version = "~> 3.0"
    }
  }
}

provider "grafana" {
  url  = var.grafana_url
  auth = "${var.grafana_admin_user}:${var.grafana_admin_password}"
}

provider "pagerduty" {
  token          = var.pagerduty_token
  service_region = var.pagerduty_service_region
}

locals {
  pagerduty_enabled = var.pagerduty_token != ""
}

Also the variables:

# variables.tf

variable "pagerduty_token" {
  type        = string
  description = "PagerDuty API token"
  sensitive   = true # keep the token out of plan and apply output
}

variable "pagerduty_service_region" {
  type        = string
  description = "PagerDuty service region"
  default     = "eu" # "eu" for EU accounts; set to "" for the default US region
}

I am using the eu service region because I am based in Europe. My PagerDuty URL looks like this: https://<company_name>.eu.pagerduty.com/. If your URL does not contain eu, your account is most likely in the US region.
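One way to supply these two variables is a terraform.tfvars file, which Terraform loads automatically from the working directory. This is a minimal sketch: the token value is a placeholder, and the file should stay out of version control.

```hcl
# terraform.tfvars (hypothetical values; do not commit this file)
pagerduty_token          = "REPLACE_WITH_YOUR_API_KEY" # placeholder, not a real key
pagerduty_service_region = "eu"                        # same as the variable default
```

Alternatively, Terraform reads the same value from a TF_VAR_pagerduty_token environment variable, which keeps the token off disk entirely.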


Connecting Grafana to PagerDuty

Now that we have two services, it is time to make them work together. There are a couple of steps here:

  1. Establish integration between Grafana and PagerDuty
  2. Configure contact point from Grafana to PagerDuty
  3. Configure Grafana notification policy to use PagerDuty contact point

Establish Integration between Grafana and PagerDuty

When an alert is critical enough that it has to flow into PagerDuty, it goes through the Events API v2. For that to happen, we need to generate an integration key. Conveniently, this can be done with Terraform. Let's create a PagerDuty Terraform module:

# modules/pagerduty/main.tf

variable "service" {
  type        = string
  description = "PagerDuty service ID to attach the integration to"
}

# Events API v2 integration: produces the integration key that Grafana sends events with
resource "pagerduty_service_integration" "grafana" {
  name    = "Grafana"
  service = var.service
  type    = "events_api_v2_inbound_integration"
}

output "integration_key" {
  value     = pagerduty_service_integration.grafana.integration_key
  sensitive = true
}

The above code outputs the integration key. Remember the var.service from Observability Vertical Part 2 - Service Alert Provisioning? Here it identifies the PagerDuty service that the integration is attached to, so every alert sent with this key lands on the right PagerDuty service.

In PagerDuty, incidents are always assigned to a service - it's the organisational unit that owns the problem. When a critical alert fires, an incident is created on that service, and whoever is on-call for it gets notified.

Configure Contact Point from Grafana to PagerDuty

Now we can use the integration key for our contact point. A contact point defines how Grafana sends notifications to external systems like PagerDuty.

resource "grafana_contact_point" "pagerduty" {
  name = "PagerDuty On-Call"

  pagerduty {
    integration_key = module.pagerduty.integration_key
    severity        = "critical"
    # Go template: renders the summary annotation of each firing alert in the PagerDuty incident title
    summary = "{{ len .Alerts.Firing }} alert(s) on {{ (index .Alerts.Firing 0).Labels.node }}{{ if (index .Alerts.Firing 0).Labels.service }} [{{ (index .Alerts.Firing 0).Labels.service }}]{{ end }}: {{ range .Alerts.Firing }}{{ .Annotations.summary }}; {{ end }}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

The severity = "critical" setting is the severity Grafana attaches to every event sent through this contact point; it does not filter alerts by itself. Which alerts reach PagerDuty is decided by the notification policy. Remember that our critical alerts are the ones that signal a device/node down, a container restarting, or a resource such as RAM or disk space approaching its limit.

The summary field is the message that Grafana sends to PagerDuty. The Go template renders the summary annotation of each firing alert directly in the PagerDuty incident title, immediately telling you what's wrong. Examples:

1 alert(s) on raspberrypi-4b: Host raspberrypi-4b is offline;
2 alert(s) on raspberrypi-4b: Low disk space on raspberrypi-4b; High memory on raspberrypi-4b;
3 alert(s) on raspberrypi-4b: Container my-app is restarting on raspberrypi-4b; Container my-app was OOM killed on raspberrypi-4b; ...

The alerts are grouped by the node and service name based on notification policy below.
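Because the severity field labels every event sent through a contact point, a second contact point is one way to tell PagerDuty that warnings are less urgent. The following is a sketch only: the resource name pagerduty_warning and the contact point name are made up for illustration, and it assumes the module.pagerduty output defined in this post.

```hcl
# Hypothetical second contact point: same PagerDuty integration, lower event severity.
resource "grafana_contact_point" "pagerduty_warning" {
  name = "PagerDuty Warning" # illustrative name, not part of the original setup

  pagerduty {
    integration_key = module.pagerduty.integration_key
    severity        = "warning" # PagerDuty event severity: critical, error, warning or info
    summary         = "{{ len .Alerts.Firing }} warning(s): {{ range .Alerts.Firing }}{{ .Labels.alertname }} {{ end }}"
  }
}
```

The warning branch of the notification policy could then set its contact_point to grafana_contact_point.pagerduty_warning.name, while critical alerts keep using the original contact point.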

Configure Grafana notification policy to use PagerDuty contact point

This is the core of the platform's routing rules. Here we define the routing logic that every alert rule in the system flows through.

Recall from Observability Vertical Part 2 - Service Alert Provisioning that we have the following alerts:

| Alert | PromQL Signal | For | Severity |
|---|---|---|---|
| NodeDown | up{job="node"} drops below 1 | 2m | critical |
| HighMemoryUsage | available memory falls below threshold % | 5m | critical |
| SwapInUse | node_memory_SwapUsed_bytes above 0 | 5m | warning |
| LowDiskSpace | root filesystem usage exceeds threshold % | 5m | critical |
| HighCPUUsage | average CPU idle drops below threshold % | 5m | warning |
| HighCPUTemperature | node_hwmon_temp_celsius exceeds 75°C | 5m | critical |
| ContainerRestarting | container start count increases more than 2 times in 5m | 1m | critical |
| ContainerOOMKill | container_oom_events_total increments | 0s | critical |
| ContainerHighMemory | container memory usage exceeds 85% of its limit | 5m | warning |
| ContainerCPUThrottled | throttled CPU time exceeds 25% of total CPU time | 10m | warning |

With the notification policy below, all of these alerts, critical and warning alike, are routed to PagerDuty. Simple.

resource "grafana_notification_policy" "main" {
  group_by      = ["alertname", "instance", "service"]
  contact_point = grafana_contact_point.pagerduty.name

  # Default: buffer for 30s to group related alerts, repeat every 24h
  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "24h"

  # Critical: immediate page, repeats every 24h until resolved
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.pagerduty.name
    group_wait      = "0s"
    repeat_interval = "24h"
  }

  # Warning: 1 minute buffer to avoid flapping, repeats every 24h
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "warning"
    }
    contact_point   = grafana_contact_point.pagerduty.name
    group_wait      = "1m"
    repeat_interval = "24h"
  }

  depends_on = [
    grafana_contact_point.pagerduty,
  ]
}

A few decisions worth understanding:

  • group_by = ["alertname", "instance", "service"] - without instance, alerts from all hosts collapse into a single PagerDuty incident. With it, each host gets its own group, so it's immediately clear which machine is the problem.
  • group_wait = "30s" at the top level - when a node goes down it often triggers several alerts at once (NodeDown, HighMemory, LowDisk if the disk was filling). The 30-second buffer gives alerts that share the same group_by labels time to be batched into one notification. Because alertname is one of those labels, alerts with different names still form separate groups. The critical policy overrides the wait to 0s so a node going offline pages immediately; the grouping still happens, just without the artificial delay.
  • repeat_interval - since this is a pet project, one page a day is enough. For more business-critical systems it can be lowered so that pages repeat more often.
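If the goal is one page per host rather than one per alert name, a possible variant is to drop alertname from group_by. This is a sketch of that alternative, not the configuration used in this post; only the root-level settings are shown, and the nested severity policies would stay as they are.

```hcl
# Variant sketch: group by host and service only, so NodeDown, HighMemoryUsage and
# LowDiskSpace firing on the same instance can share one notification.
resource "grafana_notification_policy" "main" {
  group_by      = ["instance", "service"] # alertname intentionally left out
  contact_point = grafana_contact_point.pagerduty.name

  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "24h"
}
```

The trade-off is coarser incidents: unrelated problems on the same host and service end up in the same group, so the summary template has to carry enough detail to tell them apart.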

Service Definition

So far the setup has been done by the platform team. It is applied once when the stack is set up and rarely touched thereafter.

The next step is to define a service that will actually use the alerting policy and the escalation policy. This is the service whose ID is passed to the module as var.service:

# Reference the default escalation policy
# Free plan includes one escalation policy — reference it rather than creating a new one
data "pagerduty_escalation_policy" "default" {
  name = "Default"
}

resource "pagerduty_service" "my_service" {
  name = var.service_name
  # Note: avoid spaces in the service name — spaces cause routing issues where
  # alert messages fail to match the PagerDuty service correctly

  auto_resolve_timeout    = 172800 # auto-resolve after 48 hours if no further data arrives
  acknowledgement_timeout = 86400  # re-alert after 24 hours if acknowledged but not resolved

  # These two timeouts reduce alert fatigue and ensure no incident is silently ignored.
  # Without auto_resolve_timeout, a stale incident from a transient blip stays open indefinitely.
  # Without acknowledgement_timeout, an engineer can acknowledge and forget.

  escalation_policy = data.pagerduty_escalation_policy.default.id
  alert_creation    = "create_alerts_and_incidents"
}

The escalation policy attached to the service answers two questions: who gets paged, and what happens if they don't respond? It defines an ordered sequence of users or teams to notify, with configurable delays between each step. On the free plan PagerDuty gives you one escalation policy, so rather than creating one we reference the existing default using a data source.
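For accounts that can create their own escalation policies, the ordered steps described above could look roughly like this. It is a sketch under assumptions: both email addresses, the resource names, and the delays are placeholders, and none of it is needed for the free-tier setup in this post.

```hcl
# Look up two existing PagerDuty users by email (placeholder addresses).
data "pagerduty_user" "primary" {
  email = "primary@example.com"
}

data "pagerduty_user" "secondary" {
  email = "secondary@example.com"
}

# Hypothetical two-step policy: page the primary, then the secondary after 15 minutes.
resource "pagerduty_escalation_policy" "on_call" {
  name      = "On-Call Escalation" # illustrative name
  num_loops = 2                    # repeat the sequence up to 2 more times after the last rule

  rule {
    escalation_delay_in_minutes = 15 # wait before moving on to the next rule
    target {
      type = "user_reference"
      id   = data.pagerduty_user.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = data.pagerduty_user.secondary.id
    }
  }
}
```

A service would then reference pagerduty_escalation_policy.on_call.id instead of the data source for the default policy.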

alert_creation controls how PagerDuty handles incoming events from the integration. With create_alerts_and_incidents, each triggered event creates an alert in PagerDuty and an incident associated with it.

Putting it all together

Throughout this post we have provisioned pieces in isolation. Here is how they connect:

The integration key is the binding pin. It is generated by pagerduty_service_integration, which is bound to pagerduty_service.my_service. Every event Grafana pushes using that key lands on that specific service, which then applies its escalation policy to decide who gets paged.

Here is the complete Terraform that wires all four pieces together in one place:


# ── PagerDuty ─────────────────────────────────────────────────────────────

# Reference the default escalation policy
data "pagerduty_escalation_policy" "default" {
  name = "Default"
}

# The service — all incidents land here
resource "pagerduty_service" "my_service" {
  name                    = var.service_name
  escalation_policy       = data.pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 172800  # 48 hours
  acknowledgement_timeout = 86400   # 24 hours
}

# The integration — binds Grafana to the service and produces the key
resource "pagerduty_service_integration" "grafana" {
  name    = "Grafana"
  service = pagerduty_service.my_service.id          # ← bound to the service above
  type    = "events_api_v2_inbound_integration"
}

# ── Grafana ───────────────────────────────────────────────────────────────

# Contact point — uses the integration key to push events into PagerDuty
resource "grafana_contact_point" "pagerduty" {
  name = "PagerDuty On-Call"

  pagerduty {
    integration_key = pagerduty_service_integration.grafana.integration_key  # ← from above
    severity        = "critical"
    summary         = "{{ len .Alerts.Firing }} alert(s): {{ range .Alerts.Firing }}{{ .Labels.alertname }} {{ end }}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Notification policy — routes firing alerts to the contact point by severity
resource "grafana_notification_policy" "main" {
  group_by      = ["alertname", "instance"]
  contact_point = grafana_contact_point.pagerduty.name  # ← from above

  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "24h"

  # Critical — immediate page
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.pagerduty.name
    group_wait      = "0s"
    repeat_interval = "24h"
  }

  # Warning — buffered, less frequent
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "warning"
    }
    group_wait      = "1m"
    repeat_interval = "24h"
  }

  depends_on = [grafana_contact_point.pagerduty]
}

Making a PagerDuty Module

To abstract away the complexity of the integration, we can make a module:

variable "service" {
  type        = string
  description = "PagerDuty service ID to attach the integration to"
}

# Events API v2 integration: produces the integration key that Grafana sends events with
resource "pagerduty_service_integration" "grafana" {
  name    = "Grafana"
  service = var.service
  type    = "events_api_v2_inbound_integration"
}

output "integration_key" {
  value     = pagerduty_service_integration.grafana.integration_key
  sensitive = true
}

Service Definition Consuming the Module

The service then only has to define itself and call the module:

data "pagerduty_escalation_policy" "default" {
  name = "Default"
}

resource "pagerduty_service" "my_service" {
  name                    = var.service_name
  escalation_policy       = data.pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 172800
  acknowledgement_timeout = 86400
}

module "pagerduty" {
  source  = "./modules/pagerduty"
  service = pagerduty_service.my_service.id
}

Grafana Contact Point and Notification Policy

Grafana forwards the alerts to the PagerDuty integration using the integration key provided by the module.

# Contact point — uses the integration key produced by the module
resource "grafana_contact_point" "pagerduty" {
  name = "PagerDuty On-Call"

  pagerduty {
    integration_key = module.pagerduty.integration_key  # ← output from the module
    severity        = "critical"
    summary         = "{{ len .Alerts.Firing }} alert(s): {{ range .Alerts.Firing }}{{ .Labels.alertname }} {{ end }}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Notification policy — routes firing alerts to the contact point
resource "grafana_notification_policy" "main" {
  group_by      = ["alertname", "instance"]
  contact_point = grafana_contact_point.pagerduty.name  # ← name of the contact point above

  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "24h"

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.pagerduty.name
    group_wait      = "0s"
    repeat_interval = "24h"
  }

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "warning"
    }
    group_wait      = "1m"
    repeat_interval = "24h"
  }

  depends_on = [grafana_contact_point.pagerduty]
}

So the full dependency chain in one view:

pagerduty_service.my_service.id
        ↓
module "pagerduty" (produces integration_key)
        ↓
grafana_contact_point.pagerduty (consumes integration_key)
        ↓
grafana_notification_policy.main (consumes contact_point.name)

Testing

Check that the Service has been created

Pagerduty Service
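Besides looking in the PagerDuty UI, a Terraform output is one way to confirm the service exists after terraform apply. This sketch assumes the pagerduty_service.my_service resource from this post; the output names are hypothetical.

```hcl
# Hypothetical outputs: print the ID and web URL of the provisioned PagerDuty service.
output "pagerduty_service_id" {
  value = pagerduty_service.my_service.id
}

output "pagerduty_service_url" {
  value = pagerduty_service.my_service.html_url # link to the service page in PagerDuty
}
```

Running terraform output pagerduty_service_url after an apply then prints a link that should open the service shown above.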