This post describes the setup of a dead man’s switch for Prometheus / Alertmanager in a Kubernetes cluster. A primary goal of a monitoring and alerting system is to generate alerts as soon as problems occur, so administrators can react promptly and the impact on users stays limited. But what happens if the monitoring system itself is impaired? During an outage of the Kubernetes cluster where Prometheus is installed, it is highly likely that no alerts are generated at all.

That is why we set up a dead man’s switch implemented with AWS services. We use the default Watchdog alert shipped with the Prometheus Operator. It fires as long as Prometheus is running, so during normal operation the alert is always active in Alertmanager.
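
For reference, the built-in Watchdog rule shipped with the Prometheus Operator / kube-prometheus-stack looks roughly like this (a sketch; labels and annotations may differ between versions):

groups:
  - name: general.rules
    rules:
      - alert: Watchdog
        # Always evaluates to 1, so the alert never stops firing while Prometheus evaluates rules.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          message: >-
            This alert is meant to ensure that the entire alerting pipeline is functional.
            It is always firing, therefore it should always be firing in Alertmanager.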

Alertmanager is configured to forward this alert to the deadmanswatch pod in our Kubernetes cluster. Deadmanswatch then publishes the alert as a metric to CloudWatch via the AWS API.

In CloudWatch, an alarm is set up that triggers when the metric is missing data points. If Alertmanager cannot deliver the Watchdog alert for any reason, the metric is no longer updated and the alarm fires. The alarm publishes to an SNS topic, which in turn triggers an e-mail notification.

A quick overview of the flow of the Watchdog alert:

flowchart TD
  al[Alertmanager]
  p[Prometheus]
  dw[Deadmanswatch]
  cw[CloudWatch Metric]
  alarm[Alarm]
  sns[SNS-Topic]
  subgraph Kubernetes
    p -->|Send Watchdog Alert| al -->|Forward Watchdog alert| dw
  end
  subgraph AWS
    alarm -->|observes| cw
    dw -->|Publishes Watchdog metric| cw
    alarm -->|Triggers Notification| sns --> Mail_To_Emergency_Team
  end

Configuration

Prometheus/Alertmanager

Prometheus sends the Watchdog alert out of the box; we just have to configure the Alertmanager routing to forward the Watchdog alert to the deadmanswatch pod. Deadmanswatch also comes with a Kubernetes Service resource, so it is reachable inside the cluster via http://prometheus-deadmanswatch.
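
For illustration, that Service looks roughly like this (a minimal sketch; selector labels and the target port depend on the chart version):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-deadmanswatch
spec:
  selector:
    app.kubernetes.io/name: deadmanswatch   # illustrative selector, set by the chart
  ports:
    - port: 80
      targetPort: http   # the container port exposed by the deadmanswatch pod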

Alertmanager routing

config:
  route:
    receiver: default_receiver   # defined elsewhere in our configuration
    routes:
      - matchers:
          - alertname = Watchdog
        receiver: cloudwatch
  receivers:
    - name: cloudwatch
      webhook_configs:
        - url: "http://prometheus-deadmanswatch:80/alert"

Deadmanswatch

Deadmanswatch was installed via its Helm chart, and these values have been configured:

deadmanswatch:
  metricNamespace: Prometheus
  metricName: DeadMansSwitch
  awsRegion: eu-central-1
  heartbeatInterval: 120s
serviceAccount: # Authentication via IRSA 
  name: deadmanswatch-sa
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::xxxxx:role/Deadmanswatch"
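
The IAM role referenced in the IRSA annotation only needs permission to push metrics to CloudWatch. A minimal policy sketch in Terraform (resource names are illustrative):

# Minimal permissions for the deadmanswatch IRSA role; names are illustrative.
# cloudwatch:PutMetricData does not support resource-level restrictions, hence "*".
resource "aws_iam_role_policy" "deadmanswatch_put_metric_data" {
  name = "deadmanswatch-put-metric-data"
  role = aws_iam_role.deadmanswatch.id   # the role behind arn:aws:iam::xxxxx:role/Deadmanswatch

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["cloudwatch:PutMetricData"]
      Resource = "*"
    }]
  })
}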

AWS

The CloudWatch alarm was created with Terraform:

resource "aws_cloudwatch_metric_alarm" "prometheus-deadmanswatch" {
  alarm_name          = "deadmansswitch-missing"
  comparison_operator = "LessThanThreshold"
  metric_name         = "Watchdog"
  namespace           = "Prometheus"
  period              = 60
  evaluation_periods  = 2
  treat_missing_data  = "breaching"
  threshold           = 1
  statistic           = "Minimum"
  dimensions = {
    source = "prometheus"
  }
  alarm_description = "This alarm fires when prometheus is down"
  alarm_actions     = [module.sns_topic_deadmanswatch.arn]
}
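
The SNS topic referenced in alarm_actions is created by a small module in our setup. Stripped down to plain resources, it is roughly equivalent to this sketch (topic name and e-mail address are placeholders):

# Rough equivalent of module.sns_topic_deadmanswatch as plain resources;
# topic name and e-mail address are placeholders.
resource "aws_sns_topic" "deadmanswatch" {
  name = "deadmanswatch-alarms"
}

resource "aws_sns_topic_subscription" "emergency_mail" {
  topic_arn = aws_sns_topic.deadmanswatch.arn
  protocol  = "email"
  endpoint  = "emergency-team@example.com"   # the recipient has to confirm the subscription
}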