This post describes the setup of a dead man's switch for Prometheus/Alertmanager in a Kubernetes cluster. A primary goal of a monitoring and alerting system is to generate alerts as soon as possible when problems occur, so administrators can react promptly and the impact on users stays limited. But what happens if the monitoring system itself is impaired? During an outage of the Kubernetes cluster where Prometheus is installed, it is highly probable that no alerts are generated at all.
That is why we set up a dead man's switch implemented with AWS services. We use the default Watchdog alert that the Prometheus Operator provides and that Prometheus sends to Alertmanager. It is published continuously while Prometheus is running, so during normal operation the alert is always firing in Alertmanager.
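For reference, this is roughly what the bundled Watchdog rule looks like (abridged; check the PrometheusRule shipped with your Prometheus Operator / kube-prometheus-stack installation for the exact definition). The essential part is the constant expression vector(1), which makes the alert fire permanently:

groups:
  - name: general.rules
    rules:
      - alert: Watchdog
        # vector(1) always returns a result, so this alert never stops firing
        expr: vector(1)
        labels:
          severity: none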
Alertmanager is configured to forward this alert to the deadmanswatch pod in our Kubernetes cluster. Deadmanswatch then publishes the alert as a metric to CloudWatch via the AWS API.
In CloudWatch, an alarm is set up that triggers when the metric is missing data points. If Alertmanager cannot deliver the Watchdog alert for any reason, the metric is no longer updated and the alarm fires. The alarm publishes to an SNS topic, which in turn triggers an e-mail notification.
A quick overview of the flow of the Watchdog alert: Prometheus → Alertmanager → deadmanswatch → CloudWatch metric → CloudWatch alarm → SNS topic → e-mail notification.
Configuration
Prometheus/Alertmanager
Prometheus sends the Watchdog alert out of the box, so we only have to configure the Alertmanager routing to forward the Watchdog alert
to the deadmanswatch pod. Deadmanswatch also provides a Kubernetes Service resource, so it is reachable via http://prometheus-deadmanswatch
Alertmanager routing
config:
  route:
    receiver: default_receiver
    routes:
      - matchers:
          - alertname = Watchdog
        receiver: cloudwatch
  receivers:
    - name: cloudwatch
      webhook_configs:
        - url: "http://prometheus-deadmanswatch:80/alert"
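For illustration, the body that Alertmanager POSTs to the /alert endpoint follows the standard Alertmanager webhook format. An abridged example of what deadmanswatch receives for the Watchdog alert could look like this (timestamps and the external URL are placeholder values):

{
  "version": "4",
  "status": "firing",
  "receiver": "cloudwatch",
  "groupLabels": { "alertname": "Watchdog" },
  "commonLabels": { "alertname": "Watchdog", "severity": "none" },
  "externalURL": "http://alertmanager.example",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "Watchdog", "severity": "none" },
      "annotations": {},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}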
Deadmanswatch
Deadmanswatch was installed via its Helm chart, and these values have been configured:
deadmanswatch:
  metricNamespace: Prometheus
  metricName: DeadMansSwitch
  awsRegion: eu-central-1
  heartbeatInterval: 120s

serviceAccount: # Authentication via IRSA
  name: deadmanswatch-sa
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::xxxxx:role/Deadmanswatch"
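The IRSA role referenced in the annotation is not part of the chart values. A minimal sketch of the permissions it needs, assuming the AWS side is also managed with Terraform and using hypothetical resource names (the OIDC trust policy for the service account is omitted), could look like this; deadmanswatch only has to be allowed to push metric data into the configured namespace:

# Policy allowing PutMetricData restricted to the "Prometheus" namespace
resource "aws_iam_policy" "deadmanswatch_put_metrics" {
  name = "deadmanswatch-put-metrics"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["cloudwatch:PutMetricData"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "cloudwatch:namespace" = "Prometheus"
          }
        }
      }
    ]
  })
}

# Attach the policy to the IRSA role assumed by the deadmanswatch-sa service account
resource "aws_iam_role_policy_attachment" "deadmanswatch" {
  role       = "Deadmanswatch"
  policy_arn = aws_iam_policy.deadmanswatch_put_metrics.arn
}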
AWS
The CloudWatch alarm was created with Terraform:
resource "aws_cloudwatch_metric_alarm" "prometheus-deadmanswatch" {
alarm_name = "deadmansswitch-missing"
comparison_operator = "LessThanThreshold"
  metric_name         = "DeadMansSwitch" # must match the metricName configured for deadmanswatch
  namespace           = "Prometheus"
  period              = 60
  evaluation_periods  = 2
  treat_missing_data  = "breaching"
  threshold           = 1
  statistic           = "Minimum"

  dimensions = {
    source = "prometheus"
  }

  alarm_description = "This alarm fires when Prometheus is down"
  alarm_actions     = [module.sns_topic_deadmanswatch.arn]
}
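The SNS topic itself comes from a module in our setup. For completeness, a minimal sketch with plain resources and a placeholder address could look like this; note that the e-mail subscription has to be confirmed once via the confirmation mail AWS sends:

# SNS topic the CloudWatch alarm publishes to
resource "aws_sns_topic" "deadmanswatch" {
  name = "deadmanswatch-alarms"
}

# E-mail notification for the on-call address
resource "aws_sns_topic_subscription" "deadmanswatch_email" {
  topic_arn = aws_sns_topic.deadmanswatch.arn
  protocol  = "email"
  endpoint  = "oncall@example.com" # placeholder address
}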