Route53 health checks

Although we already monitor our Kubernetes cluster with Prometheus, we also want to monitor our customer systems from outside. In contrast to Prometheus monitoring, this also checks the components that route traffic into our Kubernetes cluster. In our AWS environment, these additional components are checked:

  • Route53
  • the Application Load Balancer
  • the corresponding certificates at the Load Balancer
  • Target Groups
  • Ingress Controller in the cluster.

To set up monitoring from outside the cluster, we use AWS Route53 health checks to monitor an endpoint. The AWS documentation describes them as follows:

You can configure a health check that monitors an endpoint that you specify either by IP address or by domain name. At regular intervals that you specify, Route 53 submits automated requests over the internet to your application, server, or other resource to verify that it’s reachable, available, and functional. Optionally, you can configure the health check to make requests similar to those that your users make, such as requesting a web page from a specific URL.

To be notified of failed health checks, we have set up additional CloudWatch alarms that send emails to the monitoring teams via an SNS topic.

Configuration

Route53 Health Check

resource "aws_route53_health_check" "customer_x" {
  fqdn              = "customer-endpoint-domain"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/path/to/health_check_url"
  failure_threshold = 1
  request_interval  = 30
  tags              = {
    Name = "customer_x"
  }
}

CloudWatch Health Check - Alarm

Note: The SNS topic referenced in alarm_actions and the aws_cloudwatch_metric_alarm shown below both have to be created in the AWS region N. Virginia (us-east-1). Route53 is a global service, and the CloudWatch metrics generated by its health checks are only available in this region. As we normally operate in a different region, we added an additional provider in Terraform:

provider "aws" {
  alias   = "virginia"
  region  = "us-east-1"
  profile = "xxx"
}

resource "aws_cloudwatch_metric_alarm" "metric_alarm" {
  provider = aws.virginia

  alarm_name          = "route53-health-checks-customer_x"
  namespace           = "AWS/Route53"
  comparison_operator = "LessThanThreshold"
  metric_name         = "HealthCheckStatus"
  period              = 300
  statistic           = "Minimum"
  threshold           = 1
  treat_missing_data  = "missing"
  dimensions          = {
    HealthCheckId = aws_route53_health_check.customer_x.id
  }
  alarm_actions       = [module.sns_topic_route53_health_checks.arn]
  alarm_description   = "Alarm for AWS Health Checks for customer_x"
  # Alarm only when all 5 of the last 5 periods breach (5 x 300 s = 25 minutes).
  datapoints_to_alarm = 5
  evaluation_periods  = 5
}
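
The alarm references an SNS topic through module.sns_topic_route53_health_checks, whose contents are not shown here. As a minimal sketch of what such a topic could look like with plain resources, the following creates the topic and an email subscription through the same aws.virginia provider, so that both live in us-east-1. The resource names, the topic name, and the email address are illustrative assumptions, not our actual module:

```hcl
# Hypothetical stand-in for module.sns_topic_route53_health_checks.
# The topic must live in us-east-1, the same region as the alarm.
resource "aws_sns_topic" "route53_health_checks" {
  provider = aws.virginia

  name = "route53-health-checks" # assumed name
}

# Email subscription for the monitoring team (address is a placeholder).
resource "aws_sns_topic_subscription" "monitoring_mail" {
  provider = aws.virginia

  topic_arn = aws_sns_topic.route53_health_checks.arn
  protocol  = "email"
  endpoint  = "monitoring@example.com"
}
```

With this variant, alarm_actions would reference aws_sns_topic.route53_health_checks.arn instead of the module output. Email subscriptions have to be confirmed by the recipient before SNS delivers any notifications.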

With this in place, we are notified whenever traffic can no longer be routed to the endpoint in our cluster.