Uptime Monitoring with AWS Route53 Health Checks

Introduction

If a site has to go down, it’s always better to know before the client calls or, worse, before social networks find out!

Regularly checking the availability of your site by requesting a specific URL at predetermined intervals (like every minute) allows you to react swiftly if the site becomes unavailable or when it’s back online.

Various services, like Pingdom, exist for this, and crafting a custom solution is fairly easy with the libraries available. AWS also provides a comprehensive, albeit slightly complex, solution of its own.

By leveraging Route53 Health Checks, CloudWatch Alarms, and the Simple Notification Service (SNS), we can set up effective, reliable uptime monitoring on AWS.

Considerations and Drawbacks

Advantages:

Ease of Expansion: After the initial SNS topic setup, incorporating more alarms for new services or endpoints is straightforward.
Cost effective: It costs a couple of dollars a month, per site, making it an affordable option for continuous monitoring.
Infrastructure as code: Seamlessly integrate these health checks into your infrastructure-as-code workflows, deploying site-specific monitoring automatically alongside your site’s infrastructure, and keeping all your resources together.
Flexibility: You can trigger pretty much anything if the alarm goes off.

Drawbacks:

Initial Setup Complexity: Configuring the SNS topic and adding subscribers (email addresses, mobile numbers, or Lambda functions) can be somewhat tedious.
Potential for Lower Costs: Despite its initial affordability, there are even more cost-effective monitoring solutions available, especially for basic requirements or large quantities of monitored URLs.
Regional Constraint: Route53 publishes health check metrics to CloudWatch only in us-east-1, so the alarm that watches them and the SNS topic it notifies both need to reside in the us-east-1 region for this solution to work.

Deployment

Resources Required

Setting up an SNS topic to send to email or SMS is something I’ve covered in a previous post.
For this solution you will need at least one SNS topic (two if you want separate topics for alerts and OKs) in us-east-1, preferably with some subscriptions (for example a mobile number for text messages, an email address, or a Lambda function).
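If you do not have a topic yet, a minimal sketch of one in CloudFormation might look like the following. The parameter name `AlertEmail` and the resource name `AlarmTopic` are assumptions made for illustration, not names from the previous post. Deploy the stack in us-east-1.

```yaml
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  AlertEmail:            # hypothetical parameter name
    Type: String
    Description: Email address that should receive the notifications
Resources:
  AlarmTopic:            # hypothetical resource name
    Type: "AWS::SNS::Topic"
    Properties:
      DisplayName: Uptime alarms
      Subscription:
        # The recipient has to confirm the subscription by email
        # before any messages are delivered.
        - Protocol: email
          Endpoint: !Ref AlertEmail
Outputs:
  AlarmTopicArn:
    # !Ref on an SNS topic returns its ARN
    Value: !Ref AlarmTopic
```

The `AlarmTopicArn` output is the value you would pass to the health check template as its topic ARN parameter.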

Our CloudFormation template will deploy two resources:

Route53 Healthcheck
This will, at the given interval, request the given URL and save two metrics to CloudWatch.
CloudWatch Alarm
This will monitor the metrics created by the healthcheck, and publish to the SNS topics when the “HealthCheckStatus” changes from 1 to 0 (or vice versa).
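The template in this post only alarms on “HealthCheckStatus”. As a sketch of how the second metric, “HealthCheckPercentageHealthy”, could be used as well, the fragment below adds an optional early-warning alarm under `Resources`. The resource name `PercentageHealthyAlarm` and the threshold of 50 are assumptions chosen for illustration. It reuses `Route53HealthCheck`, `Domain`, and `AlarmTopicArn` from the template.

```yaml
  # Fragment: place under Resources in the health check template.
  PercentageHealthyAlarm:          # hypothetical resource name
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      AlarmDescription: !Sub Partial health check failures - ${Domain}
      Namespace: "AWS/Route53"
      MetricName: "HealthCheckPercentageHealthy"
      Dimensions:
        - Name: HealthCheckId
          Value: !Ref Route53HealthCheck
      ComparisonOperator: "LessThanThreshold"
      Period: 60
      EvaluationPeriods: 2
      Statistic: "Average"
      Threshold: 50                # assumed value: under half of the checkers report healthy
      AlarmActions:
        - !Ref AlarmTopicArn
```

Whether this extra alarm is worth having depends on how noisy partial failures are for your site. The status alarm alone is enough for basic uptime monitoring.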

In the below image, the Blue line shows the “HealthCheckStatus” metric and the Orange line “HealthCheckPercentageHealthy”.
Image: A failing healthcheck
This particular outage was a DNS issue, which is why we need healthchecks on the domain in front of CloudFront and not just on our infrastructure behind it.

CloudFormation

AWSTemplateFormatVersion: 2010-09-09
Parameters:
  OkTopicArn:
    Type: String
    Description: SNS Topic ARN for OK messages
  AlarmTopicArn:
    Type: String
    Description: SNS Topic ARN for ALARM messages
  Domain:
    Type: String
    Default: example.com
    Description: The domain name to check
  Path:
    Type: String
    Description: The path on this domain to check
Resources:
  Route53HealthCheck:
    Type: "AWS::Route53::HealthCheck"
    Properties:
      HealthCheckConfig:
        Port: 443
        Type: HTTPS
        ResourcePath: !Ref Path
        FullyQualifiedDomainName: !Ref Domain
        RequestInterval: 30
        FailureThreshold: 3
      HealthCheckTags:
        - Key: Name
          Value: !Sub Health check - ${Domain}
  CloudwatchAlarm:
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      AlarmDescription: !Sub Health check alarm - ${Domain}
      Namespace: "AWS/Route53"
      MetricName: "HealthCheckStatus"
      Dimensions:
        - Name: HealthCheckId
          Value: !Ref Route53HealthCheck
      ComparisonOperator: "LessThanThreshold"
      Period: "60"
      EvaluationPeriods: "4"
      Statistic: "Minimum"
      Threshold: "1.0"
      AlarmActions:
        - !Ref AlarmTopicArn
      OKActions:
        - !Ref OkTopicArn

You can tweak some of the properties on the alarm, such as Period and EvaluationPeriods, to make it go off sooner or later, as you prefer.
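As one possible example of such a tweak, the variant below halves the number of evaluation periods so the alarm needs two consecutive failing minutes instead of four. The value of 2 is an assumption for illustration. A lower number alerts sooner but is more likely to fire on short blips.

```yaml
  # Drop-in variant of the CloudwatchAlarm resource from the template.
  CloudwatchAlarm:
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      AlarmDescription: !Sub Health check alarm - ${Domain}
      Namespace: "AWS/Route53"
      MetricName: "HealthCheckStatus"
      Dimensions:
        - Name: HealthCheckId
          Value: !Ref Route53HealthCheck
      ComparisonOperator: "LessThanThreshold"
      Period: 60
      EvaluationPeriods: 2         # was 4 in the original template
      Statistic: "Minimum"
      Threshold: 1
      AlarmActions:
        - !Ref AlarmTopicArn
      OKActions:
        - !Ref OkTopicArn
```

The health check itself also affects timing: with RequestInterval at 30 and FailureThreshold at 3, the endpoint has to fail several consecutive requests before the status metric drops to 0, so the alarm delay comes on top of that.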