Uptime Monitoring with AWS Route53 Health Checks
Stay a step ahead of outages - implement robust, serverless uptime monitoring using AWS's Route 53, CloudWatch, and SNS.
Introduction
If a site has to go down, it’s always better to know before the client calls or, worse, before social networks do!
Regularly checking the availability of your site by requesting a specific URL at predetermined intervals (like every minute) allows you to react swiftly if the site becomes unavailable or when it’s back online.
Various services, like Pingdom, exist for this, and crafting a custom solution is fairly easy, since there are libraries available. AWS also provides a comprehensive, albeit slightly complex, solution.
By leveraging Route53 Health Checks, CloudWatch Alarms, and the Simple Notification Service (SNS), we can set up effective, reliable uptime monitoring on AWS.
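For comparison, the custom approach mentioned above might look roughly like the following. This is only a minimal sketch using the Python standard library, not the method used in the rest of this post; the function names `is_up` and `watch` and the default values are assumptions made up for illustration.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts and 4xx/5xx
        # responses (raised as HTTPError) all count as "down".
        return False


def watch(
    url: str,
    check: Callable[[str], bool] = is_up,
    interval: float = 60.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    notify: Callable[[str], None] = print,
) -> None:
    """Poll `url` and call `notify` only when its state changes."""
    last_state = True  # assumption: the site is healthy when we start
    checks = 0
    while max_checks is None or checks < max_checks:
        state = check(url)
        if state != last_state:
            notify(f"{url} is {'UP' if state else 'DOWN'}")
        last_state = state
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval)
```

Calling `watch("https://example.com/")` would poll once a minute and print a line whenever the state flips. A script like this still needs somewhere to run and something to watch over it, which is part of the appeal of a managed setup.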
Considerations and Drawbacks
Advantages:
- Ease of Expansion: After the initial SNS topic setup, incorporating more alarms for new services or endpoints is straightforward.
- Cost effective: It costs a couple of dollars a month, per site, making it an affordable option for continuous monitoring.
- Infrastructure as code: Seamlessly integrate these health checks into your infrastructure-as-code workflows, deploying site-specific monitoring automatically alongside your site’s infrastructure, and keeping all your resources together.
- Flexibility: You can trigger pretty much anything if the alarm goes off.
Drawbacks:
- Initial Setup Complexity: Configuring the SNS topic and adding subscribers (email addresses, mobile numbers, or Lambda functions) can be somewhat tedious.
- Potential for Lower Costs: Despite its initial affordability, there are even more cost-effective monitoring solutions available, especially for basic requirements or large quantities of monitored URLs.
- Regional Constraint: Route53 health check metrics are only available in CloudWatch in us-east-1, so the alarm, the SNS topic, and the stack that creates the health check all need to reside in the us-east-1 region for this solution to work.
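The regional constraint is easy to trip over, so it can be worth checking topic ARNs before deploying. The sketch below reads the region out of an SNS topic ARN (ARNs have the form `arn:partition:service:region:account-id:resource`); the helper names `topic_region` and `check_topic_region` and the example ARNs are hypothetical.

```python
REQUIRED_REGION = "us-east-1"


def topic_region(topic_arn: str) -> str:
    """Extract the region field from an SNS topic ARN."""
    parts = topic_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {topic_arn!r}")
    return parts[3]


def check_topic_region(topic_arn: str, required: str = REQUIRED_REGION) -> str:
    """Return the ARN unchanged, or raise if it is in the wrong region."""
    region = topic_region(topic_arn)
    if region != required:
        raise ValueError(
            f"{topic_arn} is in {region}, but the alarm needs a topic in {required}"
        )
    return topic_arn
```

For example, `check_topic_region("arn:aws:sns:eu-west-1:123456789012:uptime-alerts")` raises a `ValueError`, while the same ARN with `us-east-1` is returned as-is.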
Deployment
Resources Required
Setting up an SNS topic to send to email or SMS is something I’ve covered in a previous post.
For this solution you will need at least one SNS topic (two if you want a separate topic for alerts and OKs) in us-east-1, preferably with some subscriptions (for example a mobile number for text messages, an email address, or a Lambda function).
Our CloudFormation template will deploy two resources:
- Route53 Healthcheck
This will, at the given interval, request the given URL and save two metrics to CloudWatch.
- CloudWatch Alarm
This will monitor the metrics created by the health check, and publish to the SNS topics when the “HealthCheckStatus” changes from 1 to 0 (or vice versa).
In the below image, the Blue line shows the “HealthCheckStatus” metric and the Orange line “HealthCheckPercentageHealthy”.
This particular outage was a DNS issue - which is why we need health checks on the domain in front of CloudFront and not just on our infrastructure behind it.
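To make the alarm's rule concrete, here is a small simulation of how the settings used in the template in this post (Statistic "Minimum", ComparisonOperator "LessThanThreshold", Threshold 1, EvaluationPeriods 4) turn per-period values of “HealthCheckStatus” into alarm states. It is a deliberately simplified model, not CloudWatch's actual implementation: it assumes one datapoint per period, ignores missing-data handling, and reports INSUFFICIENT_DATA until a full window is available. The function name `alarm_states` is made up for this sketch.

```python
from typing import List, Sequence


def alarm_states(
    period_minimums: Sequence[float],
    evaluation_periods: int = 4,
    threshold: float = 1.0,
) -> List[str]:
    """Return the (simplified) alarm state after each period.

    The alarm fires only when every one of the last
    `evaluation_periods` minimums is below `threshold`.
    """
    states = []
    for end in range(1, len(period_minimums) + 1):
        window = period_minimums[max(0, end - evaluation_periods):end]
        if len(window) < evaluation_periods:
            states.append("INSUFFICIENT_DATA")
        elif all(value < threshold for value in window):
            states.append("ALARM")
        else:
            states.append("OK")
    return states


# Four healthy periods, four failing periods, then recovery.
history = [1, 1, 1, 1, 0, 0, 0, 0, 1]
states = alarm_states(history)
```

In this model the alarm only fires on the fourth consecutive failing period and returns to OK as soon as one healthy period enters the window, which is why the OK notification tends to arrive faster than the ALARM one.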
CloudFormation
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  OkTopicArn:
    Type: String
    Description: SNS Topic ARN for OK messages
  AlarmTopicArn:
    Type: String
    Description: SNS Topic ARN for ALARM messages
  Domain:
    Type: String
    Default: example.com
    Description: The domain name to check
  Path:
    Type: String
    Description: The path on this domain to check
Resources:
  Route53HealthCheck:
    Type: "AWS::Route53::HealthCheck"
    Properties:
      HealthCheckConfig:
        Port: 443
        Type: HTTPS
        ResourcePath: !Ref Path
        FullyQualifiedDomainName: !Ref Domain
        RequestInterval: 30
        FailureThreshold: 3
      HealthCheckTags:
        - Key: Name
          Value: !Sub Health check - ${Domain}
  CloudwatchAlarm:
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      AlarmDescription: !Sub Health check alarm - ${Domain}
      Namespace: "AWS/Route53"
      MetricName: "HealthCheckStatus"
      Dimensions:
        - Name: HealthCheckId
          Value: !Ref Route53HealthCheck
      ComparisonOperator: "LessThanThreshold"
      Period: "60"
      EvaluationPeriods: "4"
      Statistic: "Minimum"
      Threshold: "1.0"
      AlarmActions:
        - !Ref AlarmTopicArn
      OKActions:
        - !Ref OkTopicArn
You can tweak some of the variables to get the alarm to go off sooner or later, as you prefer: Period and EvaluationPeriods on the alarm, and RequestInterval and FailureThreshold on the health check.
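One way to deploy the template is from a short script. The sketch below assumes the template has been saved to a local file and that `boto3` is installed with credentials configured; the helper names `build_parameters` and `deploy`, the stack name, and the file name are all placeholders. The client is pinned to us-east-1 because of the regional constraint described earlier.

```python
from typing import Dict, List


def build_parameters(
    domain: str, path: str, alarm_topic_arn: str, ok_topic_arn: str
) -> List[Dict[str, str]]:
    """Build the CloudFormation parameter list for the template's four parameters."""
    values = {
        "Domain": domain,
        "Path": path,
        "AlarmTopicArn": alarm_topic_arn,
        "OkTopicArn": ok_topic_arn,
    }
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]


def deploy(template_path: str, stack_name: str, parameters: List[Dict[str, str]]) -> str:
    """Create the stack in us-east-1 and wait for it to finish."""
    import boto3  # third-party; imported here so the helper above works without it

    client = boto3.client("cloudformation", region_name="us-east-1")
    with open(template_path, encoding="utf-8") as handle:
        template_body = handle.read()
    response = client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=parameters,
    )
    client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]
```

A call might look like `deploy("healthcheck.yaml", "uptime-example-com", build_parameters("example.com", "/", alarm_arn, ok_arn))`, where `alarm_arn` and `ok_arn` are the ARNs of your own SNS topics. The same topic ARN can be passed for both if you only have one.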
About James Babington
A cloud architect and engineer with a wealth of experience across AWS, web development, and security, James enjoys writing about the technical challenges and solutions he's encountered, but most of all he loves it when a plan comes together and it all just works.