AWS CloudWatch
How to use Amazon CloudWatch Events to Monitor Application Health?
A practical example from one of the projects I worked on:
- Set up a CloudWatch alarm that keeps checking the health of a Route53 record.
- Set up an Event Rule that triggers a failover mechanism to a secondary region when the health check alarm fires.
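One possible shape for these two pieces, as a rough sketch only. It assumes the alarm watches the standard Route53 health check metric (HealthCheckStatus in the AWS/Route53 namespace) and that the rule matches the alarm's state-change event. The function names, alarm name and health check id are hypothetical, and the project's real wiring lives in its CloudFormation template and may differ.

```python
def health_alarm_params(alarm_name, health_check_id):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call (assumed setup)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",  # 1 = healthy, 0 = unhealthy
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
    }


def failover_rule_pattern(alarm_name):
    """Event pattern that matches the alarm entering the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": [alarm_name], "state": {"value": ["ALARM"]}},
    }


# Possible usage with boto3 (needs credentials; Route53 metrics live in us-east-1):
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(
#       **health_alarm_params("my-health-alarm", "my-health-check-id"))
#   boto3.client("events").put_rule(
#       Name="my-failover-rule",
#       EventPattern=json.dumps(failover_rule_pattern("my-health-alarm")))
```

Keeping the parameters in plain functions makes it easy to compare them against what the CloudFormation template actually declares.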
Internal process to test Automatic Failover
How to see which load balancer the Route53 record is resolving to? Run nslookup:
nslookup myservices-stage.companyvpcaws12345.companyaws.company.com
nslookup myservices-prod.companyvpcaws12345.companyaws.company.com
12345 may have to be replaced with your AWS account number. The hostname used here needs to come from Route53: it is the record as it is defined there.
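The same lookup can be scripted. This is a minimal sketch using Python's standard socket module; describe_resolution is a hypothetical helper name, and the lookup parameter exists only so the function can be exercised without a network.

```python
import socket


def describe_resolution(hostname, lookup=socket.gethostbyname_ex):
    """Return the canonical name, aliases and sorted IPv4 addresses for a hostname."""
    canonical, aliases, addresses = lookup(hostname)
    return {
        "canonical_name": canonical,
        "aliases": aliases,
        "addresses": sorted(addresses),
    }


# Example call (needs network access and a resolvable record):
#   describe_resolution("myservices-stage.companyvpcaws12345.companyaws.company.com")
```

If the record is a CNAME, the canonical name is the load balancer's DNS name, which usually contains the region. For an alias record only the addresses are returned, so they have to be compared against the load balancer's own addresses.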
How to simulate a failover scenario for the primary region?
Find the failover rule in the AWS console and disable it: CloudWatch -> Events -> Rules, then search by the name of your rule or the name of your application (look at your CloudFormation template for reference).
Even though the Route53 health check is still passing, disabling the Event Rule triggers the CloudWatch alarm, and this causes the failover mechanism to kick in.
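The rule can also be toggled from a script instead of the console. This sketch assumes a boto3 "events" client is passed in; set_rule_enabled is a hypothetical helper, and the rule name is whatever your CloudFormation template defines.

```python
def set_rule_enabled(events_client, rule_name, enabled):
    """Enable or disable a CloudWatch Events (EventBridge) rule by name.

    events_client is expected to behave like boto3.client("events").
    Returns the rule state reported afterwards.
    """
    if enabled:
        events_client.enable_rule(Name=rule_name)
    else:
        events_client.disable_rule(Name=rule_name)
    # describe_rule reports the resulting state, e.g. "ENABLED" or "DISABLED".
    return events_client.describe_rule(Name=rule_name)["State"]


# Example call (needs AWS credentials and the real rule name):
#   import boto3
#   set_rule_enabled(boto3.client("events"), "my-failover-rule", False)
```

Calling it with enabled=False starts the simulated failover; calling it again with enabled=True restores the rule once the test is done.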
After the Event Rule is disabled, keep running nslookup until it shows that the Route53 record resolves to us-east-2 (or whatever region your failover mechanism is configured to use).
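Instead of re-running the lookup by hand, a small polling loop can watch for the switch. This is a sketch under assumptions: wait_for_change is a hypothetical helper, the interval and attempt count are arbitrary, and a change in addresses is only a hint, because load balancer addresses can also rotate within one region.

```python
import socket
import time


def wait_for_change(hostname, lookup=socket.gethostbyname_ex,
                    interval_seconds=10, max_attempts=60, sleep=time.sleep):
    """Poll DNS until the resolved addresses differ from the first answer.

    Returns (old_addresses, new_addresses), or None if nothing changed
    within max_attempts polls.
    """
    baseline = sorted(lookup(hostname)[2])
    for _ in range(max_attempts):
        sleep(interval_seconds)
        current = sorted(lookup(hostname)[2])
        if current != baseline:
            return baseline, current
    return None


# Example call (needs network access):
#   wait_for_change("myservices-stage.companyvpcaws12345.companyaws.company.com")
```

Start the loop before disabling the rule so that the baseline is the primary region's answer. Local resolver caching can delay the change seen by the script.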
After this, enable the Event Rule again to simulate us-east-1 returning to a healthy state. The alarm then clears, and nslookup resolves to us-east-1 again.
A note about the time it takes for nslookup to resolve to the expected region
We need to look at the value of the field called Route53RecordSetTTL in the CloudFormation template. What does this value mean? It is the duration for which DNS resolvers keep the Route53 record value in cache. That time needs to elapse before lookups return the new failover value, so the smaller this value is, the faster the failover takes effect. If it is set to a large value (like 300 seconds), then when the primary region goes unhealthy it can take up to 5 minutes for traffic to switch to the secondary region. That means the application can be in an unhealthy state for up to 5 minutes.
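The arithmetic can be made explicit with a small sketch. worst_case_failover_seconds is a hypothetical helper, and detection_seconds stands for however long the alarm takes to notice the failure, which these notes do not specify.

```python
def worst_case_failover_seconds(ttl_seconds, detection_seconds=0):
    """Upper bound on how long a client may keep using the old cached record."""
    return detection_seconds + ttl_seconds


# Compare a few TTL values, ignoring detection time.
for ttl in (30, 60, 300):
    minutes = worst_case_failover_seconds(ttl) / 60
    print(f"TTL {ttl}s: up to {minutes:.1f} min on the old region")
```

With a TTL of 300 seconds the bound is 5 minutes, matching the example above. Any detection delay is added on top of that.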