- The semantics of CW seem convoluted. But once you stare at the API docs for long enough, the core concepts are easy to grok: Metrics (regularly submitted from machine to CW), Alarms (abstractions for defining the logic of an alarm based on the behavior of Metrics), and SNS Topics (what to notify when an Alarm goes off; the subscriber could be just an email address).
- Once you get the data model right, all implementations (click ops, terraform, bash via awscli, boto3, etc.) look essentially identical.
- Some Metrics come for free, e.g. CPU usage is reported by any EC2 instance to CW. For some other Metrics, notably disk and memory usage, you need to configure your instance to report them to CW. This is where the OP's monitoring scripts come in.
- The monitoring scripts and the cron config the OP refers to are deprecated [0]. Instead there's a new CloudWatch Agent [1]: you install the package on your EC2 instances, provide a configuration file to it, and you're set.
[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scri...
[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
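To make the Metric / Alarm / SNS Topic model concrete, here is a minimal sketch in Python. It only builds the keyword arguments for the CloudWatch PutMetricAlarm call; the function name, alarm name, instance id, topic ARN, and thresholds are all made up for illustration, and nothing here talks to AWS.

```python
def build_cpu_alarm(instance_id: str, topic_arn: str, threshold: float = 80.0) -> dict:
    """Build keyword arguments for a CloudWatch PutMetricAlarm call (hypothetical helper)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # naming scheme is an assumption
        # The Metric: namespace + name + dimensions pick out one time series.
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # The Alarm logic: average over 5-minute periods, 2 periods in a row.
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS Topic: where the notification goes when the alarm fires.
        "AlarmActions": [topic_arn],
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    params = build_cpu_alarm(
        "i-0123456789abcdef0",
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",
    )
    print(params["AlarmName"])
    # With boto3 installed and credentials configured, the same dict could be
    # passed along as: boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same handful of fields shows up whether you fill them in through the console, terraform, or the awscli, which is roughly why the implementations end up looking alike.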
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
For the love of $DEITY, if you're going to set up CloudWatch monitoring, create custom metrics that map to your business outcomes and alert when those go off the rails.
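As a rough illustration of what a business-level custom metric might look like, here is a sketch that builds the payload for the CloudWatch PutMetricData call. The namespace "MyShop/Business", the metric name, and the helper itself are assumptions for the example, not anything from a real setup.

```python
from datetime import datetime, timezone


def build_business_metric(
    name: str,
    value: float,
    unit: str = "Count",
    namespace: str = "MyShop/Business",  # hypothetical custom namespace
) -> dict:
    """Build keyword arguments for a CloudWatch PutMetricData call (hypothetical helper)."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Timestamp": datetime.now(timezone.utc),
                "Value": value,
                "Unit": unit,
            }
        ],
    }


if __name__ == "__main__":
    payload = build_business_metric("OrdersCompleted", 42)
    print(payload["MetricData"][0]["MetricName"])
    # With boto3 installed and credentials configured, this could be sent as:
    #   boto3.client("cloudwatch").put_metric_data(**payload)
```

An alarm on something like this (say, completed orders dropping below a floor) tends to say more about whether the system is healthy than CPU graphs do.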
What's nice is that it catches a variety of disk issues as well.
I'm sure it's not perfect for all cases, but for me it covers most of them.
It’s one of those things that really works pretty well, but there are enough edge cases to make it slightly soul-sucking.
Terraform has its faults, but it is the best in its class, especially when you need to manage infrastructure beyond a single cloud provider (e.g. we manage our datadog monitors and dashboards, pagerduty alerts and much more). The only other thing that would probably thrash it is pulumi, which has similar concepts, except you can use many different languages as opposed to HCL (no, CDK doesn't count, because it is still very immature and last I checked it only supported one or two languages).
Using terraform for this is great because it removes the unwanted alarms.
I had to create alarms whenever the instances auto scale, so I wrote a python script using cdktf, and now a Jenkins job handles it. It even updates the cloudwatch dashboard.
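The core of that kind of job is a reconciliation step: compare the instances that currently exist with the alarms that currently exist, then create what is missing and delete what is stale. This is not the script described above, just a plain-Python sketch of the idea; the function name and the "high-cpu-" naming prefix are assumptions.

```python
def plan_alarm_changes(instance_ids, existing_alarm_names, prefix="high-cpu-"):
    """Return (alarms_to_create, alarms_to_delete) as sorted lists of alarm names.

    Only alarms whose name starts with `prefix` are treated as managed, so
    unrelated alarms are never scheduled for deletion.
    """
    wanted = {f"{prefix}{instance_id}" for instance_id in instance_ids}
    managed = {name for name in existing_alarm_names if name.startswith(prefix)}
    to_create = sorted(wanted - managed)
    to_delete = sorted(managed - wanted)
    return to_create, to_delete


if __name__ == "__main__":
    create, delete = plan_alarm_changes(
        ["i-a", "i-b"],
        ["high-cpu-i-b", "high-cpu-i-old", "unrelated-alarm"],
    )
    print(create)
    print(delete)
```

Terraform and cdktf do this diffing for you from the declared state, which is where the automatic cleanup of unwanted alarms comes from.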