Setup AWS Cloudwatch Monitoring and Alerts Using Bash Scripts

(themythicalengineer.com)

39 points by sks147 4y ago | 30 comments

amirkdv 4y ago

We've been looking at making CloudWatch (CW) alarms an automated part of our infra. Here are some findings that may help:

- The semantics of CW seem convoluted, but once you stare at the API docs for long enough, the core concepts are easy to grok: Metrics (data points regularly submitted from a machine to CW), Alarms (abstractions that define alarm logic based on the behavior of Metrics), and SNS Topics (what to do when an Alarm goes off; the subscriber could be just an email address).

- Once you get the data model right, all implementations (click ops, terraform, bash via awscli, boto3, etc.) are virtually identical.

- Some Metrics come for free, e.g. CPU usage is reported by any EC2 instance to CW. For some other Metrics, notably disk and memory usage, you need to configure your instance to report them to CW. This is where the OP's monitoring scripts come in.

- The monitoring scripts and the cron config the OP refers to are deprecated [0]. Instead there's a new CloudWatch Agent [1]: you install the package on your EC2 instances, provide a configuration file to it, and you're set.

[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scri...

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
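To make the data model above concrete, here is a minimal Python sketch of how the three pieces map onto a single alarm definition. The keys are the parameter names of CloudWatch's `put_metric_alarm` call as exposed by boto3; the alarm name, instance ID, threshold, and topic ARN are made-up placeholders, and the helper names are hypothetical. The builder is plain Python; only `create_alarm` needs boto3 and AWS credentials.

```python
def cpu_alarm_params(instance_id, topic_arn, threshold=85.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        # The Metric: which time series to watch (CPU comes for free on EC2).
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # The Alarm: how the metric is evaluated.
        "Statistic": "Average",
        "Period": 300,  # seconds per datapoint
        "EvaluationPeriods": 2,  # consecutive datapoints that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS Topic: where the notification goes when the alarm fires.
        "AlarmActions": [topic_arn],
    }


def create_alarm(params):
    """Send the definition to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without it installed

    boto3.client("cloudwatch").put_metric_alarm(**params)


params = cpu_alarm_params(
    "i-0123456789abcdef0",  # hypothetical instance ID
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic ARN
)
print(params["AlarmName"])
```

The same dictionary is roughly what a Terraform resource, an awscli invocation, or a console form would carry, which is the sense in which the implementations converge.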

jasonpeacock 4y ago

You can install the CWAgent on any host, they don't have to be EC2 instances:

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...
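For a sense of what the agent's configuration file looks like, here is a small Python sketch that generates one. It assumes you only want memory and root-disk usage, the two metrics mentioned upthread as not coming for free; the 60-second interval and the `/` mount point are illustrative choices, not recommendations.

```python
import json

# Minimal CloudWatch Agent configuration: report memory and disk usage.
agent_config = {
    "agent": {
        "metrics_collection_interval": 60,  # seconds; illustrative value
    },
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {
                "measurement": ["used_percent"],
                "resources": ["/"],  # mount points to report on
            },
        }
    },
}

# The agent reads its configuration as JSON.
config_json = json.dumps(agent_config, indent=2)
print(config_json)
```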

coredog64 4y ago

High CPU alerts are terrible alerts. If I'm paying per instance, I want CPU utilization to be high. If it's low, I'm wasting money. So now what I need is an alert for when it's not just high, but somewhere between "high" and "too high". You know, like when there's an arbitrary spike because Java is doing some GC. Or you have a one-minute spike of traffic that fires an Opsgenie alert at 2am but auto-clears between when the on-call engineer wakes up and when they log in to check.

For the love of $DEITY, if you're going to set up CloudWatch monitoring, create custom metrics that map to your business outcomes and alert when those go off the rails.
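As a rough illustration of a custom business metric, here is a Python sketch that builds the payload for CloudWatch's `put_metric_data` call. The namespace, metric name, and `Service` dimension are invented for the example; only the parameter names come from the API. The builder runs without AWS access, while `publish` needs boto3 and credentials.

```python
from datetime import datetime, timezone


def business_metric(name, value, service, unit="Count"):
    """Build keyword arguments for CloudWatch's put_metric_data call."""
    return {
        "Namespace": "MyCompany/Business",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": [{"Name": "Service", "Value": service}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


def publish(payload):
    """Send the datapoint to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without it installed

    boto3.client("cloudwatch").put_metric_data(**payload)


# Hypothetical example: 42 orders completed by the checkout service.
payload = business_metric("OrdersCompleted", 42, service="checkout")
print(payload["MetricData"][0]["MetricName"])
```

An alarm on a metric like this fires when the business outcome degrades, regardless of whether CPU happens to be high or low at the time.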

sks147 (OP) 4y ago

You might want to have two separate alerts for this problem, one labelled WARNING and the other CRITICAL, such as 60 percent CPU usage as a warning and 85 percent CPU usage as a critical situation. You can have two separate SNS Topics for warning and critical alerts. Warning alerts can be sent to a Slack channel and critical alerts can be configured to invoke the pager.
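A sketch of the two-tier setup in Python, assuming boto3-style `put_metric_alarm` parameters. The 60 and 85 percent thresholds come from the comment above; the alarm names, instance ID, and topic ARNs are placeholders. `DatapointsToAlarm` is an addition not mentioned above: requiring 2 breaching datapoints out of 3 is one way to keep a single short spike from firing either tier.

```python
def tiered_cpu_alarms(instance_id, warning_topic_arn, critical_topic_arn):
    """Build one WARNING and one CRITICAL alarm definition for an instance."""
    tiers = [
        ("WARNING", 60.0, warning_topic_arn),
        ("CRITICAL", 85.0, critical_topic_arn),
    ]
    return [
        {
            "AlarmName": f"{label}-cpu-{instance_id}",
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 3,
            "DatapointsToAlarm": 2,  # 2 of the last 3 datapoints must breach
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [topic_arn],  # each tier notifies its own topic
        }
        for label, threshold, topic_arn in tiers
    ]


alarms = tiered_cpu_alarms(
    "i-0123456789abcdef0",  # hypothetical instance ID
    "arn:aws:sns:us-east-1:123456789012:cpu-warning",  # hypothetical ARN
    "arn:aws:sns:us-east-1:123456789012:cpu-critical",  # hypothetical ARN
)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["Threshold"])
```

Each dictionary could then be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`.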

mulmen 4y ago

What do you do with the Slack messages?

vergessenmir 4y ago

Custom CloudWatch metrics are expensive to write to, which makes them useful mainly for coarse-grained, high-level service metrics. If you can afford it, go ahead, but setting up some other cloud-native monitoring service may be the way to go.

jeppesen-io 4y ago

Certainly not perfect, but I've had very good success alerting on load average over 120 to 150 percent of core count.

What's nice is that it catches a variety of disk issues as well.

I'm sure it's not perfect for all cases, but for me it covers most of them.
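The load-average rule is easy to sketch in Python. The function name and the default ratio of 1.2 (the low end of the 120 to 150 percent range above) are choices made for this example; reading the live load average relies on `os.getloadavg()`, which is only available on Unix-like systems.

```python
import os


def load_is_high(load_avg, cores, ratio=1.2):
    """Return True when the load average exceeds `ratio` times the core count."""
    return load_avg > ratio * cores


try:
    # Returns the 1-, 5-, and 15-minute load averages.
    _, five_min, _ = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    print(f"5-min load {five_min:.2f} on {cores} cores:",
          "ALERT" if load_is_high(five_min, cores) else "ok")
except (AttributeError, OSError):
    # getloadavg() is missing on Windows and may fail on some systems.
    print("load average not available on this platform")
```

Because load average counts processes waiting on I/O on Linux as well as those waiting on CPU, a check like this also trips when a disk stalls, which matches the observation above.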

dimitar 4y ago

If you are running some software that requires an instance but is not expected to create load, you can put it on a burstable instance and set up such an alert, so you know when it is time to upgrade.

orf 4y ago

Not sure why you’d ever do this instead of using terraform.

hughrr 4y ago

As someone who spends two hours a day dealing with buggered terraform state and upgrading terraform and dealing with terraform bugs I can see it.

It’s one of those things that really works pretty well but there are enough edge cases to make it slightly soul sucking.

gizdan 4y ago

This sounds like a lack of understanding of terraform. We use Terraform pretty heavily and I've rarely seen bad states across our whole org, and the few that I do see are usually people who don't know the core concepts (often non-devops engineers).

Terraform has its faults, but it is the best in its class, especially when you need to manage infrastructure beyond a single cloud provider (e.g. we manage our Datadog monitors and dashboards, PagerDuty alerts and much more). The only other thing that would probably thrash it is Pulumi, which has similar concepts, except you can use many different languages as opposed to HCL (no, CDK doesn't count, because it is still very immature and last I checked it only supported one or two languages).

ldoughty 4y ago

Agreed, I swapped my team from Terraform to Ansible to SAM... SAM has been the most reliable, resilient, and stable for my use cases (general serverless).

orf 4y ago

Would it be more soul sucking than emulating it with bash, as the article is almost suggesting?

manderson89 4y ago

If you don't like Terraform then you should use CloudFormation, not bash scripts.

miyuru 4y ago

Yes.

Using Terraform for this is great because it removes the unwanted alarms.

I had to create alarms when the instances auto scale, so I wrote a Python script using cdktf, and now a Jenkins job handles it. It even updates the CloudWatch dashboard.
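The "removes the unwanted alarms" part is the reconciliation that a declarative tool does for you. As an illustration only (this is not the cdktf script described above), here is a Python sketch of the same idea: compare the alarms you want against the alarms that exist and derive what to create and what to delete. The `high-cpu-` naming prefix and the instance IDs are assumptions made for the example.

```python
def plan_alarm_changes(running_instance_ids, existing_alarm_names, prefix="high-cpu-"):
    """Return (to_create, to_delete) alarm names for per-instance alarms.

    Only alarms whose names start with `prefix` are considered managed;
    anything else is left alone.
    """
    desired = {f"{prefix}{instance_id}" for instance_id in running_instance_ids}
    managed = {name for name in existing_alarm_names if name.startswith(prefix)}
    to_create = sorted(desired - managed)
    to_delete = sorted(managed - desired)  # alarms for instances that are gone
    return to_create, to_delete


# Hypothetical state after a scale-in/scale-out cycle.
to_create, to_delete = plan_alarm_changes(
    running_instance_ids=["i-aaa", "i-bbb"],
    existing_alarm_names=["high-cpu-i-bbb", "high-cpu-i-ccc", "billing-alarm"],
)
print("create:", to_create)
print("delete:", to_delete)
```

With boto3, the existing names could come from `describe_alarms` and the stale ones could be passed to `delete_alarms`; a bash script that only ever creates alarms has to implement this bookkeeping by hand.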

qvrjuec 4y ago

Or CDK... If you're writing code to generate infra, why jump through more hoops than you need to?

sks147 (OP) 4y ago

Until the team onboards a terraform expert, these scripts might be helpful and cheaper to implement.

orf 4y ago

“Until the team reads the terraform QuickStart, these scripts will continue to make their infra a hellscape to manage”

ranguna 4y ago

AWS Cloudwatch Monitoring & Alerts using CDK?
