Ask HN: Were you able to mitigate the impact of the AWS us-east-1 incident? How?

7 points · melor · 9y ago · 4 comments

deftnerd · 9y ago

I run my infrastructure on three different providers and use GeoIP-aware anycast DNS servers from another provider.

Asia/Australia runs on DigitalOcean, Europe on OVH, and the Americas on AWS.

When someone requests the IP address of my site's front-end domain or static asset CDN domain, my nameserver determines their geographic location and returns the IP address of the closest resources to them.

I run health checks, so when S3 (which hosts my static assets for the Americas) went down, my nameservers stopped handing out the IP addresses of the Americas systems and started returning those of the Europe systems.

Once the health checks started passing again, everything restored itself.

Thanks to low DNS TTL values, users in the Americas were impacted for only a few minutes, and only if the IP was cached by their system.
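The lookup logic described above can be sketched roughly as follows. This is only an illustration, not the actual nameserver configuration: the region table, the fallback ordering, the `resolve` function, and the IP addresses (RFC 5737 documentation ranges) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    ips: tuple[str, ...]
    fallbacks: tuple[str, ...]  # other regions to try, in order of preference


# Hypothetical region table; the IPs are placeholders from RFC 5737.
REGIONS = {
    "americas": Region("americas", ("192.0.2.10",), ("europe", "asia")),
    "europe": Region("europe", ("198.51.100.10",), ("americas", "asia")),
    "asia": Region("asia", ("203.0.113.10",), ("europe", "americas")),
}


def resolve(client_region: str, healthy: set[str]) -> tuple[str, ...]:
    """Return the IPs of the closest region that currently passes health checks."""
    home = REGIONS[client_region]
    for name in (home.name, *home.fallbacks):
        if name in healthy:
            return REGIONS[name].ips
    # Nothing is healthy: answer with the home region rather than no records.
    return home.ips


# Normal operation: a client in the Americas gets the Americas IPs.
print(resolve("americas", {"americas", "europe", "asia"}))
# Americas health check failing: the same client is sent to Europe.
print(resolve("americas", {"europe", "asia"}))
```

With a short TTL on the records, clients pick up the changed answer within a few minutes of the health check flipping.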

mydpy · 9y ago

Which AWS services do you use?

deftnerd · 9y ago

S3 for static file hosting, EC2 for Caddy web server front ends (2+ depending on traffic needs), EC2 for a MySQL master (replicated to the MySQL master VPSes on DigitalOcean and OVH), and Elastic Load Balancer.

melor (OP) · 9y ago

We host a number of our customers' database systems on us-east-1.

What worked well for us (https://aiven.io):

- Architecturally relying on only a few cloud provider services (we only need VMs, disks, and object storage)

- Upfront investment in being able to move services from one region to another without downtime

- Pre-existing tooling for easily (manually) reconfiguring backup destinations on the fly

- Not running everything on just AWS

What did not work so well:

- Backups should automatically reroute to a secondary backup site on N consecutive failures

- Alert spam, need more aggregation

- New failure mode: extremely slow EBS access. Some affected VMs were kinda working, but very slowly; we need to create a separate alert trigger for this
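The first item in the list above (rerouting backups after N consecutive failures) could look something like this. It is a minimal sketch under stated assumptions, not the tooling described in the comment: `BackupRouter`, the `Uploader` callable type, and the default threshold of 3 are hypothetical names and values chosen for illustration.

```python
from typing import Callable

# Hypothetical uploader type: takes the backup bytes, raises an exception on failure.
Uploader = Callable[[bytes], None]


class BackupRouter:
    """Send backups to the primary site; switch to the secondary after N consecutive failures."""

    def __init__(self, primary: Uploader, secondary: Uploader, max_failures: int = 3) -> None:
        self.primary = primary
        self.secondary = secondary
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def using_secondary(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def upload(self, blob: bytes) -> str:
        """Upload one backup and return the name of the site that accepted it."""
        if not self.using_secondary:
            try:
                self.primary(blob)
                self.consecutive_failures = 0  # any success resets the streak
                return "primary"
            except Exception:
                self.consecutive_failures += 1
                if not self.using_secondary:
                    raise  # below the threshold: let the caller retry later
        # Threshold reached: reroute this and later backups to the secondary site.
        self.secondary(blob)
        return "secondary"
```

Failing back to the primary site is deliberately left out of the sketch; in practice that would probably hang off a health check or a manual reset of `consecutive_failures`.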
