365 days * 24 hours * 0.0001 is roughly 53 minutes, so it already lost the 99.99% status.
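Quick back-of-envelope in Python (nothing AWS-specific, just the downtime budget each availability tier allows per year):

```python
# Yearly downtime budget for common availability targets.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

for target in (0.999, 0.9999, 0.99999):
    budget_minutes = HOURS_PER_YEAR * (1 - target) * 60
    print(f"{target:.3%} availability -> ~{budget_minutes:.0f} minutes of downtime per year")
```

That prints roughly 526, 53, and 5 minutes respectively; a multi-hour outage blows through the 99.99% budget many times over.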
We were more honest, and it probably cost us at least once in not getting business.
I don't think anyone quotes availability as "availability across every region I'm in", though?
While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.
They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should have failed their east region out and just be humming along.
Our company decided years ago to use any region other than us-east-1.
Of course, that doesn't help with services that are 'global', which usually means us-east-1.
1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"
2. It's the default. Many tutorials and articles online show it in their examples, and many deployment and other devops tools use it as a default value (see the sketch after this list).
3. Related to no. 2: AI models generate cloud configs and code examples with it unless asked otherwise.
4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture a European and North American audience from a single location, us-east-1 is a very good choice.
5. Many Amazon features are available in that region first and then spread out to other locations.
6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines, us-east-1 is the place to do it.
7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?
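On point 2, a minimal sketch of what "don't inherit the default" looks like in practice. This assumes boto3 and configured credentials; the region name is just an example, not a recommendation:

```python
import boto3

# If you never pin a region, you get whatever default your environment or
# tooling ships with, which in a lot of tutorials and templates is us-east-1.
# Pinning it explicitly makes the choice deliberate.
ec2 = boto3.client("ec2", region_name="eu-west-1")

regions = ec2.describe_regions()["Regions"]
print([r["RegionName"] for r in regions])
```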
It's the world's default hosting location, and today's outages show it.
I'm not sure a lot of companies are really comparing the cost of multi-region resiliency and hot failovers against just being down for 6 hours every year or so, and consciously writing that check.
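The back-of-envelope looks something like this; every number below is made up and will vary wildly by business, it's only the shape of the comparison:

```python
# Hypothetical inputs: tune to your own revenue and infra costs.
outage_hours_per_year = 6        # assumed: one big regional outage a year
cost_per_downtime_hour = 10_000  # assumed: lost revenue + SLA credits + eng time
multi_region_premium = 200_000   # assumed: yearly cost of hot failover (duplicate infra, replication, drills)

downtime_cost = outage_hours_per_year * cost_per_downtime_hour
print(f"expected downtime cost: ${downtime_cost:,}/yr")
print(f"multi-region premium:   ${multi_region_premium:,}/yr")
# With these made-up numbers, eating the outage is the cheaper check to write.
```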
I would think a lot of clients would want that.
Our stuff is all in us-east-1, and ops was a total shitshow today (mostly because many 3rd party services besides AWS were down/slow), but our prod service was largely "ok": <5% of customers were significantly impacted, because existing instances got to keep running.
I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.
We definitely learnt something here about both our software and our 3rd party dependencies.
That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).
However, that number is back up with a vengeance: 9,230 reports as of 9:32 AM Pacific (12:32 PM Eastern).
Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.
Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status
When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.
https://health.aws.amazon.com/health/status?path=open-issues
The closest thing to an identified root cause seems to be this one:
"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
I.e., lots of folks who weren't expected to work today, and/or having to round them up to work the problem.
In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate this sort of problem.
If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.
If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.
Meanwhile, you can turn the power back on, but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.
By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.
Rest and vest CEOs
He got a lot of impossible shit done as COO.
They do need a more product-minded person, though. If Jobs were still around we’d have smart jewelry by now. And the Apple Watch would be thin af.
A lot of these are second-order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions for everything.
When the NAS shit the bed, we lost half of production and all our run books. And we didn’t have autoscaling yet. Wouldn’t for another 2 years.
Our group is a bunch of people who have no problem getting angry and raising their voices. The whole team was so volcanically angry that it got real quiet for several days. Like everyone knew if anyone unclenched that there would be assault charges.
I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOTT US - I BET YOU CAN’T - WE MAKE EVERYTHING.
Same meme would work for AWS today.
Somewhat common. Comes from the US military in WW2.