365 days * 24 hours * 0.0001 is roughly 53 minutes of allowed downtime per year, so an outage of ~8 hours means they already lost the 99.99% status.
If the server didn't work, the tool to measure it didn't work either! Genius.
When your SLA only holds because the measurement window is a joke, you know you goofed.
"Five nines, but you didn't say which nines. 89.9999...", etc.
The duration of the outage relative to that measurement window is (8 h / 33,602 h) * 100% ≈ 0.024%, so the uptime is 99.976%: slightly worse than 99.99%, but clearly better than 99.90%.
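If you want to sanity-check the arithmetic, here's a quick sketch; the 8 h outage and the 33,602 h window are the numbers from above, everything else is just the standard availability formula:

    # Quick availability arithmetic; numbers match the thread above.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    def allowed_downtime_hours(target: float, window_hours: float = HOURS_PER_YEAR) -> float:
        """Downtime budget for a given availability target over a window."""
        return window_hours * (1 - target)

    def availability(outage_hours: float, window_hours: float) -> float:
        """Observed availability given total outage time in a window."""
        return 1 - outage_hours / window_hours

    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%} over a year allows "
              f"{allowed_downtime_hours(target):.2f} h of downtime")

    # ~8 h outage over the ~33,602 h window discussed above:
    print(f"observed availability: {availability(8, 33_602):.4%}")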
They used to be five nines, and people used to say it wasn't worth their while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/
The reason is that the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity." Instances that were already created kept working, so this isn't covered; the SLA doesn't cover creation of new instances.
[0] Fraction is ~ 1
The refund they give you isn't going to make a dent in the lost revenue.
We were more honest, and it probably cost us at least once in not getting business.
If you as a customer ask for five 9s per month, with a service credit of 10% of at-risk fees for missing it, on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.
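A back-of-the-envelope sketch of that amortisation; the 10% credit and 30% gross margin come from the comment above, the fee and miss probability are made up:

    # Rough SLA-credit amortisation. All inputs except the 10% credit
    # and 30% GM are illustrative guesses, not real contract numbers.
    monthly_fee = 100_000      # hypothetical at-risk fee per month
    credit_rate = 0.10         # service credit: 10% of at-risk fees
    gross_margin = 0.30        # seller's gross margin on the deal
    p_miss = 0.05              # guess: chance of missing five 9s in a given month

    expected_credit = monthly_fee * credit_rate * p_miss
    print(f"expected credit cost: ${expected_credit:,.0f}/month "
          f"({expected_credit / monthly_fee:.2%} of the fee)")
    print(f"gross margin on the deal: ${monthly_fee * gross_margin:,.0f}/month")
    # With these made-up numbers the expected payout is ~0.5% of the fee,
    # which a 30% margin can absorb -- or it can simply be priced in.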
I don't think anyone would quote availability as "availability across every region I'm in", would they?
While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.
They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should be able to ride out their east region being down and just keep humming along.
Our company decided years ago to use any region other than us-east-1.
Of course, that doesn't help with services that are 'global', which usually means us-east-1.
1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"
2. It's the default. Many tutorials and articles online show it in the examples, and many deployment and other devops tools use it as a default value (see the sketch below).
3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.
4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture a European and North American audience from a single location, us-east-1 is a very good choice.
5. Many Amazon features are available in that region first and then spread out to other locations.
6. It's also a region where other cloud providers and hosting companies offer their services; often there's space available in a data center not far from the racks AWS runs on. In hybrid-cloud scenarios where you want to connect bits of your infrastructure running on AWS to some physical hardware via dedicated fiber-optic lines, us-east-1 is the place to do it.
7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?
It's the world's default hosting location, and today's outages show it.
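To make point 2 concrete, here's a minimal boto3 sketch; the us-east-2 fallback and the env-var handling are just illustrative choices, not a recommendation of any particular region:

    # Sketch of point 2: tutorials and generated configs tend to hardcode us-east-1.
    import os
    import boto3

    # What many copy-pasted examples do:
    s3_tutorial_style = boto3.client("s3", region_name="us-east-1")

    # Making the region an explicit, overridable choice instead:
    region = os.environ.get("AWS_REGION", "us-east-2")  # fallback is illustrative
    s3 = boto3.client("s3", region_name=region)
    print(s3.meta.region_name)  # confirms which region this client targets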
In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?
> Europe-friendly
Why not us-east-2?
> Many Amazon features are available in that region first and then spread out to other locations.
Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.
> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.
This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)
For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.
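A rough sanity check on the latency point; the distances are ballpark great-circle figures and real fiber paths run longer, so treat this as an order-of-magnitude sketch:

    # Order-of-magnitude fiber latency: light in fiber covers roughly
    # 200 km per millisecond (about 2/3 of c). Distances below are rough
    # great-circle estimates; real fiber routes add 20-50% or more.
    FIBER_KM_PER_MS = 200.0

    routes_km = {
        "Virginia <-> New York": 500,
        "Virginia <-> London": 6_000,
        "Montreal <-> New York": 540,
        "Montreal <-> London": 5_200,
    }

    for route, km in routes_km.items():
        one_way_ms = km / FIBER_KM_PER_MS
        print(f"{route}: ~{one_way_ms:.1f} ms one way, ~{2 * one_way_ms:.1f} ms RTT")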
This is the biggest one, isn't it? I thought Route 53 isn't even available in any other region.
And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.
However, today we’ve seen a failure that has wiped out a huge number of companies globally, used by hundreds of millions - maybe billions - of people, all at the same time. AWS has something like 30% of the infra market, so you can imagine (and most people reading this will to some extent have experienced) the scale of the disruption.
And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?
I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.
Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.
I'm not sure a lot of companies have really looked at the cost of multi-region resiliency and hot failovers vs. being down for 6 hours every year or so, and then written that check.
I would think a lot of clients would want that.
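A crude way to frame that trade-off; every number below is a placeholder rather than anyone's real figures:

    # Crude single-point comparison of "pay for multi-region" vs "eat the
    # outage". Every number here is a placeholder for illustration only.
    hourly_revenue_at_risk = 50_000             # revenue lost per hour of downtime
    expected_outage_hours_per_year = 6          # roughly one bad day a year
    multi_region_extra_cost_per_year = 400_000  # extra infra + engineering effort

    expected_downtime_cost = hourly_revenue_at_risk * expected_outage_hours_per_year
    print(f"expected downtime cost/year:  ${expected_downtime_cost:,.0f}")
    print(f"multi-region extra cost/year: ${multi_region_extra_cost_per_year:,.0f}")
    # With these made-up numbers, writing the 6-hour check is the cheaper
    # option; the calculus flips as the revenue at risk goes up.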
On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.
For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.
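For a sense of scale, a rough sketch; the $0.09/GB figure is the commonly quoted first-tier internet egress rate (check current pricing), and the traffic volume is made up:

    # Ballpark AWS egress cost. $0.09/GB is the commonly cited first-tier
    # internet data-transfer-out rate; verify against current pricing.
    # The monthly volume is a made-up example.
    egress_rate_per_gb = 0.09
    monthly_egress_gb = 50_000  # e.g. 50 TB/month out to the internet

    monthly_cost = monthly_egress_gb * egress_rate_per_gb
    print(f"~${monthly_cost:,.0f}/month just for data leaving AWS")
    # Cross-region replication traffic is billed too (at a lower per-GB
    # rate), which is why replication topology becomes a design constraint.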
us-east-1 is so the government can slurp up all the data. /tin-foil hat
The other concerns could have to do with the impact of failover to the backup regions.
Our stuff is all in us-east-1. Ops was a total shitshow today (mostly because many 3rd-party services besides AWS were down or slow), but our prod service was largely "ok": fewer than 5% of customers were significantly impacted, because existing instances got to keep running.
I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.
We definitely learnt something here about both our software and our 3rd party dependencies.
That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).
However, that number is back up with a vengeance: 9,230 reports as of 9:32 AM Pacific (12:32 PM Eastern).
Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.
Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status
When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.
https://health.aws.amazon.com/health/status?path=open-issues
The closest to their identification of a root cause seems to be this one:
"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
I.e. lots of folks who weren't expected to work today, and/or the scramble to round them up to work the problem.
In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate this sort of problem.
If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.
There are 153k Amazon employees based in India according to LinkedIn.
If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.
Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.
I would honestly do your box option. Stuff it in there with some pillows and leave it in the shed for a while.
By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.
Rest and vest CEOs
He got a lot of impossible shit done as COO.
They do need a more product minded person though. If Jobs was still around we’d have smart jewelry by now. And the Apple Watch would be thin af.
A lot of these are second order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions to everything.
When the NAS shit the bed, we lost half of production and all our run books. And we didn’t have autoscaling yet. Wouldn’t for another 2 years.
Our group is a bunch of people that has no problem getting angry and raising voices. The whole team was so volcanically angry that it got real quiet for several days. Like everyone knew if anyone unclenched that there would be assault charges.
I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOTT US - I BET YOU CAN’T - WE MAKE EVERYTHING.
The same meme would work for AWS today.
Not really, there are enough alternatives.
And it’s not like there aren’t other brands of chocolate either…
https://en.wikipedia.org/wiki/List_of_military_slang_terms#F...
Not to be confused with "Foobar" which apparently originated at MIT: https://en.wikipedia.org/wiki/Foobar
TIL, an interesting footnote about "foo" there:
'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'
There are documented uses of FUBAR back into the '40s.
Somewhat common. Comes from the US military in WW2.