Cloudflare having some significant issues as well on certain domains.
I suspect DownDetector itself suffered some outages during this period, which it shows as outages of every service it monitors.
Eating your own dog food shows confidence, but monitoring is a different dimension; there you need to use anything but your own dog food.
It seems to have all the look and feel of AWS, and somehow has more up to date info than the official AWS status page?
Will large players flee because of excessive instability? Or will smaller players go from single-AZ to more expensive multi-AZ?
My guess is that no-one will leave and lots of single-AZ tenants who should be multi-AZ will use this as the impetus to do it.
Honestly, having events like this is probably good for the overall resilience of distributed systems. It's like an immune system, you don't usually fail in the same way repeatedly.
There is no possibility that outages are good for AWS. Nor is there more money to be made from "publicity" of the outages.
In the next 5 calendar years the bottom line will still grow.
However, the brand damage means they permanently lose market share, which impacts their growth ceiling.
Yes! When you have a service interruption pay 2x more! With a region down I am sure other regions won't have any interruptions either! /s
There is zero excuse for this shit. Be professional. Acknowledge reality. It is logically impossible to run your own status page. Trying to do so just wastes everyone else on the internet's time when you have an outage.
This morning we saw some weird behavior in us-west-2, our traffic just _vanished_. I thought: there is no way this is us.
Went to https://status.aws.amazon.com/
Top of the board showed “Internet Connectivity Issues (Oregon)”
And that was that. The board worked exactly as it should - it immediately explained my missing traffic and kept me up-to-date with the status of the outage on their side.
AWS seems to be working for me, but I’ve worked with clients in the US and Spectrum internet tended to drop connections to us sporadically, which looks like an outage to our clients but is something we obviously can’t control.
(This is two similarly spec'd boxes on us-east-2 and us-west-2). Looking at GeoIP of connecting clients, the only pattern I can see is the region itself.
I guess it could be an ISP thing but I guess we're all assuming 80/20.
"7:42 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region."
https://status.aws.amazon.com/
Edit: They added US-WEST-1:
"7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region."
Edit: Found root cause, maybe?
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
Their CDN, CloudFront, always works reliably for me. Couldn't they put the status page on CloudFront?
[1] https://aws.amazon.com/blogs/networking-and-content-delivery...
It amazes me how many projects exist that don't even have multi-region capability, let alone no single point of failure.
All to go from, idk, 99.9% uptime to 99.95% (throwing out these numbers)? The thing is when AWS goes down so much of the internet goes down that companies don't really get called out individually.
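For a sense of scale, here is a quick sketch of what those two (admittedly thrown-out) numbers mean in hours of downtime per year. The function name is made up for illustration, and it assumes a plain 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service is allowed to be down at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.95):
    print(f"{pct}% uptime allows about {downtime_hours_per_year(pct):.2f} hours of downtime a year")
```

So the jump from 99.9% to 99.95% buys back roughly four and a half hours a year, which is the number to weigh against the cost of going multi-region.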
Some users are clueless, but the clueless users average out over time and the spikes make it clear when there are actual issues.
Is this a thing?
What?
My sites run on Cloudflare and Vercel, and I can't even log in to those right now.
I'm curious — what does Hacker News run on? It seems impervious to any kind of downtime...
On a dirty, disgusting dedicated server.
Pobody's nerfect.
Gov Cloud Status Page: https://status.aws.amazon.com/govcloud
Is this enough of a push for organizations to actually move over their infrastructure to other providers?
[0] https://downdetector.com/status/crunchyroll/
[1] https://downdetector.com/status/aws-amazon-web-services/
A lot of websites use a cache in front of databases (or template rendering engines, or many other systems). That cache might evict entries based on time - after 5 minutes, the entry is considered invalid.
But that means that if you have no traffic for 10 minutes, the cache completely empties. Then when traffic returns, it all skips the cache and actually triggers a real hit to the backend - which is now overwhelmed with traffic. The cache protects the backend in normal behavior, but now it's not doing its job, so the backend has many more requests than usual.
In the worst case, those requests are enqueued in a big serial sequence... but the ones at the back of the queue may time out. The client may do something like say "it's taken me 5 seconds and I still don't have a response - I'll abort and retry!" and now you have even _more_ traffic to deal with.
So cold caches and retries can conspire to keep a service down for a long time even after the root cause is fixed.
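A toy simulation of the cold-cache half of this, in case it helps. Everything here (the class names, the 100 pages, the request times) is invented for illustration, and it only counts backend hits; it does not model queueing or client retries.

```python
class TTLCache:
    """Cache whose entries expire a fixed number of seconds after being stored."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, time stored)

    def get(self, key, now, load_from_backend):
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # fresh hit: the backend is not touched
        value = load_from_backend(key)  # missing or expired: real backend hit
        self.entries[key] = (value, now)
        return value


class Backend:
    """Stand-in for a database or template renderer; just counts calls."""

    def __init__(self):
        self.calls = 0

    def load(self, key):
        self.calls += 1
        return f"rendered:{key}"


def backend_calls_per_tick(request_times, keys, ttl_seconds=300):
    """Request every key at each time; return the backend calls made at each time."""
    cache = TTLCache(ttl_seconds)
    backend = Backend()
    result = []
    for now in request_times:
        before = backend.calls
        for key in keys:
            cache.get(key, now, backend.load)
        result.append(backend.calls - before)
    return result


keys = [f"/page/{i}" for i in range(100)]
# Traffic every 60 s, then a 10-minute gap (the outage), then traffic returns.
times = [0, 60, 120, 180, 240, 840, 900]
print(backend_calls_per_tick(times, keys))
```

With a 5-minute TTL the backend sees no calls while traffic is flowing, then takes all 100 at once when requests come back after the gap. Add clients that time out and retry on top of that and the load only gets worse.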
All sorts of issues still unresolved for years, including the ridiculously annoying "Finishes playing season English sub, autoplays first season of German dub, which then gets stuck". Still no profiles (nerfing their super-premium offering). Auto-resume points are unreliable, the Android app is hot garbage at dealing with network disruption...
I can only imagine their back-end is mostly Visual Basic running on a single AWS-powered VM.
The response is that this actually works well enough, so the investment required has not pushed anyone to do it (with that meaning building the core infrastructure to make that easy).
Slack wasn't sending messages and Pagerduty was throwing 500's.
This cloud-for-everything-even-local-devices thing is both hilarious and sad.
I wonder if anyone had trouble doing their dishes or laundry today, because I'm sure someone thought dish washers and washing machines needed cloud.
The thing that really gets me is the reports from the last major outage a few days ago about how pervasive lying inside the company is. This really doesn't work well for engineering and we're possibly seeing the results of that. We should certainly expect to see that becoming visible the more time goes on without a major cultural shift. Which given that the guy who ran AWS now runs all of Amazon.com....
Discourse is reporting trouble, too. https://twitter.com/DiscourseStatus/status/14711403698992906...
> AWS Internet Connectivity (Oregon): 7:42 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region.
Source: https://status.aws.amazon.com
Let's consider RockCo and CloudCo. They both provide a B2B SAAS that is mostly used interactively during the working day, and mostly used via API calls for the rest of the working week. Demand is very much lower on weekends. Both RockCo and CloudCo were founded with a team of six people: a CEO who does sales, a CTO who can do lots of technology things, three general software developers, and one person who manages cloud services (for CloudCo) or wrangles systems and hosting (for RockCo).
In the first year, CloudCo spends less on computing than RockCo does, because CloudCo can buy spot instances of VMs in a few minutes and then stop paying for them when the job is done. RockCo needs a month to significantly change capacity, but once they've bought it, it is relatively cheap to maintain.
In the second year, they are both growing. CloudCo buys more average capacity, but is still seeing lots of dynamic changes. RockCo keeps growing capacity.
In the third year, they're still growing. CloudCo is noticing that their bills are really high, but all of their infrastructure is oriented to dynamic allocation. They start finding places where it makes sense to keep more VMs around all the time, which cuts the costs a little. RockCo can't absorb a dynamic swing, but their bills are now significantly lower every month than CloudCo's bills, and the machines that they bought two years ago are still quite competitive. A four year replacement cycle is deemed reasonable, with capacity still growing. And bandwidth for RockCo is much cheaper than the same bandwidth for CloudCo.
Who's going to win?
Well, you can't tell. If they both got unexpectedly sudden growth surges, RockCo might not have been able to keep up. If they both got unexpected lulls, CloudCo might have been able to reduce spending temporarily. RockCo spent more up front but much less over the long term. CloudCo could have avoided hiring their cloud administrator for several months at the beginning. RockCo's systems and network engineer is not cheap. And so on, and so forth.
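To put rough shape on the "more up front, less over the long term" part, here is a minimal sketch with entirely made-up figures. It assumes flat monthly bills and a single hardware purchase, so it leaves out growth, surges, lulls, and salaries, which are exactly the things that make the real answer unknowable.

```python
def cumulative_cost(upfront, monthly, months):
    """Total spend after `months`: a one-off purchase plus a flat monthly bill."""
    return upfront + monthly * months


def break_even_month(upfront_a, monthly_a, upfront_b, monthly_b, horizon=120):
    """First month where option A's total spend is no higher than option B's."""
    for month in range(1, horizon + 1):
        cost_a = cumulative_cost(upfront_a, monthly_a, month)
        cost_b = cumulative_cost(upfront_b, monthly_b, month)
        if cost_a <= cost_b:
            return month
    return None  # A never catches up within the horizon


# Hypothetical numbers, not from any real company:
# RockCo pays 120,000 for hardware, then 4,000 a month to run it.
# CloudCo pays nothing up front, then 9,000 a month.
print(break_even_month(120_000, 4_000, 0, 9_000))
```

With these invented numbers RockCo catches up at month 24. Change the inputs a little and the crossover moves by years or disappears, which is the point.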
[07:42 AM PST] We are investigating Internet connectivity issues to the US-WEST-2 Region.
Website outages in the past hour 86,967
Lowest 16,208
Average 16,208
Highest 16,209
edit: some streams back up, chat still buggy as of 09:55 local time
edit2: appears to be back ~10:00 local time
This appears to be cross-provider.
Edit: We have IPv6 back.
EDIT: Cognito auth seems down for us too
EDIT2: our ALBs are timing out as well
EDIT3: us-west-1 looks like it's working now!
e.g. our services running on AWS are fine right now, but new sessions dependent on Auth0 are not.
Seems like a really bad idea.
* EC2 instances
* AWS Workspaces
* FSx for Windows
* AWS Directory Service
* S3 Buckets
"No, no, of course not"
"Should I check?"
"No, don't waste time checking, get back to your TPS reports"
Seems to be down in a major way. Lots of various AWS services are down. However, so many things depend on AWS that it could just be EC2 is down and it is causing a ripple effect.
One issue is that outbound requests from our servers in us-west-2 time out. Other than that, it seems that we are running OK so far.
HTTP Error 500 internal server error
https://www.thousandeyes.com/blog/aws-outage-analysis-decemb...
https://azycqgvwjz.share.thousandeyes.com/view/tests/?roundI...