Mistakes happen all the time, but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.
I've been an SRE on a tier 1 GCP product for over three years, and this is not the case. In my experience, our systems have only gotten more reliable and easier to understand since I joined.
It's not like there are only a few key knowledge holders who single-handedly prevent outages from happening. In reality, you don't need to know shit about how a system works internally to prevent outages if things are set up correctly.
In theory, my dog should be able to sit on my laptop and submit a prod-breaking change without any fear that it will make it to prod and damage the system, because automated testing/canary should catch it. If it does make it to prod, we should be able to detect and mitigate it using probes or whitebox monitoring before the change affects more users.
This happens for 99.9% of potential issues and is completely invisible to users. However, it's what's not caught (the remaining 0.1%) that actually matters.
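For anyone who hasn't seen a canary gate, here is a minimal sketch of the idea in Python. It is not how any real GCP or AWS pipeline works; the names (`Metrics`, `canary_verdict`) and every threshold are made up for illustration. It compares the canary's error rate against the baseline's and decides whether to wait for more traffic, promote, or roll back.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts for one group of servers (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has arrived yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Metrics,
    canary: Metrics,
    max_ratio: float = 2.0,    # assumed: canary may be at most 2x worse
    min_requests: int = 100,   # assumed: minimum sample before deciding
    floor: float = 0.001,      # assumed: tolerated rate if baseline is ~0
) -> str:
    """Return "wait", "promote", or "rollback" for a canary rollout."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge either way
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    baseline = Metrics(requests=10_000, errors=10)
    print(canary_verdict(baseline, Metrics(requests=500, errors=50)))
```

The point of the `min_requests` guard is that the dog's change is never promoted on a handful of lucky requests; the gate only answers once it has a sample worth comparing.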
Google doesn’t have nearly as hard a time retaining good people as Amazon does.
"What were your duties at your last position?" "Performing the daily ministrations and singing the praise of the machine god."
The Great Resignation had to have taken a huge toll on regular enterprises. There are probably going to be some unlucky (or lucky, depending on how hardcore they are) people in the position of maintaining aging legacy systems and retrofitting them into the future.
COBOL, for example, is becoming a lucrative language for people in the financial and insurance industries. Legacy Java is all over the place, I’m sure. Legacy .NET is in the middle of a huge industry retrofit (.NET 5 was the official post-legacy rebrand and they’re on to .NET 6+ now).
The Great Resignation was people leaving jobs they didn't want (front of house/service industry/gig) for jobs they did want (career-track jobs) after those jobs resumed hiring once the pandemic settled down.
Labor force participation went up, not down, due to 'the great resignation.'
What's causing people to believe that the latest round of attrition is any different?
If your company is hostile to people sticking around for decades, then it makes it that much less likely that you end up stuck with a machine that relies heavily on poorly documented tribal knowledge that's likely to start falling apart as your core people start cashing out.
The Great Recession
The Great Resignation
The Great Dying (due to COVID-19)
Of our senior engineers & team leads, 70% have joined in the last 6-9 months.
Only 3 full-time senior engineers have 2 years or more of tenure.
We've grown during COVID but we've also just burned through people.
Turnover has hit the point where we stopped doing going-away Zoom toasts... people just sort of disappeared.
That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.
I think the cloud business is robust: it's a fundamentally reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.
I've been in this situation at much smaller scales, and yes, you'll see a massive drop in productivity, but that's the cost of going from prototype to product.
We're slowly but surely converting the world's institutional technical knowledge into re-usable and automated runbooks.
This outage is reportedly impacting 5 services in 1 region.
For those impacted, pretty terrible. But as a heavy user of AWS, I’ve seen these notices posted multiple times on HN and haven’t been impacted by one yet.
us-east-1 is their largest region; someone told me it's 50% of their revenue.
multi-region failover is awfully, awfully expensive.
In my last 7 years I imagine we had ~1 outage a month on average from AWS failures, but who knows if my imagination is accurate.
A lot of institutional knowledge in these massive tech corporations is disappearing and we're starting to reach the tipping point.
Source?
Multi AZ isn't that hard, but generally requires extra costs (one nat gw per az, etc...)
But multi region in AWS is a royal pain in the ass. Many services (like SSO) do not play well with multi region setups, making things really complicated even if you IaCed your whole stack.
(I actually love that we have strategies and infrastructure for multi-region... it just tends to come up at scales and for applications where it is not justified.)
What's the point of cloud if we have to manage the robustness of the provider's own infrastructure? I can understand if that's due to natural disasters and earthquakes, but the idea should be that a single AZ should never go down barring extraordinary circumstances. AWS should be auto-balancing, handling downtimes of a single AZ without the customer ever noticing it.
It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, it will automatically route traffic through others. Transparent and painless to the customer. I understand AWS is huge, and different services have different redundancy mechanisms, but just conceptually it feels like they have a conflict of interest when it comes to increasing the robustness of their data centers: "We told you to have multi-AZ deployment, not our fault".
Another way to put this: as an AWS customer, make sure to factor a 3x multiplier on all costs, plus the management overhead of a multi-AZ deployment, into your total costs.
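To put rough numbers on what that multiplier might buy, here is a back-of-the-envelope sketch in Python. It assumes AZ failures are independent, which real regional incidents often are not, and the 99% per-AZ figure is a made-up input rather than anything AWS publishes. The function names are my own.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(per_az: float, zones: int) -> float:
    """Availability when the service is up if at least one AZ is up.

    Assumes independent failures across AZs (an optimistic assumption).
    """
    return 1.0 - (1.0 - per_az) ** zones


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    per_az = 0.99  # hypothetical "two nines" for a single AZ
    for zones in (1, 2, 3):
        a = combined_availability(per_az, zones)
        print(f"{zones} AZ(s): {a:.6f} available, "
              f"{downtime_hours_per_year(a):.2f} h/year down")
```

Two nines in a single AZ works out to about 87.6 hours of downtime a year. Under the independence assumption each extra AZ cuts that by a factor of 100, but correlated failures (a bad control-plane push, for example) make the real gain smaller.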
Worth deliberating on. I’m curious as to what the lifetime cost of ownership for an on-prem data center is relative to lifetime cost of operating in the cloud.
It takes a while to find a Vice President, I guess.
Having worked at a few other large tech companies now -- Amazon's incident response process is honestly great. It's one of the things I miss about working there.
Do they acknowledge the problem?
It's been a joke for years how bad us-east-1 is.
It's the only way to be sure
It is a joke.
I would delete my parent comment if I could.
EDIT: Also, AWS Lambda seems to be down, and the AWS EC2 APIs are showing a very high error rate and slow machine startup times.
"If you're having SLA problems I feel bad for you son
I got two 9 problems cuz of us-east-1"