https://www.theregister.com/2023/04/26/google_cloud_outage/
There is mention of a fire as well.
A few years ago I implemented a top to bottom ISO27k1 ISMS for a client handling extremely sensitive and mission-critical data for industry.
One risk I recommended controls for was that of a fire and/or flood at their primary datacentre for their client-facing offerings - this datacentre. I’ve experienced the misery of a datacentre oops myself, firsthand, twice, and it’s a genuine risk that has to be mitigated.
At my insistence, I had them burn hundreds of man-hours ensuring that they could fail over to a new environment in a different datacentre with a bare minimum of fuss, as what I arrived to was an all-the-eggs-in-one-basket situation. It took a fair bit of re-engineering of how deployments worked, how data was replicated, how the environment was configured - but they got there, and the ISMS was put into operation, and was audited cleanly by a reputable auditor, and everyone lived happily ever after.
Except… they were acquired by private equity. Who had no truck with all of this costly prancing about with consultants and systems. Risk register? Why do we need this? What value does it add today? ISO27k1? Don’t be silly. We have that certificate. You don’t need it. Dev team, ops team, leadership — almost everyone — ejected and replaced with a few support staff.
I see their sites are down.
Also, why batteries in a datacenter? When you implement a flush() command at the lowest level you're faced with two choices: 1) actually write to disk, then return from the call, 2) write to some cache/RAM and have just enough battery locally to ensure that you can write it to disk even if all power goes out.
Then there's the other problem of surviving long enough between a power interruption and diesel generators starting up. But this is a smaller problem, rebooting all instances in a datacenter is less bad than losing some data that was correctly flush()ed by software. Bad flush() behaviour can result in errors that cannot be recovered from without a complicated manual intervention (for example if it causes corrupted and unreadable database files).
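The flush() distinction above can be made concrete from userspace. A minimal sketch of choice (1) in Python: flush() alone only moves bytes from the process's buffer into the OS page cache, and it takes an explicit fsync to ask the kernel (and the drive below it) to persist them. The file path here is illustrative; note too that a drive with a volatile write cache can still lose the data after fsync unless it has the battery/capacitor backing the comment describes.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so it reaches stable storage, not just a cache."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # userspace buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk (or its battery-backed cache)

durable_write("/tmp/wal-entry.bin", b"committed")
```

If the power fails between flush() and fsync(), the page cache is gone and so is the write - which is exactly why storage that acknowledges a flush from RAM needs enough local battery to drain that RAM to disk.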
I think this is sort of big.
> Google designs zones to minimize the risk of correlated failures caused by physical infrastructure outages
And they have stated that the flood "caused a multi-cluster failure".
> Zones should be considered a single failure domain within a region.
(—GCP's documentation.)
A nice thing about EC2 is that you're getting a pretty dumb, predictable service. There have been multi-zone or global control plane issues but the physical metal has bona fide redundancy between zones/regions.
"GCE Global Control Plane: Experienced a global outage, which has been mitigated. Primary impact was observed from 2023-04-25 23:15:20 PDT to 2023-04-26 03:45:30 PDT and impacted customers utilizing Global DNS (gDNS). A secondary global impact for aggregated list operation failures for customers with resources in europe-west9 has also been mitigated. Please see migration guide for gDNS to Zonal DNS for more information: https://cloud.google.com/compute/docs/internal-dns#migrating... "
Embarrassing.
Alphabet is mostly ruled by the arbitrary fiat of empire-builders.
Killing products and footgun changes are standard.
That goes without saying at this point. More importantly, it’s proven worse than Azure.
On the other hand, Azure was (and still is) upfront about not having AZs everywhere - now that they have rolled them out, hopefully those are not in the same building.
Someone in my freshman college dorm decided to use one as a clothes hanger hook and broke the thermometer in there. The sprinkler damaged the entire floor with water and the floor below had spotty rain as well.
The fire department came and was mainly concerned about evacuating everyone rather than shutting the water off.
The water is typically chemically treated and has been sitting there for years as well -- very nasty stuff.
The fire department is always going to prioritize safety of life, and after all it’s not their stuff getting soaked.
They won't hesitate to smash your stuff or break down your walls either.
Being in a fire is no joke. You've got to be crazy to think that your stuff is important. It's not.
Your fingerprints won’t survive the fire!
You will not care about your stuff when you're in jail for negligent manslaughter.
Hey Dang, thanks for cleaning up the thread. One thing to note is that the title is not correct. The entire region is not currently down, as the regional impact was mitigated as of 06:39 PDT, per the support dashboard (though I think it was earlier). The impact is currently zonal (europe-west9-a), so having zone in the title as opposed to region would reflect reality closer.
Finally, there's lots of good feedback on this thread and on the previous one (https://news.ycombinator.com/item?id=35711349), so we obviously have a lot of lessons to learn.
Was there a lot of anxiety? Panic? Or was it just a “woof that sucks. Time to follow a checklist and then do a bunch of paper work” ?
What I’m curious about is what it feels like on a team at a company like Google when there is a major system failure.
Personally, I wasn't part of the actual mitigation of the overall Paris DC recovery this time, as I was busy with an unfortunate[0] side effect of the outage. Those generate more anxiety: being woken up at 6am and told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.
Finally, it's worth repeating that incident management is just a part of the SRE job, and after several years I've understood that it is not the most important one. The best SREs I know are not necessarily the ones who shine during a huge incident. But their work has avoided the other 99 outages that could have appeared on the front page of Hacker News.
Based on that thread it sounds like only AWS guarantees that their AZs are in physically separate DCs, while for Google and Microsoft AZs could be in separate buildings of the same DC facility.
Other cloud providers mostly just vaguely put things in another part of the building and call it "a separate AZ", but as GCP's woes highlight, that's corner-cutting that bites badly when the whole building has a problem.
It's incredibly rare for multiple AZs to go down at once, especially since they are more than a few miles apart from each other.
It says "reserved for future use" but other docs mentioned "physical zone separation": https://googleapis.dev/java/google-api-services-compute/alph...
Like, Target does not compete with Amazon. They have a totally different home delivery model that is not in the same category of reliability or service.
It's annoying.
But for the service team I worked for, our AZ-evacuation story wasn't great at the time and it took us tens of minutes to manually move out of the AZ - though at least there wasn't a customer-visible availability impact. Once we did, it was just monitoring and baby-sitting until we got the word to move back in, I think 1-2 days later.
If you operate on AWS you work with the assumption that an AZ is a failure domain and can die at any time. Surprisingly, many service teams at AWS still operated services (at the time) that didn't handle AZ failure that well. But if you operate services in the cloud you have to know what the failure domain is.
Ouch, hopefully none of the major services? I recently had to look into this for work (for disaster recovery preparation) and it seemed like ECS, Lambda, S3, DynamoDB and Aurora Serverless (and probably CloudWatch and IAM) all said they handled availability zone failures transparently enough.
I’m not that familiar with S3, but I never noticed any concerns with S3 during an AZ outage. I’m not at all familiar with Aurora Serverless or ECS.
For all AWS services you can always ask AWS Support pointed, specific questions about availability. They usually defer to the service team and they’ll give you their perspective.
Also keep in mind that AWS teams differentiate between the availability of their control and data planes. During an AZ outage you may struggle to create/delete resources until an AZ evacuation is completed internally, but already created resources should always meet the public SLA. That’s why especially for DR I recommend active-active or active-“pilot light”, have everything created in all AZs/regions and don’t need to create resources in your DR plan.
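The "have everything created in all AZs already" posture above can be checked mechanically. A minimal sketch, assuming you can export an inventory of which AZs each service has standing capacity in (the service and AZ names below are made up for illustration):

```python
def dr_gaps(required_azs, deployed):
    """Return, per service, the AZs where it has no standing capacity.

    `deployed` maps service name -> set of AZs with at least one
    provisioned instance. For an active-active or pilot-light posture,
    every service should already exist in every AZ, so the DR plan
    never has to *create* resources mid-incident, when the control
    plane may be struggling.
    """
    required = set(required_azs)
    return {svc: sorted(required - azs)
            for svc, azs in deployed.items()
            if required - azs}

# Illustrative inventory:
deployed = {
    "api":   {"eu-west-1a", "eu-west-1b", "eu-west-1c"},
    "queue": {"eu-west-1a"},  # pilot light never lit here
}
print(dr_gaps(["eu-west-1a", "eu-west-1b", "eu-west-1c"], deployed))
# {'queue': ['eu-west-1b', 'eu-west-1c']}
```

Running something like this in CI catches the service whose "DR plan" quietly depends on control-plane calls succeeding during the outage.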
I have to assume it's a fault that not even distributed services can paper over. E.g. lots of crucial data in flight that they're reluctant to drop. Can an expert weigh in?
I love Google’s post-mortems. This one will be epic.
Right. But nobody forces GCP's customers to design their services to be tolerant of a single DC failure. In fact as a business, actively not designing for such tolerance is an attractive cost-cutting measure.
The logical units are regions and availability zones or the equivalent nomenclature in each cloud. One availability zone is expected to be one or more datacenters.
We have thousands of instances in AWS. I do not know - or care - where they are physically located(other than the region name, say, Oregon). I expect at most one availability zone to get impacted if a datacenter goes up in flames (and sometimes, just a portion of one). I mention in another comment that AWS has had issues before and production systems barely got impacted. And recovered with zero intervention - instances with failed health checks get replaced by brand new ones in whatever AZs are still operational.
> Data would be eventually consistent (once they find and plug the hard drives in)
At the level of abstraction the cloud operates at, no one is plugging drives in – well, someone is, but you never see it.
Most cloud workloads use network attached storage - when you can even see the logical drives (SaaS offerings may not even have that abstraction). We don't know (or care) how many physical hard drives exist, or where they are. Latency requirements probably dictate that they are close to the actual instances, but there's usually data replication going on even across DCs.
In addition to that, at least in AWS, if you have saved any volume snapshots at all, they will be in S3. This data will be replicated and underlying systems can even use it to restore lost or corrupted data without you even noticing and sometimes even without a recent snapshot, as storage keeps track of what blocks have been rewritten since the last snapshot. In a particularly bad case you might have to do a restore.
In almost a decade, with a number of volumes in the six digits (no clue how many drives that is!), we never had a single volume fail on AWS. Some got into a 'degraded' state and then recovered.
We haven't had any failures on GCP either. In the case of GCP, even faulty hypervisors are transparently worked around - we never notice other than some audit logs saying the VM was moved. They even preserve the network connections. AWS requires a stop/start to do the same, but your VM will be up and running in a different hypervisor (sometimes a different datacenter) in a couple of minutes, with all the storage.
Mind you, AWS promises eleven nines(!) of durability for S3.
When you do have locally attached storage, it's treated as ephemeral and it's gone if the instance restarts.
> I have to assume it’s a fault that not even distributed services can paper over.
If a single datacenter fails, since it _should_ be at most one AZ (this case seems to be different), what happens depends on how the application is architected. Requests in flight will obviously fail; how big a deal that is depends on the problem domain. For most web apps, this will cause a retry and that's the end of the story; others will be specifically engineered to deal with receiving duplicate messages or dropping messages. For example, if you need at-most-once delivery guarantees, you need to take extra measures.
Not all applications can survive an entire region going down. Some can, but that usually raises costs if you are continuously replicating data across regions. If you do that, then you should be able to steer traffic to the surviving regions. You can do that old-school by changing DNS records, or you could have fancier solutions such as global anycast load balancers, with a single IP worldwide that still routes to the closest healthy region.
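Whether it's DNS records or an anycast load balancer doing the steering, the core decision is the same: pick the first healthy region in preference order. A minimal sketch (region names and endpoints below are invented for illustration):

```python
def pick_endpoint(regions, healthy):
    """Return the first healthy region's endpoint, in preference order.

    `regions` is an ordered list of (name, endpoint), primary first;
    `healthy` is whatever your health checks currently report.
    DNS-based failover does essentially this, just with TTL-bound
    propagation; an anycast load balancer does it at the routing layer.
    """
    for name, endpoint in regions:
        if healthy.get(name, False):
            return endpoint
    raise RuntimeError("no healthy region available")

regions = [("europe-west9", "https://api.paris.example.com"),
           ("europe-west1", "https://api.belgium.example.com")]
print(pick_endpoint(regions, {"europe-west9": False, "europe-west1": True}))
# https://api.belgium.example.com
```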
I think it's safe to assume that most people feel empathy for others struggling, whether or not they type it out regularly. Then again, some AI evangelists have had me questioning that assumption lately.
If you’ve only ever used the cloud, you’re not necessarily aware of everything that’s involved at data centers. If you’re not familiar with them, I don’t think you’d know how many things can (literally) blow up in your face. If someone sees flooding, they generally aren’t thinking that it’ll lead to fires.
Anyway, I just want to believe that everyone generally has good intentions and just doesn't know what's ACTUALLY happening in the DC, or how much work it will be for the folks working in the DC to restore services.
Hopefully all the failsafes kicked in and worked and nobody was injured.
Literally no concern here for anyone's safety or sanity in dealing with this.
Joking aside, I hope we will get a nice postmortem with juicy civil engineering details.
OVH's caught fire
What's next, us-east-1 gets hit by Godzilla?
Source: my startup (stupidly) hosted our entire infra in us-east-1 at the time. Was a …tough day
We called them bugs because you literally had to go in and get the dead bugs out of your electrical system.
Now we can call it fishing because some pirate has sailed onto your datalake and is looking for sunken hashes.
What do you think the hourly for Cloud Architect/Data lake power boy level 1 should start at?
Startlingly accurate in this case.
I think “unknown network” definitely accurately captures what hyperscalers are selling. :)
2) Not be able to use it
3) Company continues to pretend this doesn't happen on the regular
What do you want them to say? "Hey we have X breakdowns but please, pay!"
We're allowed a little humor, damnit.