I've been saying this repeatedly (and downvoted for it repeatedly): if you want truly reliable systems, use simple, boring technology, and don't fuck with it after it's set up, and run it yourself. 99.99% of all these outages are due to screwing up something that already works, something that if it was in your own rack you could just leave alone and not touch at all.
Fiber optic cables are a great technology, but they don't react well to being cut in half by a backhoe. Is the solution you are recommending that we stop using fiber optic cables, or that we stop using backhoes?
I've definitely seen this where I work - the "old guard" set up the system that put the company in a prime market position, while the newer people are just doing API calls and scratching their heads if it doesn't work.
Here's a reddit link because YouTube is blocked here.
https://www.reddit.com/r/programming/comments/bq1dt6/jonatha...
Do you:
1) Don't fuck with it?
2) Make a mitigating code change. Patch / fix it (fuck with it)?
The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.
So, yeah, not just your imagination.
This is just for the last few months...?
Please, be kind and decent to each other, especially when things are hard.
I wish these guys and gals luck on getting things working.
"This is a frustrating outage for us, a huge part of the attraction in Google Cloud has been the premise that we get the underlying reliability of Google's infrastructure. If we'd known what the reliability of Google in practice this year would look like, we might have stayed with AWS."
and
"Why are the stupid SRE's at Google even paid such absurd numbers if they can't even go a whole month without multiple hours of downtime."
Criticizing companies is fine, just please remember there are real people there.
"Kind and Decent" doesn't seem like a high bar. If "please be kind and decent" is too much of an ask, I pray we never work together.
Criticizing Google is fine, but sometimes, the best deployments to production can go wrong.
If you're a paying customer, you should be free to criticize as you damn well please.
Downvoters pls link here the yelling you have seen.
There are only 3 things I can say about this situation. 1) These issues are currently unrelated. 2) We learn a lot from these situations. 3) A lot of these types of issues can be mitigated by running in more than 1 region.
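To make the "run in more than 1 region" point concrete, here is a minimal client-side sketch, not anything Google ships. The region names and the is_healthy probe are hypothetical placeholders; a real probe would hit your own health-check endpoint in each region.

```python
from typing import Callable, Iterable, Optional


def pick_region(regions: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region whose probe succeeds, or None if none do.

    `regions` is an ordered preference list (hypothetical names).
    `is_healthy` is a caller-supplied probe; it is an assumption of this
    sketch, not a real cloud API.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that blows up (timeout, DNS error, ...) counts as unhealthy.
            continue
    return None
```

With a probe that reports us-east1 as unhealthy, pick_region(["us-east1", "us-central1"], probe) falls through to us-central1. This only helps if the data and capacity actually exist in the second region, which is the expensive part.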
I really can't promise that today's situations will never happen again. There are a lot of moving pieces in our system and sometimes there are things outside of Google's control.
Are you implying that the cause of this outage is not Google's fault? If so, can you go into more details about that?
When you really care about high availability and security you really don't want all your systems run with the same software, hardware, and coded by the same teams.
What does Google (or Amazon/MSFT) do to ensure software echo chambers are not created within their infrastructure, ones that could cause mass-scale outages by way of the same bug or bugs propagating through their systems?
GCP, AWS, and Azure are the great decentralization of the internet.
I recently left Google to start a startup and now everything is falling apart.
As the updates to [1] say, we're working to resolve a networking issue. The Region isn't (and wasn't) "down", but obviously network latency spiking up for external connectivity is bad.
We are currently experiencing an issue with a subset of the fiber paths that supply the region. We're working on getting that restored. In the meantime, we've removed almost all Google.com traffic out of the Region to prefer GCP customers. That's why the latency increase is subsiding, as we're freeing up the fiber paths by shedding our traffic.
Edit: (since it came up) that also means that if you’re using GCLB and have other healthy Regions, it will rebalance to avoid this congestion/slowdown automatically. That seemed the better trade off given the reduced network capacity during this outage.
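For anyone wondering what "rebalance to healthy Regions" means in practice, here is a toy illustration. It is not how GCLB is implemented; the capacity numbers and the proportional-split rule are assumptions made up for the example.

```python
from typing import Dict, Set


def rebalance(capacity: Dict[str, float], healthy: Set[str]) -> Dict[str, float]:
    """Split traffic across healthy regions in proportion to their capacity.

    `capacity` maps a region name to an arbitrary capacity figure
    (hypothetical units). Regions not in `healthy` get no traffic.
    """
    usable = {r: c for r, c in capacity.items() if r in healthy and c > 0}
    total = sum(usable.values())
    if total == 0:
        raise RuntimeError("no healthy region with capacity left")
    # Each healthy region's share of traffic is its share of usable capacity.
    return {r: c / total for r, c in usable.items()}
```

With capacities of 100, 100 and 200 for us-east1, us-central1 and europe-west1, marking us-east1 unhealthy moves its share onto the other two at a 1:2 ratio. The trade-off is that users near the drained region see higher latency instead of errors.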
[1] https://status.cloud.google.com/incident/cloud-networking/19...
As one of my old bosses said: I don't care that the site/service is technically running, if the customers can't reach it, then IT'S DOWN.
In my case, Cloud PR knows me, but I also knowingly risk my job (I clearly believe I have good enough judgment in what I post). If Urs and Ben think I should be fired, I'm okay with that, as it would represent a significant enough difference in opinion, that I wouldn't want to continue working here anyway.
Finally, for what it's worth, I have been reported before for "leaking internal secrets" here on HN! It turned out to be a totally hilarious discussion with the person tasked with questioning me. Still not fired, gotta try harder :).
One of the things I really like about working at Google is that they place a lot of trust in the judgement of the individual employees. I generally make it clear when I'm stating my personal opinion versus the "official" (for whatever that means given how informal the project is) one, but I don't have to carefully go through an approved list of talking points, run my HN comments by the legal department, etc.
Obviously, in certain situations, things get more official and formal. For example, when I went to Google IO to give a talk, we did have some documentation and coaching beforehand about how to handle various questions we might get about non-public stuff, other projects related to ours, etc. We are also expected to run any slides by legal before being publicly shown in a venue with a wide audience like IO. But, even then, the legal folks I've worked with have been a pleasure to talk to.
The company's culture is basically "We hired you because you're smart. We trust you to use your brain." It would be squandering resources to not let their employees use their own intelligence and judgement.
Do companies realize how absurd this is?
ETA: It seems someone at Google had a change of heart, and most of what boulos posted in this thread has been added as updates to the official google status page. Better late than never, I guess, especially if this is the start of a trend in outage reporting.
I mostly responded because there was confusion downthread (and in the title) about being “down”. During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done.
We test our disaster alarms on a known schedule. And just a couple of years ago, during the peak vacation time in the summer, the alarm went off, off schedule.
This made the entire country panic. Were we being attacked? The agency that is supposed to let people know through public channels like TV, radio, etc. was silent. They were probably on vacation themselves. The websites and apps they'd set up were ridiculously underpowered and were basically DDoS'ed by the spike in traffic they were getting.
News outlets were also struggling, but did way better.
The only things that withstood the sudden burst in traffic without a hitch were Facebook and Twitter.
The official statement, I think, was that the alarm was triggered by accident (never happened before, I think). But it goes to show how badly our emergency response is set up.
My impression from their PR seems to mismatch the number of outages and issues lately.
For an individual instance, they only promise 90% availability.
Now, I wouldn't go as far as to say outages are the normal state of things, but cloud trades high nines for cheap redundancy: a RAID for compute, if you will. As such, a single-zone deployment is going to have outages.
(And then there's SoftLayer, which has multiple unplanned sev1s per week.)
Due to network failures etc., the number visible to customers is unlikely to be much higher than that in cloud services.
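The "RAID for compute" arithmetic can be sketched roughly like this. The 90% figure is the one quoted above; the big assumption is that replicas fail independently, which a shared fiber path or a shared bug breaks immediately.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(single: float, replicas: int) -> float:
    """Availability of a group that is up if at least one replica is up.

    Assumes independent failures, which is an idealization: correlated
    failures make the real number lower.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per (non-leap) year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        a = combined_availability(0.90, n)
        print(f"{n} replica(s): {a:.4%} up, ~{downtime_hours_per_year(a):.1f} h/year down")
```

Under those assumptions each extra 90% replica adds one nine: two give 99%, three give 99.9%. That is the trade being described: cheap unreliable units, stacked.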
please don't...
glass house and all that... but I also share the same glass house as you.. I don't want bad luck
... and it's only a fluke that this happened to Google in us-east1 and not AWS in X region, and then you (and I) would be having a time of hell! :/
Their last one was laughable in its lack of self-awareness.
Nobody really wanted to be [enterprise software]-certified, but it was a way to get their employers to pay for them to go to the conference with cool talks and perks and such.
We delayed the training most of the day, and couldn't say it was AWS' fault because they were sitting in the audience, waiting to get certified.
People were about to riot, that was not a fun day.
The whole point when something like this happens is for you to ensure that a region going down will not impact you, not to laugh at people who use another cloud or to assume that X is better than Y. That being said, there have been several Google-related failures lately that don't help build confidence in the GCP offering. If you're just starting in the cloud space, this may actually impact the choices you make when you pick your cloud provider.
So my point was to _not_ laugh at those at Google (or those using their services), because AWS might be next.
The whole 'I share the same glass house' was a sort of karma thing. If someone who uses AWS is laughing at Google and karma came round and took out AWS, it would affect not only the guy laughing at Google but also me and a multitude of other people. The tables could easily be turned.
The point is that AZs are higher level than DCs, so that they provide pretty decent independence guarantees (though you can further derisk with multi-region.)
Well, in AWS. Google's zones have weaker independence assurances (actually, as I read it, no assurances), stating only that a zone “usually has power, cooling, networking, and control planes that are isolated from other zones” [0] as opposed to AWS’s “Availability Zones are physically separated within a typical metropolitan region” and “In addition to discrete uninterruptable power supply (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. Availability Zones are all redundantly connected to multiple tier-1 transit providers.” [1]
[0] https://cloud.google.com/compute/docs/regions-zones/
[1] https://docs.aws.amazon.com/whitepapers/latest/aws-overview/...
https://9to5google.com/2018/05/23/google-play-youtube-music-...
https://www.digitaltrends.com/music/what-happens-to-google-p...
This is... kind of true, but not really. For a single realm, general game interactions in the open world might have been hosted on a single blade, but there are a lot of support systems that do as much work, or more, that were not. The databases with all of the character information, login servers, instance servers, etc. etc. etc.
But even if you look at just the game server portion, there was a blade for every realm - you can't say World of Warcraft as a monolithic entity ran on a single blade server.
(I'm also not sure if the general game servers for a realm were only on one blade - my understanding is that each "continent" was its own blade - Kalimdor, Eastern Kingdoms, Northrend, Outland, etc.)
https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...
Not sure why they closed that one at 9:12 just to open a new one at 10:25. We didn't see any traffic coming to us-east1 during that time period so I would assume the original issue is still the root cause.
Sorry for the confusion, and yes, the fiber link issue is the root cause. Draining the Google.com traffic presumably resolved the issue for you, though you may still be seeing elevated latency as the updates suggest.
$ host news.ycombinator.com
news.ycombinator.com has address 209.216.230.240
https://whois.arin.net/rest/net/NET-209-216-230-0-1/pft?s=20...
M5 Computer Security
Unrelated: https://begthequestion.info/
This is why Ebonics is still a valid form of English - as long as it is used consistently.
If everyone uses "begs the question" and everyone else understands it as "raises the question" then it is perfectly valid.
There were (and continue to be) connectivity issues due to a subset of the fiber links having trouble. But that’s different from being “down”, it’s “just” an outage. We won’t declare the outage over until the impact is minimal.
[1] https://status.cloud.google.com/incident/cloud-networking/19...
You make a good point though. Downtime seems to be awfully overloaded.
I'm going to guess load balancer cascading failures.
I think you’re thinking of AWS’s us-east-1 in Virginia. I don’t recall when us-east1 for us was constructed, but this wasn’t any sort of “old equipment” issue. Even there, while your experience may vary, AWS certainly has both old and new equipment.
I am assuming there was some sort of construction zone at or near the facility, and a backhoe operator dug in and accidentally cut the cables?
Is this concurrent damage to separated bundles or damage to colocated bundles?
GAE and all other services still show green here, of course: https://status.cloud.google.com/