Google Cloud networking issues in us-east1

(status.cloud.google.com)

586 points · decohen · 6y ago · 315 comments

mehrdadn6y ago· 20 in thread

Does anybody else feel like there have been a lot of outages in recent months? And I don't mean Google -- I mean lots of others too (I seem to recall CloudFlare, Facebook, etc.)... are they really increasing or are we just hearing more about them? Seems a bit odd.

user59944616y ago

Now that you mention it, I just realized why. The current few months are the intern season!

m0zg6y ago

That's more or less inevitable. As complexity increases (which it does naturally, if there's no effort to decrease it) at some point it begins to outstrip the limits of human understanding.

I've been saying this repeatedly (and downvoted for it repeatedly): if you want truly reliable systems, use simple, boring technology, and don't fuck with it after it's set up, and run it yourself. 99.99% of all these outages are due to screwing up something that already works, something that if it was in your own rack you could just leave alone and not touch at all.

dodobirdlord6y ago

> 99.99% of all these outages are due to screwing up something that already works

Fiber optic cables are a great technology, but they don't react well to being cut in half by a backhoe. Is the solution you are recommending that we stop using fiber optic cables, or that we stop using backhoes?

Operyl6y ago

My horribly out of date system works, therefore I should never strive to improve it or god forbid update it (since that involves “fucking with it” in ways that can break it from version to version)? That gets you technical debt and that’s not fun.

hnick6y ago

Have you seen Jonathan Blow's talk that touched on this? I enjoyed it. I think his fundamental point is that as we build on complexity, future generations lose track of the underpinnings and things start failing for unexpected reasons and we may eventually lose our capability entirely. But he does meander a lot.

I've definitely seen this where I work - the "old guard" set up the system that put the company in a prime market position, the newer people are just doing API calls and scratching their heads if it doesn't work.

Here's a reddit link because YouTube is blocked here.

https://www.reddit.com/r/programming/comments/bq1dt6/jonatha...

bamboozled6y ago

So a vulnerability is identified in a version of software you're running within your stack, and doing nothing about it means you will most likely lose important and sensitive customer information.

Do you:

1) Don't fuck with it?

2) Make a mitigating code change. Patch / fix it (fuck with it)?

archy_6y ago

Cloud should be a backup, a failover, but people build their entire business on other people's hardware because they can sell the cost per hour easier than the price of a new server which is cheaper in the long run. At this point, with so many outages showing the need for self-hosting, not allowing customers to do so shows how little you care about them.

niyazpk6y ago

As more businesses move their compute to the cloud, one might predict that more people will be impacted by outages in the large cloud providers. This in turn means that the affected people will start up-voting these threads. Expect these to be more common.

mehrdadn6y ago

I don't see how this is something that's specific to the last few months though.

fredthomsen6y ago

And unfortunately that is making the web more centralized.

kowdermeister6y ago

It's just global warming again. The weather in the clouds gets increasingly unpredictable :)

jimmaswell6y ago

I came here to say this - it's like the cloud as a whole is imploding lately.

ajhurliman6y ago

Seems like if it continues to be a problem, more multi-cloud solutions will present themselves (Terraform does that sort of thing, right?).

jcims6y ago

Terraform gives you a single management stack to a number of services and endpoints, but it doesn’t magically make your solution multi-cloud...you still need to understand the architecture you are deploying and the idiosyncrasies of each provider and the services used (not a bad thing imo).

opsunit6y ago

Terraform does not handle data locality. Since compute generally sits next to data for latency and cost reasons one should first think about how to ensure that their (perhaps considerable) data set is stored and synchronised elsewhere before worrying about which infrastructure manifest tool to use.

richardw6y ago

What do people do to mitigate DNS services going down? Is it possible to have multiple services for that? And CDNs too, as per our recent CloudFlare issues.

swozey6y ago

microclouds!

xapata6y ago

Tinfoil hat: Maybe someone practicing for an attack?

rossdavidh6y ago

It's almost as if we had made an overly complicated system with too much "efficiency" and thus not enough redundancy, centralizing on too few pieces of what used to be a quite widely dispersed system.

The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.

So, yeah, not just your imagination.

mehrdadn6y ago

> It's almost as if we had made an overly complicated system with too much "efficiency" and thus not enough redundancy, centralizing on too few pieces of what used to be a quite widely dispersed system. The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.

This is just for the last few months...?

estsauver6y ago· 14 in thread

In a moment that's likely to be very, very frustrating for a large number of you that have businesses and customers that depend on G cloud, let's try to remember that somewhere there's an engineer or an SRE having a really hard day just trying to fix things.

Please, be kind and decent to each other, especially when things are hard.

aNoob70006y ago

As someone in the infrastructure side of the house, people rarely understand all the things that go on behind the scenes to keep things running. The only time people notice you is when things go down.

I wish these guys and gals luck on getting things working.

wbl6y ago

There but for the grace of God go we.

danaur6y ago

I don't follow comments like these. Should people refrain from criticising giant companies because there are people working at them? I don't understand the purpose of this comment.

highesttide6y ago

Complaining about the communication and response time of a company is different from yelling in the direction of some stressed engineer that they are useless and incompetent at everything they do. Sadly you get too much of the latter around the Internet.

estsauver6y ago

My hope is that we can be kind and decent to other people even in moments of stress. Take two very different example comments:

"This is a frustrating outage for us, a huge part of the attraction in Google Cloud has been the premise that we get the underlying reliability of Google's infrastructure. If we'd known what the reliability of Google in practice this year would look like, we might have stayed with AWS."

and

"Why are the stupid SRE's at Google even paid such absurd numbers if they can't even go a whole month without multiple hours of downtime."

Criticizing companies is fine, just please remember there are real people there.

"Kind and Decent" doesn't seem like a high bar. If "please be kind and decent" is too much of an ask, I pray we never work together.

jpitz6y ago

The purpose of the comment, to me, is to remind folks to refrain from taking your frustration with a product or a company out on a person.

jdoliner6y ago

It's supposed to remind you that the real nines of availability are the friends you make along the way.

voldemort19686y ago

It's really quite simple. It's a reminder to be respectful to the individuals taking part in this on Google's side. Are you following now?

aNoob70006y ago

No, he's just talking about all the people running around right now trying to figure out what went wrong and how to fix it.

Criticizing Google is fine, but sometimes, the best deployments to production can go wrong.

vidar6y ago

He is asking people to be constructive in their criticism

codesushi426y ago

It makes even less sense when you take into consideration that people are paying for this service.

If you're a paying customer, you should be free to criticize as you damn well please.

StreamBright6y ago

Me neither. I haven't once seen somebody yelling at a Google engineer in the middle of an outage.

Downvoters pls link here the yelling you have seen.

bobobooey6y ago

kumbaya my lord

dang6y ago

Can you please not post unsubstantive comments to HN?

thsowers6y ago· 12 in thread

Why so many problems at Google lately? Calendar down two weeks ago[0], and Google Cloud had a larger outage a month ago[1]

[0]: https://news.ycombinator.com/item?id=20213092

[1]: https://news.ycombinator.com/item?id=20077421

tscanausa6y ago

Terrance here from Google Cloud Support.

There are only 3 things I can say about this situation. 1) These issues are currently unrelated. 2) We learn a lot from these situations. 3) A lot of these types of issues can be mitigated by running in more than 1 region.

I really can't promise that today's situations will never happen again. There are a lot of moving pieces in our system and sometimes there are things outside of Google's control.

mathattack6y ago

“You should be using more than 1 region” could also be “you should be using more than one provider”, no?

gizmo3856y ago

> There are a lot of moving pieces in our system and sometimes there are things outside of Google's control.

Are you implying that the cause of this outage is not Google's fault? If so, can you go into more details about that?

iamaelephant6y ago

Cool man let me know how I can run my Calendar in multiple regions.

Decabytes6y ago

Thanks for the reply Terrance. But isn't it more expensive to run in more than one region?

marme6y ago

How do you use multiple regions when Google only supports certain things in limited regions? Dataflow Shuffle, for example, is only available in a single region in North America: https://cloud.google.com/dataflow/docs/guides/deploying-a-pi...

zbowling6y ago

Unrelated. It's a very big company with thousands of products that don't suffer outages. Two incidents don't make a pattern.

thsowers6y ago

I would argue that two direct Google Cloud outages within a month is pretty concerning for GCP customers, and that it's possible the Calendar outage is also related in some way, since it is likely hosted on GCP, although that is speculation.

mbrumlow6y ago

So you are making a case for smaller companies run by different people in different ways? So that we don't have huge outages when common systems shared across entire platforms misbehave?

When you really care about high availability and security you really don't want all your systems run with the same software, hardware, and coded by the same teams.

What does Google (or Amazon/MSFT) do to ensure software echo chambers are not created within their infrastructure that could cause mass-scale outages by way of the same bug or bugs propagating through their systems?

GCP, AWS, and Azure are the great decentralization of the internet.

avocado46y ago

> Why so many problems at Google lately?

I recently left Google to start a startup and now everything is falling apart.

reading-at-work6y ago

Don't forget the Google Fi outage from a short while ago: https://www.theverge.com/2019/6/3/18650851/google-fi-service...

foobiekr6y ago

Regression to the mean.

boulos6y ago· 11 in thread

Disclosure: I work on Google Cloud (but I'm not in SRE, oncall, etc.).

As the updates to [1] say, we're working to resolve a networking issue. The Region isn't (and wasn't) "down", but obviously network latency spiking up for external connectivity is bad.

We are currently experiencing an issue with a subset of the fiber paths that supply the region. We're working on getting that restored. In the meantime, we've removed almost all Google.com traffic out of the Region to prefer GCP customers. That's why the latency increase is subsiding, as we're freeing up the fiber paths by shedding our traffic.

Edit: (since it came up) that also means that if you’re using GCLB and have other healthy Regions, it will rebalance to avoid this congestion/slowdown automatically. That seemed the better trade off given the reduced network capacity during this outage.

[1] https://status.cloud.google.com/incident/cloud-networking/19...
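
The rebalancing boulos describes (traffic moving to other healthy Regions when one is congested) can be illustrated with a toy selection rule. This is only a sketch under stated assumptions: the `Region` type, the `pick_region` helper, and the latency figures are hypothetical, and this is not how GCLB actually decides where to send traffic.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical view of a backend region as a load balancer might see it."""
    name: str
    healthy: bool
    latency_ms: float


def pick_region(regions):
    """Return the healthy region with the lowest observed latency."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy regions available")
    return min(healthy, key=lambda r: r.latency_ms)


# Illustrative numbers only: the congested region is marked unhealthy.
regions = [
    Region("us-east1", healthy=False, latency_ms=20.0),
    Region("us-central1", healthy=True, latency_ms=45.0),
    Region("europe-west1", healthy=True, latency_ms=110.0),
]
print(pick_region(regions).name)
```

The point of the sketch is the trade-off mentioned above: users get a slower but working path instead of a congested one.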

mrweasel6y ago

>The Region isn't (and wasn't) "down", but obviously network latency spiking up for external connectivity is bad.

As one of my old bosses said: I don't care that the site/service is technically running, if the customers can't reach it, then IT'S DOWN.

jodrellblank6y ago

Your boss picked a ridiculous time to nitpick over wording, to shout and add stress to an already difficult situation, and to give up accuracy and precise understanding at a time when those are most important.

ricardobeat6y ago

Tangential question: does Google allow employees, not directly tasked with it, to represent the company online as they wish? Most companies I know of have a strict ‘do not speak for the company’ policy.

boulos6y ago

As kyrra says below, you're in the clear if you state that this is just your opinion. Naturally, prefacing something terrible as "just your opinion" doesn't make it fine.

In my case, Cloud PR knows me, but I also knowingly risk my job (I clearly believe I have good enough judgment in what I post). If Urs and Ben think I should be fired, I'm okay with that, as it would represent a significant enough difference in opinion, that I wouldn't want to continue working here anyway.

Finally, for what it's worth, I have been reported before for "leaking internal secrets" here on HN! It turned out to be a totally hilarious discussion with the person tasked with questioning me. Still not fired, gotta try harder :).

munificent6y ago

I work at Google on an open source project and comment on it frequently.

One of the things I really like about working at Google is that they place a lot of trust in the judgement of the individual employees. I generally make it clear when I'm stating my personal opinion versus the "official" (for whatever that means given how informal the project is) one, but I don't have to carefully go through an approved list of talking points, run my HN comments by the legal department, etc.

Obviously, in certain situations, things get more official and formal. For example, when I went to Google IO to give a talk, we did have some documentation and coaching beforehand about how to handle various questions we might get about non-public stuff, other projects related to ours, etc. We are also expected to run any slides by legal before they are publicly shown in a venue with a wide audience like IO. But, even then, the legal folks I've worked with have been a pleasure to talk to.

The company's culture is basically "We hired you because you're smart. We trust you to use your brain." It would be squandering resources to not let their employees use their own intelligence and judgement.

user59944616y ago

Google employees are commenting publicly and on Hacker News all the time. If there is a policy of not speaking publicly about the company, this has been the most blatantly ignored policy ever.

kyrra6y ago

It's a fine line. We are not allowed to represent Google in any kind of public discussion. But we can talk about some things we do, as long as we state it's our own opinion and we don't represent Google's views.

jauer6y ago

It's probably less "as they wish" and more "here's an approved statement" or "your role involves engaging with external parties, here are some guidelines"

edwintorok6y ago

You seem to have 3 status messages on the dashboard at 14:31, 14:44 and 14:48 with exactly the same contents. Were those messages really posted 3 times, or did something go wrong and they got duplicated?

FrankPetrilli6y ago

We're aware this happened - that posting is the responsibility of an adjacent team to my own, specifically the person right next to me. :)

walshemj6y ago

Sounds like backhoe fade (from the write-up) and it sounds like multiple cables sharing the same physical route got taken out.

harshreality6y ago· 8 in thread

Hacker News: The real status page and help desk for the internet.

Do companies realize how absurd this is?

ETA: It seems someone at Google had a change of heart, and most of what boulos posted in this thread has been added as updates to the official google status page. Better late than never, I guess, especially if this is the start of a trend in outage reporting.

boulos6y ago

The outage information is fairly reasonable. Not everyone cares (nor should they!) about the why, only about what the situation is and that people are on it. This is extra detail.

I mostly responded because there was confusion downthread (and in the title) about being “down”. During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done.

hajhatten6y ago

This reminds me of an incident in Sweden a couple of years ago.

We test our disaster alarms on a known schedule. And just a couple of years ago, during the peak vacation time in the summer, the alarm went off, off schedule.

This made the entire country panic. Were we being attacked? The agency that is supposed to let people know through public channels like TV, radio, etc. was silent. They were probably on vacation themselves. The websites and apps they've set up were ridiculously underpowered and were basically DDoS'ed by the spike in traffic they were getting.

News outlets were also struggling, but did way better.

The only things that withstood the sudden burst in traffic without a hitch were Facebook and Twitter.

The official statement, I think, was that the alarm was triggered by accident (never happened before, I think). But it goes to show how badly our emergency response is set up.

issati6y ago

It goes to show how badly it is set up for a false alarm. In a real emergency all the primary functions would go up (taking over radio broadcasts for example) so there wouldn't be the same problem. It is still bad of course because of the "cry wolf" factor.

notatoad6y ago

seriously, they've got a text field on the official status page, why not put the text boulos posted here in that instead of the meaningless text they've got there?

david-cako6y ago

I work for AWS. There is typically a balance that has to be struck when sharing information with customers. I would imagine this goes for most companies, which is why it isn't until a post-mortem that the messaging is fully refined.

boulos6y ago

Can you expand on why you find it “meaningless”? As my other comment says, I’m not in SRE and the real people fixing it are trying their best to remediate the problem. I agree that the text I posted (with blessing from SRE!) gives you some more detail, but you can’t do anything differently with it, right? What about the new text do you prefer? (We’re happy to improve!)

NikolaeVarius6y ago

You seem to have forgotten twitter

benburleson6y ago

We can dream.

username4446y ago· 8 in thread

Cloudflare was returning a 502 this morning, wonder if they're related. Lots and lots of sites down for about an hour, including all of Shopify.

boulos6y ago

As jgrahamc (Cloudflare CTO) noted below, these aren't related. They had a push that they rolled back, we lost some fiber links.

jhgg6y ago

Cloudflare took us down this morning, but also shielded us from the impact of this fiber cut, due to direct peering with google (I’m assuming over different fiber paths.)

benbristow6y ago

I highly doubt Google are using CloudFlare networks. Must be just a coincidence.

totaldude876y ago

or CloudFlare using GCP :)

mirceal6y ago

nope. cloudflare had a bad push / deployment.

naniwaduni6y ago

"bad push / deployment" seems like it covers 108% of breakage.

autoexec6y ago

It's good we've built this massive decentralized network to withstand even major nuclear attacks only to have massive parts of it fail because we've put so much in a few centralized and fallible hands.

jgrahamc6y ago

Not related

hnaccy6y ago· 7 in thread

What's the actual number of 9s for the major cloud services these days?

My impression from their PR seems to mismatch the number of outages and issues lately.

Johnny5556y ago

AWS EC2 promises 4 9's (4.3 minutes of downtime/month) before their SLA kicks in, but they only give a 10% discount until availability dips below 99% (7.2 hours of downtime/month) when they give a 30% discount. If availability is below 95% (36 hours) in a month, they give a full refund.

For an individual instance, they only promise 90% availability.
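
The downtime figures in this comment follow from simple arithmetic, sketched below. The `downtime_minutes` helper is a hypothetical name, and the 30-day month is an assumption; it says nothing about how AWS actually measures availability for SLA purposes.

```python
def downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a month of `days` days at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Thresholds mentioned in the comment above.
for pct in (99.99, 99.0, 95.0):
    minutes = downtime_minutes(pct)
    print(f"{pct}% -> {minutes:.1f} min ({minutes / 60:.1f} h)")
```

For a 30-day month this works out to roughly 4.3 minutes at 99.99%, 7.2 hours at 99%, and 36 hours at 95%; a 31-day month gives about 7.4 hours at 99%.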

user59944616y ago

Availability of what? I've noticed entire afternoons where it wasn't possible to provision instances of some types, when I was working with AWS daily.

itslennysfault6y ago

What a craptacular SLA.

LoSboccacc6y ago

I don't think that thinking about cloud computing in terms of nines is the correct way to frame the issue. cloud storage, maybe, but cloud gives you the cheap bricks you need to build the nines for your customers using replication, independent zones and clever routing.

now I wouldn't go as far as to say outages are the normal state of things, but cloud trades high nines for cheap redundancy. a raid for compute, if you will, and as such a single zone deployment is going to have outages.

(and then there's soft layer which has multiple unplanned sev1 per week)
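
The "cheap bricks" argument can be made concrete with the standard formula for redundant copies: if each copy is up with probability a and failures are independent, at least one of n copies is up with probability 1 - (1 - a)^n. The sketch below uses a hypothetical helper name, and the independence assumption is a strong one: shared fiber paths, as in this outage, break it.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that is up when at least one of `replicas`
    copies is up, assuming the copies fail independently."""
    return 1 - (1 - single) ** replicas


# Two 99%-available zones give roughly four nines, if they really are independent.
for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```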

zzzcpan6y ago

I don't think there is any mismatch, it was always three nines. I guess the only mismatch is in claims that three nines is enough to not be noticeable or annoying to people.

nabla96y ago

I think Google has 99.5% in most of their services.

Due to network failures etc., the number visible to customers is unlikely to be much higher than that in cloud services.

StreamBright6y ago

It depends on which metric you are talking about. There are roughly 10 metrics you might be thinking of.

mrmattyboy6y ago· 6 in thread

To whomever commented something like 'laughs in AWS' (comment was removed before I submitted the comment)...

please don't...

glass house and all that... but I also share the same glass house as you.. I don't want bad luck

... and it's only a fluke that this happened to google in us-east1 and not AWS in X region and then you (and I) would be having a time of hell! :/

outworlder6y ago

Google seems to be more forthcoming with their issues. We have seen incidents in AWS where the status never got updated, but support confirmed issues.

deanCommie6y ago

Show me a GCP post-mortem that's as detailed and proactive about future improvement as https://status.aws.amazon.com/s3-20080720.html

Their last one was laughable in its lack of self-awareness.

dymk6y ago

And did we forget about the insane AWS east outages of two years ago?

fjp6y ago

During that AWS outage I was training people on [enterprise software] as part of the certification portion of [enterprise software company annual conference].

Nobody really wanted to be [enterprise software]-certified, but it was a way to get their employers to pay for them to go to the conference with cool talks and perks and such.

We delayed the training most of the day, and couldn't say it was AWS' fault because they were sitting in the audience, waiting to get certified.

People were about to riot, that was not a fun day.

mirceal6y ago

i don't quite follow your logic. something about glass houses and bad luck?

the whole point when something like this happens is for you to ensure that a region going down will not impact you - not to laugh at people that use another cloud or to assume that X is better than Y. That being said, there have been several Google related failures lately that don't help building confidence in the GCP offering - if you're just starting in the cloud space this may actually impact the choices you make when you pick your cloud provider.

mrmattyboy6y ago

My point was that, there was a comment from someone saying 'laughing from aws' and I was trying to point out that each service is most likely (or should be considered to be) as fragile (in the relative sense) as each other. So just because google have gone down several times, doesn't mean that AWS won't have a line of outages next. Really, their services are much of a black hole to us.. we can't see _how_ they deploy their changes, what kind of reviewing they do etc. etc. Even down to how cleverly they have _actually_ architected their DCs.

So my point was to _not_ laugh at those at google (or those using their services), because AWS might be next.

The whole 'I share the same glass house', was a sort of karma thing.. if someone who uses AWS is laughing at Google. If karma came round and took out AWS, not only would it affect the guy laughing at google, but I'd be the one affected as well as a multitude of other people... and the tables could be easily turned

inlined6y ago· 5 in thread

Holy crap. It’s an outage in all zones? What’s the point of AZs if you lose whole DCs at a time.

dragonwriter6y ago

> What’s the point of AZs if you lose whole DCs at a time.

The point is that AZs are higher level than DCs, so that they provide pretty decent independence guarantees (though you can further derisk with multi-region.)

Well, in AWS. Google's zones have weaker independence assurances (actually, as I read it, no assurances), stating only that a zone “usually has power, cooling, networking, and control planes that are isolated from other zones” [0] as opposed to AWS’s “Availability Zones are physically separated within a typical metropolitan region” and “In addition to discrete uninterruptable power supply (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. Availability Zones are all redundantly connected to multiple tier-1 transit providers.” [1]

[0] https://cloud.google.com/compute/docs/regions-zones/

[1] https://docs.aws.amazon.com/whitepapers/latest/aws-overview/...

klodolph6y ago

Availability is hierarchical.

neonate6y ago

Can you explain that more?

MrStonedOne6y ago

Operational Consistency creates a hidden single point of failure

dekhn6y ago

regions are the point. this is known as a "meteor outage".

dx876y ago· 5 in thread

Kind of related to this, but these types of outages are why I moved from Google Play to Spotify for streaming music. Their infrastructure seems so large that things that should be a standalone service, like streaming music, are bound to be collateral damage when they mess something up on another service. Having everything provided by one company is convenient until it all goes down at the same time and you can't access your email, videos, or music because they all run on the same infrastructure.

tomschlick6y ago

Spotify is hosted on google cloud: https://www.wired.com/2016/02/spotify-moves-itself-onto-goog...

crusader766y ago

I think the point OP was trying to make was relating to google services and their dependencies on each other.

cbhl6y ago

In my personal opinion, you should move off of Google Play Music, but not because of the dependency on Google infrastructure.

https://9to5google.com/2018/05/23/google-play-youtube-music-...

https://www.digitaltrends.com/music/what-happens-to-google-p...

mav3rick6y ago

Spotify is on Google Cloud.

Thaxll6y ago

I've been a heavy user of Google Play for the last 6 years, and I've never had a single outage since I started using the service (multiple hours a day, 5 days a week).

Thaxll6y ago· 4 in thread

Looks like an external issue. "The Cloud Networking service (Standard Tier) has lost multiple independent fiber links within us-east1 zone. Vendor has been notified and are currently investigating the issue."

imroot6y ago

It's not independent fiber links if they use the same tube to get into the building...just ask any backhoe operator.
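
The shared-conduit problem can be checked mechanically if you know which physical segments each link traverses. The sketch below is illustrative only: the link and conduit names are made up, and `shared_fate_groups` is a hypothetical helper, not a tool any provider is known to use.

```python
from collections import defaultdict


def shared_fate_groups(links):
    """Map each conduit carrying more than one link to the links that share it.

    `links` maps a link name to the list of physical conduit/segment ids it
    runs through (hypothetical inventory data).
    """
    by_conduit = defaultdict(set)
    for link, conduits in links.items():
        for conduit in conduits:
            by_conduit[conduit].add(link)
    return {c: sorted(ls) for c, ls in by_conduit.items() if len(ls) > 1}


# Made-up inventory: link-a and link-b look independent but share conduit-2.
links = {
    "link-a": ["conduit-1", "conduit-2"],
    "link-b": ["conduit-2", "conduit-3"],
    "link-c": ["conduit-4"],
}
print(shared_fate_groups(links))
```

Any conduit that shows up in the result is a single point of failure for every link listed against it, which is exactly what a backhoe finds.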

fredthomsen6y ago

my brother-in-law's construction company actually did just that. ground wasn't properly marked and the fiber got cut, multiple links

foobiekr6y ago

It’s surprisingly hard to avoid shared fate links and it’s one of the things I would have thought google would be expert at.

vinay_ys6y ago

It's not that hard. In India, because so much construction-related digging cuts OFCs, we do the path planning quite well and our redundancies get tested quite regularly whether we want them to or not.

z3t46y ago· 4 in thread

When choosing a big cloud provider people forget that it's many orders of magnitude more complicated to run something at Google scale than to maintain one single server. For example the whole Stack Overflow website runs on one or two servers. World of Warcraft also used to run on one single (blade) server. Chances are one server will be good enough for most use cases. And if you don't want to have it in your closet there are plenty of dedicated hosting and colocations.

cthalupa6y ago

>World of Warcraft also used to run on one single (blade) server.

This is... kind of true, but not really. For a single realm, general game interactions in the open world might have been hosted on a single blade, but there are a lot of support systems that do as much work, or more, that were not. The databases with all of the character information, login servers, instance servers, etc. etc. etc.

But even if you look at just the game server portion, there was a blade for every realm - you can't say World of Warcraft as a monolithic entity ran on a single blade server.

(I'm also not sure if the general game servers for a realm were only on one blade - my understanding is that each "continent" was its own blade - Kalimdor, Eastern Kingdoms, Northrend, Outland, etc.)

edwintorok6y ago

With a cloud it also means that when there is an outage there are potentially many sites/services affected all at once, and there is potentially nothing customers can do to fix it other than wait (or plan in advance, and use/pay for multi-AZ/multi-region/multi-provider redundancy). Such outages are also possible with traditional hosting providers, and when an outage does happen I'm not convinced whether a large public cloud would recover more quickly (due to better resourcing/expertise available to fix the problem), or a small hosting provider (which may have a smaller team, but the problems they deal with are at a smaller scale and more easily fixable). Either way you probably want some kind of CDN independent of your cloud/hosting provider that can help survive some of these glitches.

avocado46y ago

How can Stack Overflow run on a single server? Do you mean single cluster?

davedunkin6y ago

As of 2016, Stack Overflow ran on dozens of servers in two data centers.

https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...

fastest9636y ago· 3 in thread

Here's the original issue: https://status.cloud.google.com/incident/cloud-networking/19...

Not sure why they closed that one at 9:12 just to open a new one at 10:25. We didn't see any traffic coming to us-east1 during that time period so I would assume the original issue is still the root cause.

boulos6y ago

Yeah, that happens sometimes based on which team notices, thinks it might be different and then opens an outage.

Sorry for the confusion, and yes, the fiber link issue is the root cause. Draining the Google.com traffic presumably resolved the issue for you, though you may still be seeing elevated latency as the updates suggest.

fastest9636y ago

Since we use GCP Global LBs I presume that "draining the Google.com traffic" also meant that you're diverting all global LB traffic, which is what we see. The second incident (the OP's link) indicates that but at first it was very confusing to a customer when the first issue was marked as resolved but we still saw no traffic being sent to us-east1 via our global LBs. If that makes sense.

joshuamorton6y ago

Hopefully the thread title can be updated. (If it were actually down, this thread would have been posted 3 hours ago and have 400+ comments).

digitalsanctum6y ago· 3 in thread

I routinely see notices of outages like this posted on HN while HN itself never seems to be impacted. This begs the question: Where and how is HN hosted in a way that avoids being impacted by widespread network and provider outages?

hunter2_6y ago

  $ host news.ycombinator.com
  news.ycombinator.com has address 209.216.230.240

https://whois.arin.net/rest/net/NET-209-216-230-0-1/pft?s=20...

M5 Computer Security

https://www.m5hosting.com

Unrelated: https://begthequestion.info/

MisterPea6y ago

The begs the question site is one of my pet peeves. Language is not moderated by a select few who want to claim it, this isn't France.

This is why Ebonics is still a valid form of English - as long as it is used consistently.

If everyone uses "begs the question" and everyone else understands it as "raises the question" then it is perfectly valid.

jsjohnst6y ago

I’ve seen a few outages that impacted HN from a provider standpoint. Good example is the CenturyLink outage a few weeks back. CenturyLink isn’t my ISP and neither is it directly HN’s either, but my route to HN was impacted by the outage.

partiallypro6y ago· 2 in thread

It's been down for 4 hours and it's just now being posted on HN? Is it intermittent?

boulos6y ago

Disclosure: I work on Google Cloud.

There were (and continue to be) connectivity issues due to a subset of the fiber links having trouble. But that’s different from being “down”, it’s “just” an outage. We won’t declare the outage over until the impact is minimal.

larkeith6y ago

From another comment, original issue [1] was closed at 9:12, so looks like they got it back up for a bit over an hour before it went down again. Post-mortem will be interesting.

[1] https://status.cloud.google.com/incident/cloud-networking/19...

bob332126y ago· 2 in thread

Down or just high latency? For some folks that is the same thing.

crankylinuxuser6y ago

is a 4 hr latency, "latency"?

You make a good point though. Downtime seems to be awfully overloaded.

geogram6y ago

On our tests the latency is surprisingly low (20-40ms) but it has an error rate of 10-30%.
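To illustrate why low latency and a high error rate can coexist, here is a rough Python sketch that summarizes probe results. The `Probe` type and the sample numbers are made up for illustration; they are not the measurements described above.

```python
from statistics import median
from typing import List, NamedTuple, Optional


class Probe(NamedTuple):
    latency_ms: Optional[float]  # None when the request failed outright
    ok: bool


def summarize(probes: List[Probe]) -> dict:
    """Report the error rate and the median latency of successful probes only."""
    if not probes:
        raise ValueError("need at least one probe")
    successes = [p.latency_ms for p in probes if p.ok and p.latency_ms is not None]
    failures = len(probes) - len(successes)
    return {
        "error_rate": failures / len(probes),
        "median_latency_ms": median(successes) if successes else None,
    }


# Made-up sample: 8 fast successes and 2 outright failures.
sample = [Probe(30.0, True)] * 8 + [Probe(None, False)] * 2
summary = summarize(sample)
print(summary)
```

Because failed requests contribute no latency sample, the latency figure describes only the requests that got through, which is why it can look healthy during an outage.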

noncoml6y ago· 2 in thread

Bad config push again?

gaogao6y ago

Running a betting pool on cloud service outage root causes would be fairly fun.

I'm going to guess load balancer cascading failures.

notriddle6y ago

Nope. Physical destruction of fiber-optic cables is to blame, according to the GC status page. https://status.cloud.google.com/incident/cloud-networking/19...

geogram6y ago· 2 in thread

Longer than 4 hours. We have Stackdriver set up to monitor uptime/latency and it's been acting up since 2am PST.

tlynchpin6y ago

ObPedant: notice in google's status page "...as of Tuesday, 2019-07-02 09:11 US/Pacific." This notation is useful because it's stable year round. I don't recommend 'PDT', instead colloquially 'out here on the left coast' or specifically US/Pacific.

geogram6y ago

Thanks. Good point. Regardless, gcloud has been having issues for nearly 12 hours. (timezone agnostic)

codingslave6y ago· 2 in thread

Google engineers ran into a coding problem that wasn't on leetcode

dang6y ago

Please don't post unsubstantive comments here.

sieabahlpark6y ago

Now isn't that the truth

lgats6y ago· 1 in thread

Pretty sure I've read before that us-east1 is one of the older Google data centers presumably with older equipment

boulos6y ago

Disclosure: I work on Google Cloud.

I think you’re thinking of AWS’s us-east-1 in Virginia. I don’t recall when us-east1 for us was constructed, but this wasn’t any sort of “old equipment” issue. Even there, while your experience may vary, AWS certainly has both old and new equipment.

pupdogg6y ago

> The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles serving network paths in us-east1.

I am assuming some sort of construction zone at or nearby the facility and the backhoe operator dug in and accidentally cut the cables?

verdverm6y ago

I've been working out of us-east1 all day and haven't noticed

pkaye6y ago

Looks like all that high end engineering talent and processes still have their limits.

dragonwriter6y ago

> The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles

Is this concurrent damage to separated bundles or damage to colocated bundles?

saltminer6y ago

The title says "almost 4 hours" (was posted at around 3 PM EST), but the incident was created at 10:25 AM PST, which is 1:25 PM EST. Has it been more like 2 hours or is there more to this incident?
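The arithmetic above can be checked with a short Python sketch. It assumes the date from the status page (2019-07-02), fixed daylight-time offsets (UTC-7 Pacific, UTC-4 Eastern, since July is daylight time), and takes "around 3 PM" as exactly 15:00 Eastern, which is only an approximation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed summer offsets for 2019-07-02 (daylight time in effect).
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4), "EDT")

# Incident creation time as given on the status page, in Pacific time.
incident_created = datetime(2019, 7, 2, 10, 25, tzinfo=PACIFIC_DAYLIGHT)
# Assumption: the thread was posted at exactly 15:00 Eastern.
thread_posted = datetime(2019, 7, 2, 15, 0, tzinfo=EASTERN_DAYLIGHT)

created_eastern = incident_created.astimezone(EASTERN_DAYLIGHT)
elapsed = thread_posted - incident_created

print(created_eastern.strftime("%H:%M %Z"))  # 13:25 EDT
print(elapsed)  # 1:35:00
```

Under those assumptions the second incident was about an hour and a half old when the thread appeared; the "almost 4 hours" figure only fits if the earlier incident is counted as part of the same outage.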

wwwpppddd6y ago

App Engine and Cloud functions were apparently returning error rates of > 30 percent overall between 11 a.m. and 3 p.m., with some projects experiencing a 100 percent error rate. GCS was also experiencing issues for the first half, which was attributed to the networking issues. Google said the networking issues were resolved initially but then stated they were investigating the GAE issues. Those issues were resolved, and the networking issue has been reopened as of 2:35 eastern: https://status.cloud.google.com/incident/cloud-networking/19....

GAE and all other services still show green here, of course: https://status.cloud.google.com/

mountainofdeath6y ago

Another day, another Google outage. It feels like it's once a month this year

rco87866y ago

2019 has been a really rough year for GCP

garyb26y ago

Notice they did not get around to posting the next status update on time.

awinter-py6y ago

do they not have extra hands on staff to dedup the messages? what's with the identical messages at 14:31, :44, :48? This happened last time too.

spullara6y ago

Wow. GCP is always a networking issue. Their QA on networking changes needs work. Maybe they should spend 20% on it.
