0 points · mmillin · 5y ago

Given the blast radius of this (all regions appear to be impacted) along with the fact that services that don't rely on auth are working as normal, it must be a global authN/Z issue. I do not envy Google engineers right now.

39 comments · 11 top-level

tarruda5y ago· 21 in thread

> I do not envy Google engineers right now.

A few years ago I released a bug in production that prevented users from logging into our desktop app. It affected ~1k users before we found out and rolled back the release.

I still remember a very cold feeling in my belly; I could barely sleep that night. It is difficult to imagine what the people responsible for this are feeling right now.

papito5y ago

When I was interviewing at Morgan Stanley, I asked "how do you do this job if a mistake can cost people money?".

The answer was "well, if you don't do anything, you make NO money".

xenocratus5y ago

I'm reminded of the quote from Thomas J. Watson:

> Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?

cvrjk5y ago

Welp, as a new grad there, I had brought down one very important database server on a Sunday night (a series of really unfortunate events). Multiple senior DBAs had to be involved to resuscitate it. It started functioning normally just a few hours before market open in HK. If it was any later, it would have been some serious monetary loss. Needless to say, I was sweating bullets. Couldn't eat anything the entire day lol. Took me like 2 days to calm down. And this was after I was fully shielded cuz I was a junior. God knows what would've happened if someone more experienced had done that.

onychomys5y ago

I work for an extremely famous hospital in the American midwest. We're divided into three sections, one for clinical work, one for research, and one for education. I always tell people that I'm pretty content being in research (which is less sexy than clinical), because if I screw something up, some PI's project takes ten months and one week instead of ten months. In clinical, if you screw something up, somebody dies! I just don't think I could handle that level of stress.

ignoramous5y ago

Same.

At AWS, I once took an entire AZ down of a public-facing production service (with a mis-typed command), but that was nothing compared to when I accidentally deleted an entire region via internal console (too many browser tabs). Thank goodness turned out to be unused / unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both the cases).

papito5y ago

Jesus. One would think you'd have some safeguards for that. Even Dropbox will give you an alert if you try to nuke over 1,000 files. More reasons to COLOR CODE your work environments, if possible.

ardy425y ago

> At AWS, I once took an entire AZ down of a public-facing production service (with a mis-typed command), but that was nothing compared to when I accidentally deleted an entire region via internal console (too many browser tabs). Thank goodness turned out to be unused / unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both the cases).

It seems like a design flaw for actions like that to be so easy. E.g.

> Hey, we detected you want to delete an AWS region. Please have an authorized coworker enter their credentials to second your command.
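
The second-person confirmation suggested above can be sketched in a few lines. This is a minimal illustration, not a description of any real AWS tooling: `DestructiveRequest`, `confirm`, `ApprovalError`, and the `authorized` set are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Set


class ApprovalError(Exception):
    """Raised when a destructive request lacks a valid second approver."""


@dataclass(frozen=True)
class DestructiveRequest:
    action: str      # hypothetical label, e.g. "delete-region"
    requester: str   # user who initiated the action


def confirm(request: DestructiveRequest, approver: str, authorized: Set[str]) -> bool:
    """Allow the action only if a different, authorized person seconds it."""
    if approver == request.requester:
        raise ApprovalError("requester cannot approve their own request")
    if approver not in authorized:
        raise ApprovalError(f"{approver} is not authorized to approve {request.action}")
    return True


if __name__ == "__main__":
    req = DestructiveRequest(action="delete-region", requester="alice")
    print(confirm(req, approver="bob", authorized={"alice", "bob"}))
```

The point of the design is that one person with too many browser tabs open cannot complete the action alone; a real system would also need to authenticate the approver rather than trust a name.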

znpy5y ago

It reminds me of this: https://www.youtube.com/watch?v=30jNsCVLpAE -- "GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill"

ayoubElk5y ago

Irrelevant to the discussion, but I just wanted to say thank you for the categorized list of users I can follow on your profile!

robertlagrant5y ago

Who automates the automators? :)

ahachete5y ago

Doesn't AWS (and every big cloud/enterprise) follow best-practices for production operation like FIT-ACER? https://pythian.com/uncategorized/fit-acer-dba-checklist/

That's even more surprising to me.

zeven75y ago

Several years back when I was working at Google I made a mistake that caused some of the special results in the knowledge cards to become unclickable for a small subset of queries for about an hour. As part of the postmortem I had to calculate how many people likely tried to interact with it while it was broken. It was a lot and really made me realize the magnitude of an otherwise seemingly small production failure. My boss didn't give me a hard time, just pointed me toward documentation about how to write the report. And crunching the numbers is what really made me feel the weight of it. It was a good process.

I feel for the engineer who has to calculate the cost of this bug.

robotnikman5y ago

This sounds like a good practice and hopefully something they still do. Calculating the exact numbers would definitely help cement the experience and its consequences into your mind.

zymhan5y ago

Presumably there were more failures than a single engineer could've been responsible for here.

rplnt5y ago

If you alone were able to do it, then the system was designed badly. The bigger the impact, the more robust it has to be to prevent accidents.

a3_nm5y ago

The big mistake in the system is that everyone in the world is relying on Google services... These problems would have less impact with a more diverse ecosystem.

domano5y ago

I remember how one of our engineers had his docker daemon connected to production instead of his local one and casually did a docker rm -f $(docker ps -aq).

Same thing happened to me but with CI, which felt bad enough already.
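
One way to guard against this kind of mix-up is to check where the Docker CLI is pointed before running anything destructive. The sketch below is an assumption-laden illustration: `docker_target_is_local` and `LOCAL_HOSTNAMES` are hypothetical names, and the check only looks at the `DOCKER_HOST` environment variable, so it would not catch a remote daemon selected through a Docker context.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Hostnames treated as "this machine" (an assumption for this sketch).
LOCAL_HOSTNAMES = {"localhost", "127.0.0.1", "::1"}


def docker_target_is_local(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if DOCKER_HOST is unset or points at a local daemon."""
    env = os.environ if env is None else env
    host = env.get("DOCKER_HOST", "")
    if not host:
        # Unset: the CLI falls back to its default endpoint
        # (contexts are deliberately not inspected here).
        return True
    parsed = urlparse(host)
    if parsed.scheme in ("unix", "npipe"):
        # Local socket or Windows named pipe.
        return True
    # tcp:// and ssh:// endpoints count as local only for loopback names.
    return parsed.hostname in LOCAL_HOSTNAMES


if __name__ == "__main__":
    if not docker_target_is_local():
        raise SystemExit("Refusing to continue: DOCKER_HOST is not local")
    print("Docker target looks local")
```

A wrapper script could run this check before a bulk `docker rm` and stop when the target is remote.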

avidphantasm5y ago

"Hey let's make developers do two very different jobs, development, and operations. We'll call it DevOps. We'll save money. Everything will be fine."

alias_neo5y ago

No Engineer should have production access from their workstation. Period.

source: am Engineer =).

solids5y ago

Well, it's something that can happen to anyone, take it easy. When I made the transition from developer to manager and became responsible for these situations, at first every problem made me feel as you describe. Eventually, what helped free me was understanding that how we feel about a fact does not change anything about that fact.

cube005y ago

Don't be too hard on yourself, no dev works in a silo, there is usually user acceptance testing and product owner sign offs involved so they also have to wear some of this too.

mrcus5y ago· 7 in thread

Nope, especially considering the implications of this, with the amount of people working remotely. Google Meet, Classroom, etc. are down. This is probably literally costing billions every minute just in loss of productivity.

ants_a5y ago

Total world economic output is ~$150M / minute, so billions every minute is off by a few orders of magnitude.

tzs5y ago

You are assuming that a minute of disruption can not cause more than a minute's loss of productivity. I don't think that assumption is justified.

Consider an exactly one minute outage that affects multiple things I use for work.

First, I may not immediately recognize that the outage is actually with some single service provider. If several things are out I'm probably going to suspect it is something on my end, or maybe with my ISP. I might spend several minutes thoroughly checking that possibility out, before noticing that whatever it was seems to have been resolved.

Second, even if I immediately recognize it for what it is and immediately notice when it ends, it might take me several minutes to get back to where I was. Not everything is designed to automatically and transparently recover from disruptions, and so I might have had things in progress when the outage struck that will need manual cleanup and restarting.

rocho5y ago

That figure seems way too low, what are your sources on it?

yashap5y ago

Indeed. Also, Google’s revenue is about $300K per minute. The value they provide is likely higher than that, but as you said, being able to send an email an hour later than you hoped is fine in most cases. Also, Google Search was fine, and that’s their highest impact product.

I’d guess actual losses to the world economy were more on the order of about $100K per minute, or about 1/3 of Google’s revenue. MAYBE a few hundred thousand per minute, though that seems unlikely with Search being unaffected, and everything else coming back. Certainly a far cry from billions per minute :)
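
The per-minute figures quoted in this subthread can be put on a common scale with a little arithmetic. The sketch below uses only the commenters' own numbers ($150M, $300K, and $100K per minute) and treats "billions every minute" as at least $1B per minute, which is an assumption. None of the figures are verified here.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Figures as quoted in the thread (unverified).
world_output_per_min = 150e6     # "~$150M / minute"
google_revenue_per_min = 300e3   # "about $300K per minute"
guessed_loss_per_min = 100e3     # "about $100K per minute"
claimed_loss_per_min = 1e9       # assumed lower bound for "billions every minute"


def annualized(per_minute: float) -> float:
    """Convert a per-minute rate into a per-year total."""
    return per_minute * MINUTES_PER_YEAR


def orders_of_magnitude(a: float, b: float) -> float:
    """How many powers of ten separate a from b."""
    return math.log10(a / b)


if __name__ == "__main__":
    print(f"World output implied per year:   ${annualized(world_output_per_min) / 1e12:.2f}T")
    print(f"Google revenue implied per year: ${annualized(google_revenue_per_min) / 1e9:.2f}B")
    print(f"Claim vs. world output: {claimed_loss_per_min / world_output_per_min:.1f}x")
    print(f"Claim vs. guessed loss: {orders_of_magnitude(claimed_loss_per_min, guessed_loss_per_min):.0f} orders of magnitude")
```

On these numbers, $1B per minute would be roughly 6.7 times the quoted world output and four orders of magnitude above the $100K guess.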

demosito6665y ago

I never understood this type of calculation, as it implies that time is directly converted into money. However, I struggle to come up with an example of this. Even the most trivial labor cases, like producing paperclips, don't seem to convert time directly into profit: even if you make 10k units instead of 100k this hour, you don't sell them immediately. They bring revenue to the firm via a long chain of convoluted contracts (both legal and "transactional") which are very loosely coupled to the immediate output.

Nothing is operating at minute margins unless it's explicitly priced on a minutely basis, like a cloud service. Even if a worker on a conveyor belt can't produce paperclips without looking at a Google Docs sheet all the time, this will be absorbed by the buffers down the line. Only if the worker fails to meet her monthly target because of this might a loss of revenue occur. But in that case the service has to be down for weeks.

In the case of more complex conversions of time into money, as in most intellectual work, it is even less obvious that short downtimes will cause any measurable harm.

PeterStuer5y ago

Besides the exaggerated figure, I always find these claims bizarre. Sure, there was some momentary loss, but aggregated over a month this will not even register.

optimalsolver5y ago

I was unable to watch the Mogwai - Autorock music video. :-(

Corrado5y ago

In a previous lifetime I removed an "unused" TLS certificate. It turns out that it was a production cert that was being used to secure a whole state's worth of computers.

In my defence, the cert was not labeled properly, nor was it used properly, and there was no documentation. It took us 2 days to create a new cert, apply it to our software, and deliver it to the customer. Those were 2 days I'll never get back. However, when I was finished the process was documented and the cert was labeled, so I guess it's a win.

ilikehurdles5y ago

Coincidentally, Google Authenticator was finally just updated on iOS after many years without an update.

beyondcompute5y ago

I am not sure why they are allowing it. Meaning, why aren’t services completely isolated? Isn’t it obvious that in an intertwined environment those things are bound to happen (as in “question of when, not if”)? I understand that in smaller companies that are limited in resources (access to good developers and pressure to get product to market as soon as possible) we have single points of failure all over the place. But “the smartest developers on the planet”? What is it if not short-sighted disregard for risk management theories and practices? I mean, Calendar and YouTube, say, should be completely separate services hosted in different places; their teams should not even talk to each other. Yes, they can use the same software components, frameworks and technologies. Standardization is very welcome. But decentralization should be an imperative.

Edit: again downvotes started! Thanks to everyone “supporting freedom of expression” :)

robotnikman5y ago

I've been in that situation before at one of my previous jobs, where some important IT infrastructure went down for the whole company. Nowhere near as big a scale as this, but it was easily one of the most stressful moments of my life.

ojosilva5y ago

If this does not improve soon, we're looking at one of the most significant outages in recent internet history, at least from the number of people impacted.

Diederich5y ago

Several others have shared their 'I broke things' experiences, and so I feel compelled to weigh in.

Many years ago, I was directly responsible for causing a substantial percentage of all credit/debit/EBT authorizations from every WalMart store world-wide to time out, and this went on for several days straight.

On the ground, this kind of timeout was basically a long delay at the register. Back then, most authorizations would take four or five seconds. The timeout would add more than 15 seconds to that.

In other words, I gave many tens of millions of people a pretty bad checkout experience.

This stat (authorization time) was and remains something WalMart focuses quite heavily on, in real time and historically, so it was known right away that something was wrong. Yet it took us (Network Engineering) days to figure it out. The root cause summary: I had written a program to scan (parallelized) all of the store networks for network devices. Some of the addresses scanned were broadcast and network addresses, which caused a massive amplification of return traffic which flooded the satellite networks. Info about why it took so long to discover is below.

Back in the 1990s, when this happened, all of the stores were connected to the home office via two way Hughes satellite links. This was a relatively bandwidth limited resource that was managed very carefully for obvious reasons.

I had just started and co-created the Network Management team with one other engineer. Basically prior to my arrival, there had been little systematic management of the network and network devices.

I realized that there was nothing like a robust inventory of either the networks or the routers and hubs (not switches!) that made up those networks.

We did have some notion of store numbers and what network ranges were assigned to them, but that was inaccurate in many cases.

Given that there were tens of thousands of network ranges in question, I wrote a program creatively called 'psychoping' that would ICMP scan all of those network ranges with adjustable parallelism.
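
The root cause described above (probes landing on network and broadcast addresses) is something a scanner can filter out when it builds its target list. The following is a minimal sketch, not a reconstruction of 'psychoping': `scan_targets`, `scan`, and the pluggable `probe` callable are hypothetical, and no real ICMP is sent.

```python
import ipaddress
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List


def scan_targets(cidrs: Iterable[str]) -> Iterator[str]:
    """Yield only the usable host addresses of each network range."""
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # hosts() leaves out the network and broadcast addresses of an
        # ordinary IPv4 subnet, so no probe goes to an address that every
        # device on the segment might answer.
        for host in network.hosts():
            yield str(host)


def scan(cidrs: Iterable[str], probe: Callable[[str], bool], parallelism: int = 4) -> List[str]:
    """Probe every target with at most `parallelism` probes in flight."""
    targets = list(scan_targets(cidrs))
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(probe, targets))
    return [addr for addr, alive in zip(targets, results) if alive]


if __name__ == "__main__":
    # Stand-in probe for illustration: pretends only .1 addresses respond.
    fake_probe = lambda addr: addr.endswith(".1")
    print(scan(["10.1.2.0/30"], fake_probe, parallelism=2))
```

The filter is only as good as the inventory behind it: if a range is really subdivided into smaller subnets than the list says, the broadcast addresses of those inner subnets would still be probed.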

I ran it against the test store networks, talked it over with the senior engineers, and was cleared for takeoff.

Thing is, I didn't start it right away; some other things came up that I had to deal with. I ended up starting it over a week after review.

Why didn't this get caught right away? Well, when timeouts started to skyrocket across the network, many engineers started working on the problem. None of the normal, typical problems were applicable. More troubling, none of the existing monitoring programs looked for ICMP at all, which is what I was using exclusively.

So of course they immediately plugged a sniffer into the network and did data captures to see what was actually going on. And nothing unusual showed up, except a lot of drops.

We're talking > 20 years ago, so know that "sniffing" wasn't the trivial thing it is now. Network Engineering had a few extremely expensive Data General hardware sniffers.

And to these expensive sniffers, the traffic I was generating was invisible.

Two things: the program I wrote to generate the traffic had a small bug and was generating very slightly invalid packets. I don't remember the details, but it had something to do with the IP header.

These packets were correct enough to route through all of the relevant networks, but incorrect enough for the Data General sniffer to not see them.

So...there were a lot of 'intense' discussions between Network Engineering and all of the relevant vendors (Hughes, ACC for the routers, Synoptics and ODS for the hubs).

In the end, a different kind of sniffer was brought in, which was able to see the packets I was generating. I had helpfully put my userid and desk phone number in the packet data, just in case someone needed to track raw packets back to me.

Though the impact was great, and it scared me to death, there were absolutely no negative consequences. WalMart Information Systems was, in the late 1990s, a very healthy organization.

leonidasv5y ago

Makes sense, at work we have an application running on Google Cloud and everything seems to be working. So the outage is probably not at network or infrastructure level.

ikiris5y ago

Went to reply, then saw the username. My guess was lb layer

reddotX5y ago

yeah, not working in Europe
