He typed the confirmation and requested deletion of the snapshots.
He had two browsers open, one for development (CloudFormation, etc.)... but someone had asked him to change something in prod.
Both browsers looked identical; only the account in the top right corner differed.
Both CloudFormation stacks were identical (instance names, etc.).
He had spent all morning launching and deleting the dev environment.
Teammates were joking loudly around his desk right before it happened.
Sadly, he got fired (the company was proud of its cost-savvy choices and had no backups beyond a few days of snapshots, probably the CTO's choice).
Everybody has off days, or just instances where circumstances misalign in just the wrong way. To pretend otherwise is silly; instead, it's the leader's/team's responsibility to ensure that that sort of off day doesn't lead to massive losses, via redundancy & the sort of measures we're talking about here & in the OP. Firing somebody in these circumstances just acts to severely reduce morale, since we all secretly know in our hearts that it very easily could have been us.
Firing in this case just seems retributive. It's not going to bring the lost data back, and you've just eliminated the very person who could have told you the most about the chain of events leading to the incident, to help you guard against it in the future. These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues: a lack of team focus on reliability/quality, a lack of communication or trust about decisions made (or not made) by higher-ups, and so on.
And they are probably the single least likely person to cause a similar incident again -- that person will now likely be double and triple checking their commands for eternity.
If your CTO scattered those landmines all over, then stepping on one is not an operator error. It just sucks.
We had an admin in charge of our storage. He had worked with our old vendor's SAN for years; then we got a new SAN and had him trained and certified on it. He "accidentally" shut down the entire SAN, which brought down the entire company for over 9 hours.
Fast forward two years: he screwed up again and caused a storage outage affecting about 1,100 VMs. Luckily there wasn't much data loss, but it was a painful outage.
Then, a month ago, he took part of the SAN offline.
Some people never learn, and recognizing that early is usually better than letting them keep putting your systems at risk.
These words reminded me of a story about the similar-looking "flaps" and "landing gear" controls on a plane, where crashed airplanes were also blamed on the pilots first, before a trivial engineering/UI solution was implemented: https://www.endsight.net/blog/what-the-wwii-b17-bomber-can-t...
This is why it's good practice to include the environment name in resource names when it makes sense. Even better, don't append the env name; use it as a prefix, like ProdCustomerDb instead of CustomerDbProd, so the environment is the first thing you read. I also like to switch the theme to dark mode in production environments, as most management UIs support this. One other neat trick is to color-code the PS1 prompt on your Linux instances, like red for prod, green for dev.
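A minimal sketch of the PS1 trick for a ~/.bashrc, assuming hosts are named by environment (the "prod-*"/"dev-*" patterns are placeholders; adapt them to whatever naming scheme you actually use):

    # color the prompt by environment: bold red on prod, bold green on dev
    case "$(hostname)" in
      prod-*) PS1='\[\e[1;31m\]\u@\h:\w\$\[\e[0m\] ' ;;
      dev-*)  PS1='\[\e[1;32m\]\u@\h:\w\$\[\e[0m\] ' ;;
    esac

The moment a red prompt shows up where you expected green, you stop typing.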
This is definitely a nice one to add. Though I did once work with someone who believed that all servers should be 100% vanilla and reverted my environment colors.
In container-only shops with no ssh, this is less of an issue, and instead you rely on having different permissions and automations for different environments.
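A rough sketch of what that separation can look like with the AWS CLI, assuming two accounts with made-up IDs and role names: keep prod behind its own profile whose role is read-only, so destructive commands fail unless you deliberately assume something more privileged.

    # ~/.aws/config -- hypothetical account IDs and role names
    [profile dev]
    role_arn = arn:aws:iam::111111111111:role/dev-admin
    source_profile = default

    [profile prod]
    role_arn = arn:aws:iam::222222222222:role/prod-read-only
    source_profile = default

With that in place, "aws --profile prod ec2 delete-snapshot ..." is rejected with an authorization error instead of quietly destroying something; deleting prod snapshots means deliberately switching to a separate, rarely used admin role.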
Basically, I had a habit of starting a new SQL Server Management Studio instance in its own window for each database I was working on. At some point this struck me as wasteful, so I closed all my windows and opened all the databases in one window. Sometime after that I went to delete the test database as a routine maintenance task, but of course I was used to clicking the database at the top of the left pane in SSMS, which had been the test database when it was the only database in a window... but now happened to be the production database. Five minutes later I got a call from the client company that used our system, asking whether any maintenance was going on, because everyone's client had just crashed.
The horror when I realised.
It was educational, though. I don't think I'll make that particular mistake ever again. And my bosses were ace to be fair, probably because I worked my ass off to correct the mess that ensued.