He typed the confirmation and requested deletion of the snapshots.
He had two browsers open, one for development (CloudFormation, etc.)... but someone had asked him to change something in prod.
Both browsers looked identical; only the account in the top right corner differed.
Both CloudFormation stacks were identical (instance names, etc.).
He had spent all morning launching and deleting the dev environment.
Teammates were joking loudly around his desk right before it happened.
Sadly, he got fired (the company was proud of its cost-savvy choices and had no backups beyond a few days of snapshots, probably the CTO's choice).
Everybody has off days, or just instances where circumstances misalign in just the wrong way. To pretend otherwise is silly; instead, it's the leader's/team's responsibility to ensure that that sort of off day doesn't lead to massive losses, via redundancy & the sort of measures we're talking about here & in the OP. Firing somebody in these circumstances just acts to severely reduce morale, since we all secretly know in our hearts that it very easily could have been us.
Firing in this case just seems retributive. It's not going to bring the lost data back, and you've just eliminated the very person who could have told you the most about the chain of events leading to the incident, to help you guard against it in the future. These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues: a lack of team focus on reliability/quality, a lack of communication or trust about decisions made (or not made) by higher-ups, and so on.
And they are probably the single least likely person to cause a similar incident again -- that person will now likely be double and triple checking their commands for eternity.
If your CTO scattered those landmines all over, then stepping on one is not an operator error. It just sucks.
We had an admin in charge of our storage. He had worked with our old vendor's SAN for years; then we got a new SAN and had him trained and certified on it. He "accidentally" shut down the entire SAN, which brought down the entire company for over 9 hours.
Fast forward two years: he screwed up again and caused a storage outage affecting about 1,100 VMs. Luckily there wasn't much data loss, but it was a painful outage.
Then, a month ago, he took part of the SAN offline.
Some people never learn, and recognizing that early is usually better than letting them keep putting your systems at risk.
These words reminded me of a story about the similar-looking "flaps" and "landing gear" controls on a plane, where crashed airplanes were also blamed on the pilots first, before a trivial engineering/UI solution was implemented: https://www.endsight.net/blog/what-the-wwii-b17-bomber-can-t...
This is why it's good practice to include the environment name in resource names when it makes sense. Even better, don't append the env name; use it as a prefix, like ProdCustomerDb instead of CustomerDbProd, so the environment is the first thing you read. I also like to switch the theme to dark mode in production environments, as most management UIs support this. One other neat trick is to color-code the PS1 prompt on your Linux instances, like red for prod, green for dev.
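A minimal sketch of the PS1 trick for a ~/.bashrc, assuming hosts are named by environment (the "prod-*"/"dev-*" patterns are placeholders; adapt them to whatever naming scheme you actually use):

    # color the prompt by environment: bold red on prod, bold green on dev
    case "$(hostname)" in
      prod-*) PS1='\[\e[1;31m\]\u@\h:\w\$\[\e[0m\] ' ;;
      dev-*)  PS1='\[\e[1;32m\]\u@\h:\w\$\[\e[0m\] ' ;;
    esac

The moment a red prompt shows up where you expected green, you stop typing.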
This is definitely a nice one to add. Though I did once work with someone who believed that all servers should be 100% vanilla and reverted my environment colors.
In container-only shops with no ssh, this is less of an issue, and instead you rely on having different permissions and automations for different environments.
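A rough sketch of what that separation can look like with the AWS CLI, assuming two accounts with made-up IDs and role names: keep prod behind its own profile whose role is read-only, so destructive commands fail unless you deliberately assume something more privileged.

    # ~/.aws/config -- hypothetical account IDs and role names
    [profile dev]
    role_arn = arn:aws:iam::111111111111:role/dev-admin
    source_profile = default

    [profile prod]
    role_arn = arn:aws:iam::222222222222:role/prod-read-only
    source_profile = default

With that in place, "aws --profile prod ec2 delete-snapshot ..." is rejected with an authorization error instead of quietly destroying something; deleting prod snapshots means deliberately switching to a separate, rarely used admin role.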
Basically, I had a habit of starting a new SQL Server Management Studio instance in its own window for each database I was working on. At some point this struck me as wasteful, so I closed all my windows and opened all the databases in one window. Sometime after that I went to delete the test database as a routine maintenance task, but of course I was used to clicking the database at the top of the left pane in SSMS, which had been the test database when it was the only database in a window... but now happened to be the production database. Five minutes later I got a call from the client company that used our system, asking whether any maintenance was going on, because everyone's client had just crashed.
The horror when I realised.
It was educational, though. I don't think I'll make that particular mistake ever again. And my bosses were ace to be fair, probably because I worked my ass off to correct the mess that ensued.