A few learning experiences. Elastic was brand new to our mix, so not a lot of domain knowledge there. We discovered how dangerous a handful of curl commands could be with the 'stock' permissions the developers had, and fixed that. It also became a nice conversation on code reviews, signoff, and deadlines, and on actual DR readiness and how long a restore really takes. We got a lot of value out of that mistake.
She is still my favorite developer. I'd steal her for my personal team any day had she not been stolen from our group by another a couple years later. (shakes fist) As one would expect, she apologized and learned the lessons. She was one of the youngest people we ended up giving root on a 12B-document prod instance, because I knew she would do it carefully and correctly.
Your dev lead/senior engineer should never have been "too busy" to supervise someone doing that on their first day.
The list never ends.
I swear folks post sock puppets just so they can yell at them...
- Dev installation guides with credentials to prod.
- Tests that delete everything, not just what they create.
- The obvious one: giving access to delete production on the first day.
- Were they lacking backups?
Anyway, for a real story, back in 2002 Sears before they totally tanked tried to open a home decor business called The Great Indoors. I was one of their first retail employees hired to work the stock room at a new opening. First day on the job, a week before the place was going to open, someone spends about 30 seconds showing me how to operate some forklift-like crate carrier called a wave or something like that, and I'm supposed to move some stuff to another floor via freight elevator. I accidentally accelerate when trying to slow down and promptly destroy the elevator.
Store manager was irate and promptly spends 10 minutes screaming at me in front of everyone and then sends me home permanently. You know what? 21 year-old me internalized that shit and believed I was actually at fault, but in retrospect, that place was bullshit and both Sears and The Great Indoors deserved the fate of eventually going out of business. The ensuing two decades have been up and down, but I'm in a great place now. I hope, if this really happened, that this junior dev landed all right, too. Life is way too short to stick with a toxic workplace, and when you don't have a family to feed and still have the freedom to just walk away, you absolutely should.
https://old.reddit.com/r/cscareerquestions/comments/6ez8ag/a...
(And there are browser extensions that will turn all reddit links into old.reddit.com links).
Luckily we were pulling full snapshots of the relevant tables into CSV exports (for ingestion into Redshift for reporting), so it was just a matter of grabbing the most recent of those, reinserting the data, and having some of the warehouse workers do some cycle counts to spot check. Still was a nerve-wracking and awkward conversation with my boss, though, lol
We have simple rules to prevent significant data loss:
1. Deleting data is not permitted and mutable objects are highly discouraged. No deletes/modifications == greatly reduced possibility of data loss from app code errors.
I lied a little bit.
There are functions that can remove data, but they are hardwired to refuse to work unless it is beyond question that they can't remove live prod data.
For example, the function will refuse if the collection is not prefixed with "tmp" or "test".
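To make that concrete, here is a minimal sketch of such a guard. It is not our actual code: `drop_collection`, `RefusedDrop` and the dict standing in for the database are all made-up names for illustration, and the prefix list is just the two prefixes mentioned above.

```python
# Hypothetical sketch: a dict of name -> documents stands in for the database.
SAFE_PREFIXES = ("tmp", "test")  # assumed list, taken from the two prefixes named above


class RefusedDrop(Exception):
    """Raised when a destructive call targets a collection that might be live."""


def drop_collection(db: dict, name: str) -> None:
    # Refuse unless the name unambiguously marks scratch data.
    if not name.startswith(SAFE_PREFIXES):
        raise RefusedDrop(f"refusing to drop {name!r}: not a tmp/test collection")
    db.pop(name, None)
```

The point is that the check lives inside the destructive function itself, so no caller can skip it by accident.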
For production objects we have a system where we basically do CoW: we create new versions of the documents, and a vacuuming system archives old versions. The archive is preserved for a minimum period of time (2 weeks) to give us a chance to react in case somebody makes a blunder and screws up some rules so that too much gets removed.
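Roughly, the idea looks like the sketch below. This is an in-memory toy under my own assumptions, not the real system: `VersionedStore`, its fields and the explicit `now` argument are invented for illustration, and only the 2-week retention comes from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

ARCHIVE_RETENTION = timedelta(weeks=2)  # minimum retention mentioned above


@dataclass
class VersionedStore:
    # doc_id -> list of (written_at, body), newest last
    live: dict = field(default_factory=dict)
    # entries are (archived_at, doc_id, written_at, body)
    archive: list = field(default_factory=list)

    def write(self, doc_id: str, body: dict, now: datetime) -> None:
        # Copy-on-write: append a new version, never mutate an old one.
        self.live.setdefault(doc_id, []).append((now, dict(body)))

    def read(self, doc_id: str) -> dict:
        # The latest version is the current document.
        return self.live[doc_id][-1][1]

    def vacuum(self, now: datetime) -> None:
        # Move superseded versions into the archive instead of deleting them.
        for doc_id, versions in self.live.items():
            for written_at, body in versions[:-1]:
                self.archive.append((now, doc_id, written_at, body))
            del versions[:-1]
        # Only archive entries older than the retention window are dropped.
        self.archive = [e for e in self.archive if now - e[0] < ARCHIVE_RETENTION]
```

A bad vacuum rule in this model can only push too much into the archive, and the archive itself is what gives you the two weeks to notice.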
2. Remove write access to the database from every single person. No single employee should have write access to the database. No single employee should bear responsibility of having to work with an account that can let them destroy the database. No employee has access to PROD credentials, which are generated automatically and not present in configuration or application server.
Instead, if you need to introduce some changes to the database, an app was created where you can write your code as a kind of job that can modify the database. The job is not allowed to take any parameters (it only has a name) so that it is possible to audit exactly what it is going to do. This code then goes through the regular development process, including code reviews, automated tests, etc. Once deployed to PROD you can go to the API and execute the job by name within the PROD context.
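A stripped-down sketch of the job idea, to show why "no parameters" matters. Everything here is hypothetical (the `JOBS` registry, the `job` decorator, `run_job`, and the example job itself); the real app has an API and deployment pipeline around it that this leaves out.

```python
from typing import Callable, Dict

# Hypothetical registry: job name -> function that receives only the db handle.
JOBS: Dict[str, Callable[[dict], None]] = {}


def job(name: str):
    """Register a job under a fixed, unique name."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        if name in JOBS:
            raise ValueError(f"duplicate job name: {name}")
        JOBS[name] = fn
        return fn
    return register


@job("backfill_order_currency")
def backfill_order_currency(db: dict) -> None:
    # Everything the job does is hard-coded, so the reviewed code is the whole story.
    for order in db["orders"]:
        order.setdefault("currency", "USD")


def run_job(name: str, db: dict) -> None:
    # The caller supplies only the name; the db handle comes from the runner.
    # An unknown name raises KeyError rather than doing anything.
    JOBS[name](db)
```

Because the caller can pass nothing but a name, whoever triggers `run_job("backfill_order_currency", db)` in PROD cannot make it do anything that was not in the reviewed code.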
Data loss prevention is a significant issue that no CTO should "dump" on his/her employees.
Also not accepting responsibility for everything that happens in their department is a sign of lack of leadership.