We built a lot of in-house infra. We had few runbooks. We ran tons of high-scale services with a pretty low corresponding headcount. And we had a great culture, that was definitely more personality based than the rest of Big Tech. Individuals had Opinions (TM). On-call was about having a durable understanding of the system and sleuthing around to figure out what was happening (along with a strong emphasis on learning from failures and putting systems/practices into place to fix those.) And we, as engineers, were weird. My tech lead rolled into work at 11 AM every morning on their skateboard. I used to take a 2 hour daily break from 2-4 PM, go home at 6, and work again between 8-10 PM (ish, this was flexible) before I went to bed and was known to enjoy coffee with fried chicken. Our infrastructure and our services reflected our personality quirks and the interpersonal relationships we had in the office. On-call was one of my favorite things at the company because all the different personalities would weigh in with their different quirks and would offer different perspectives on big problems, and together we'd solve tough issues.
Somewhere along the ZIRP-age this all changed. We needed to become a Real Company (TM) now. We needed to have runbooks. We needed to have Boring Technology. We were poised for huge growth after all (aka our stock value shot through the roof.) The new managers we brought in needed to build out new teams to grow fast, fast, fast. How can we grow when we had such immature processes? We started building runbooks and hiring teams to manage them. Incidents started requiring 3-pages minimum of paperwork to document, which was enforced by an incident management team. Teams dreaded these incidents and where once we collaborated to fix problems, now teams became defensive and combative during incidents. Now we needed the incident management team to force engineers to cooperate during incidents because each team was trying to be as defensive as possible. Managers stopped thinking in terms of individuals with strengths and weaknesses and started thinking about headcount, both cost and allocation. Headcount became everything. With this change, attrition also spiked.
What used to take 1-2 engineers to spin up over a week or two now took months. We load tested our new services, but now we needed to make load tests repeatable and runnable in a runbook. Teams became extra defensive about features because the cost of every incident became so high. Nobody wanted to be the team that missed an integration test which came up as a cause of an incident. Our net reliability didn't actually change though the thinking was that repeatability would allow more seamless swapping of headcount. Program managers managed migrations and we started creating status meeting meetings which rolled up statuses of multiple ongoing initiatives cascading over multiple teams.
My own experience at the company went from having fun writing code and interacting with quirky folks at work to dealing with engineers who were dotting their is and crossing their ts in every aspect of their job. The managers treated us as headcount and so headcount we became. It's been a highly depressing arc to a job I loved and where I built a lot of high-scale code at, but perhaps the most frustrating has been watching our velocity decrease despite our headcount ballooning due to the overhead of programs, migrations, and incident management that developed their own bureaucracies. The saddest part is that the oldest parts of the company have become so essential and the bureaucracy so thick around them that replacing them has become really challenging. The parts of the company that were developed with the least care are the hardest to replace.