Anyone who has been part of that journey knows how painful it really is. A lot of the time the systems fail at all levels, and you have to redesign them from first principles.
I have, but it depends what you mean.
Scenario 1: e-commerce SaaS (think: Amazon but whitelabel, and before CPUs even had AES instructions); Christmas was "fun".
Scenario 2: Video Games. The first day is the worst day when it comes to scale. Everything has to be flawless from day 0 and you get no warning as to what can go wrong.
Yet, somehow, I managed to make highly reliable systems.
In scenario 1, I had an existing system that had to scale up and down with load. This was before the cloud existed and hardware had a 3-4 month lead time, so most of the effort went into optimising existing code, increasing job timeouts, and "quenching" sources that were expensive. We also used to do some 'magic' when serving requests that carried a session token or shopping-cart cookie.
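(For anyone wondering what that kind of 'magic' typically looks like: a common version is to answer anonymous, cookie-less requests from cache and only send state-carrying requests to the expensive application tier. A minimal sketch, with all backend names and cookie keys hypothetical:)

    # Illustrative sketch only - backend names and cookie keys are made up.
    CACHEABLE_METHODS = {"GET", "HEAD"}
    STATEFUL_COOKIES = {"session_token", "cart_id"}

    def pick_backend(method, cookies):
        """Route anonymous, cacheable traffic to the cache tier; anything
        carrying session/cart state goes to the expensive application tier."""
        if method in CACHEABLE_METHODS and not (STATEFUL_COOKIES & set(cookies)):
            return "cache-pool"   # cheap: pre-rendered or cached page
        return "app-pool"         # costly: full dynamic render with session state

    # pick_backend("GET", {})                 -> "cache-pool"
    # pick_backend("GET", {"cart_id": "abc"}) -> "app-pool"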
In scenario 2, we have a clean-room implementation and no legacy, which is a blessing but also a curse: there's no possibility to sample real usage, but you also don't need to worry about making breaking changes that are for the better. With legacy you have to figure out how to migrate to the new behaviour gradually.
So, pros and cons... but it's not like handling huge load hasn't been done before. Computers are faster than they have ever been, and while my personal opinion is that operational knowledge is dying (due to a general disdain for the people who actually used to run systems at scale, rather than just writing hopeful "eventually consistent" YAML that they call deterministic), the systems that exist today hold your hand much better than they did for me 20 years ago.
And I ran 1% of web traffic with an ops team of 5 back then. So, idk what's going on here.
EDIT: Likely people are flagging me because I sound arrogant (or I hurt their feelings by talking bad about YAML-ops), but all I am doing is answering the question presented based on my experience.
I once worked on a team that had to scale a system by 100x, and its downstream dependencies were various 3rd-party APIs and data sources, most of which had no real SLAs to speak of and extremely high variance in latencies and data-transfer patterns. This basically required rearchitecting everything, including our clients, because the typical transactional request/response access pattern was too tightly coupled, and any hiccup in an external API quickly rippled up through the call tree and caused outages 3+ services removed from ours. In some cases, the re-architecting went all the way to the UI.
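(One standard way to break that kind of coupling - a toy illustration, not a description of what we actually built - is to take the slow third-party call out of the synchronous path: accept the request, hand it to a worker, and let the client poll for the result.)

    # Toy sketch of decoupling a slow third-party call from the request path.
    # All names are hypothetical; a real system would use a durable queue.
    import queue, threading, time, uuid

    jobs = {}                 # job_id -> {"status": ..., "result": ...}
    work = queue.Queue()

    def submit(payload):
        """Return immediately with a job id instead of blocking on the upstream API."""
        job_id = str(uuid.uuid4())
        jobs[job_id] = {"status": "pending", "result": None}
        work.put((job_id, payload))
        return job_id

    def worker():
        while True:
            job_id, payload = work.get()
            try:
                time.sleep(0.1)   # stand-in for the slow, high-variance 3rd-party call
                jobs[job_id] = {"status": "done", "result": {"echo": payload}}
            except Exception as exc:
                jobs[job_id] = {"status": "failed", "result": str(exc)}

    threading.Thread(target=worker, daemon=True).start()

    job = submit({"sku": "123"})
    time.sleep(0.3)
    print(jobs[job])          # {'status': 'done', 'result': {'echo': {'sku': '123'}}}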
Years later, I led a company-wide effort to optimize our entire user-facing application infrastructure so it wouldn't fall over from sharply spiking user traffic, touching dozens of services across dozens of teams. We did a brief study and realized there was no single recommendation (like "tune your caches") we could give that would help every team, because each one had very different resource-usage patterns and hence different bottlenecks. Our approach was basically to farm the task out to each team and say "here are some common metrics to look into, some common issues to look for, and some common solutions; get back to us if you need help." We spent a lot of time on the help.
I have no idea what the patterns for GitHub are, but I'll note it's much more than just a DB, and it has a dependency (Actions) with extremely high variance in latencies and resource usage.
I understand, that wasn't a comment on your efforts back then, just that it is a solved problem today. But that does not mean other scaling problems are comparable or comparably solved. The universe of scaling problems is immense!
Worse, different problems occur at different scales. In the 3rd party API system, years after the first re-architecting, some use-cases developed issues at scale that exceeded the already high operational parameters we benchmarked at, and required us to re-architect the service again, including building out a whole new cluster so we could isolate that traffic entirely.
It is really hard to predict how things will break until they do.
(As an aside, I remember reading a lot of interesting things about Blizzard's technology, even if Blizzard didn't publish those themselves. There were many people who researched their products and published their findings. For instance, someone analyzed Wireshark traces and published a very detailed report about how they tuned their server-side networking stack. One thing that stood out was that Blizzard used TCP for WoW, whereas the conventional wisdom was UDP for real-time multiplayer!)
For example, if you have TCP_NODELAY and a few thousand players, you'll be swimming in about 1.2M packets per second pretty quickly.
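(For context, TCP_NODELAY disables Nagle's algorithm, so every small game-state update goes out immediately as its own packet - great for latency, terrible for packet rate. Setting it is a one-liner:)

    import socket

    # Disable Nagle's algorithm: each small write is sent immediately as its
    # own packet, minimising latency at the cost of packets-per-second.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)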
That packet rate is enough to completely crush any stateful firewall (UDP would have passed through because there's no state to check), so we had to do ACLs in network hardware instead, and append a magic number to the traffic so we could prevent flooding.
Another thing we found was that Windows networking activity only happens on Core 0 (Windows Server 2012 R2), and that at 1.2M PPS the driver crashes.
Logging in to an AD-connected Windows machine when its network interface is dead is not ideal.
So, yeah, avoid TCP.
GitHub's own April post-mortem names the causes in their own words: tight coupling allowing localised failures to cascade, and inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory, it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them.
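(And to show how little is needed for the load-shedding half of that syllabus, the whole idea fits in a few lines - the threshold below is made up, and a real service would return a 503 with Retry-After rather than raise:)

    # Toy load shedder: reject new work once in-flight requests exceed a budget,
    # instead of queueing until everything times out. Threshold is made up.
    import threading

    MAX_IN_FLIGHT = 200
    _in_flight = 0
    _lock = threading.Lock()

    class Overloaded(Exception):
        """Raised instead of queueing when the service is saturated."""

    def handle(request, do_work):
        global _in_flight
        with _lock:
            if _in_flight >= MAX_IN_FLIGHT:
                raise Overloaded("shedding load")   # fail fast, don't queue
            _in_flight += 1
        try:
            return do_work(request)
        finally:
            with _lock:
                _in_flight -= 1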
Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Lord of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor.
Your 100x rearchitecture story actually argues my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed systems coursework.
So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover.
1) The general cause of issues in these cases is that certain assumptions no longer hold, and above a certain level of complexity, there are too many assumptions to keep track of, and so things fail in surprising ways. Like, the need for auto-scaling was well-known and Amazon did have that solution in place. But I recall the 2018 Prime Day was record-breaking, so it is likely the very same auto-scaling service that was supposed to save them fell over because they forecast too conservatively! (As an aside, I follow a senior AMZN engineer who's made his career out of load-testing their services, and he has many fun war stories.)
2) The resiliency work is not done upfront because it is additional complexity that may not be needed. "You're not Google" and YAGNI are sound advice most of the time. So the system is designed with some "reasonable" assumptions (which... see above!). At larger companies, resiliency mechanisms (load-shedding etc.) are built into standard components, but then...
3) Different performance profiles require different resiliency mechanisms, and it's not always clear what they would be.
Going back to the example of the 3rd-party API service: when we inherited it around 2012, it was built on standard infrastructure components with built-in resiliency mechanisms... but those were designed for internal services with latencies expected in milliseconds, whereas our downstream calls could take seconds or even minutes. Still, at the traffic levels of the time, it worked fine with a little tuning and served the company well... until we (or the 3rd-party APIs!) hit a certain scale and started seeing issues. At that point we extrapolated the trends, benchmarked heavily, and re-architected. And then we hit new scales and new use-cases that surfaced new issues, so we had to re-architect again!
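(To illustrate the mismatch - the numbers here are invented, not our actual configuration - the same "standard" resiliency knobs look completely different once a dependency answers in seconds or minutes rather than milliseconds:)

    # Illustrative only: default policies tuned for fast internal RPCs versus
    # what a slow, high-variance third-party API actually needs. Numbers made up.
    from dataclasses import dataclass

    @dataclass
    class DependencyPolicy:
        timeout_s: float        # per-call deadline
        max_concurrent: int     # bulkhead: calls allowed in flight at once
        retries: int

    # Typical defaults in an internal-RPC stack: tight deadlines, big bulkhead.
    INTERNAL_DEFAULT = DependencyPolicy(timeout_s=0.25, max_concurrent=500, retries=2)

    # A slow external API needs long deadlines, a small bulkhead so stuck calls
    # can't eat every worker, and no blind retries that amplify the load.
    FLAKY_PARTNER_API = DependencyPolicy(timeout_s=120.0, max_concurrent=20, retries=0)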
The point is, the system's performance profile was very different from typical web services (the primary culprits being extremely high variance in downstream characteristics and very non-linear growth) and it was non-obvious to scale with conventional wisdom. I do not know what's happening at GitHub, but I suspect they have some similarly unique performance aspects.
Large increase, but nothing existential.
GitHub would have obligations to MS investors to make accurate projections just like Microsoft itself, right?
And that starts by laying off your best engineers, I guess.