I don't envy their position of having to scale that fast on something that has to be instant and real-time. As far as I know, you can't do CDN/edge caching shenanigans with a remote git repository like Google can with a YouTube video. It's gotta always be reading/writing to the latest, single source of truth.
The easy solutions like caching and read replicas don't work, and you're forced to go the route of sharding or similar techniques that have much more painful tradeoffs.
I'm not sure if that's why everything keeps breaking, but at that scale write-heavy workloads are never going to be easy.
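To make the sharding tradeoff concrete, here's a minimal sketch of naive hash-based sharding. Everything in it is hypothetical (the shard counts, the repo names, the `shard_for` helper); it's just the textbook modulo scheme, not a claim about how any real git host routes repositories.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, purely for illustration


def shard_for(repo: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a repo name to a shard index with a stable hash (naive modulo scheme)."""
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(repos: list[str], old_count: int, new_count: int) -> float:
    """Fraction of repos whose shard changes when the shard count changes."""
    moved = sum(1 for r in repos if shard_for(r, old_count) != shard_for(r, new_count))
    return moved / len(repos)


# Made-up repo names, only used to exercise the functions.
repos = [f"user{i}/project{i}" for i in range(1000)]

# Every write for a given repo lands on exactly one shard...
print(shard_for("user1/project1"))
# ...but adding a single shard reshuffles most repos under plain modulo hashing.
print(f"{moved_fraction(repos, 8, 9):.0%} of repos would have to move")
```

That reshuffling is one of the painful parts: with plain modulo hashing, growing from 8 to 9 shards relocates the large majority of repos, which is why people usually reach for consistent hashing or a lookup table instead, each with its own costs.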
However, they have reported numbers along rather inconsistent dimensions. Historically they've focused on the number of repos and users, later on PRs and issues, and often on catch-all terms like "contributions", which include all of those plus comments, etc. But the number of commits alone (which apparently is the main culprit now?) has been mentioned only sporadically. This has made it hard to get a consistent sense of historical growth.
Without any other information, however, it is reasonable to assume that a 14x increase in commits is the prime candidate for the instability, especially since commits are write traffic, which is much harder to scale than read traffic. Plus, every 3-5x increase in scale can reveal bottlenecks in your distributed systems that you never knew existed, so they probably have like 2-3 "generations" of bottlenecks to figure out!
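For anyone who wants to sanity-check the "generations" estimate, here's a quick back-of-the-envelope sketch. It assumes a new bottleneck shows up at every fixed multiple of scale (3x or 5x), which is a rule of thumb rather than a law; the `generations` helper is just a name made up for this example.

```python
import math


def generations(growth: float, step: float) -> float:
    """Number of successive `step`-fold jumps contained in an overall `growth`-fold increase."""
    return math.log(growth) / math.log(step)


growth = 14  # the 14x increase in commits discussed above

for step in (3, 5):
    print(f"new bottleneck every {step}x: about {generations(growth, step):.1f} generations")
```

That works out to roughly 2.4 generations at 3x per step and 1.6 at 5x, so somewhere around two rounds of surprise bottlenecks, in the same ballpark as the rough estimate above.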
Think about the countless actions that have to run at almost every push and PR push! Also, remember that we used to rely on external services for "actions", and they basically killed the competition by offering their own CI actions at no cost to most users.
Also, they did a lot of reworks in recent years, not necessarily for the better (like the PR diff page), and probably not in the most efficient way.