> The whole thing is about 10k APIs that all share the same cluster of 10 databases on the backend, which was never designed to scale like this. This company did $500 million in revenue in 2010 and $15 billion this year, all running on this fking SQL back end. They have a team of 500 devs writing for these apps; the complexity is unbelievable. No one knows how to untangle it and scale out to microservices.
That said, depending on your application and how it's split up, there are quite often easy wins. An example I've seen: a company had huge batch reporting jobs that ran at fixed intervals and hammered the hell out of production, so they were moved to pull data from a read-only replica.
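A minimal sketch of that kind of routing, with hypothetical DSNs and workload names, deciding per workload whether a query goes to the primary or the read-only replica:

```python
# Hypothetical sketch: send heavy reporting/batch reads to a replica DSN,
# everything else to the primary. DSNs and workload names are invented.
PRIMARY_DSN = "postgresql://primary.internal/app"
REPLICA_DSN = "postgresql://replica.internal/app"

# Workloads that tolerate slightly stale reads and would hammer the primary.
REPLICA_WORKLOADS = {"report", "batch", "export"}

def dsn_for(workload: str) -> str:
    """Pick a connection string: replica for reporting, primary for OLTP."""
    return REPLICA_DSN if workload in REPLICA_WORKLOADS else PRIMARY_DSN

print(dsn_for("report"))    # reporting job reads from the replica
print(dsn_for("checkout"))  # transactional work stays on the primary
```

The catch is replication lag: anything reading from the replica has to tolerate data that is seconds (or more) behind the primary.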
"micro-services" is the worst solution to ANY problem. Is not a solution, only another way to write a problematic app.
The problem here is complexity and "micro-services" is the MOST efficient way to add complexity.
And the real problem here is complexity X scalability, which requires simplicity and tunning, stuff that "micro-services" IS NOT mean to solve.
---
I work in the enterprise space, and you bet you can cut 70%* of the code (every time you find stuff like this) and stay single-master-DB even for some very large companies, if you have a half-decent architecture.
*P.S.: note that cutting the code doesn't mean the app ends up at just 30%; it's that 70% is trash to be redone.
HPE has single-image machines that can take up to 16 4th-gen Xeons, which gives a top limit of 960 cores. IBM has POWER10 boxes that go up to 240 cores (but they are POWER10 cores that can run, IIRC, up to 8 threads per core, increasing cache misses but reducing unused execution units).
I'd say one of the only options is an HPE Superdome Flex machine, but as you said, they might run into other bottlenecks at this scale.
I can't fathom what a database is doing that burns so much CPU; usually I run out of I/O (both disk and network) on 128-core machines before maxing the CPUs. Also, the post says they have 4 machines and 10 databases, which is very strange.
If the 10 databases are independent, that seems like the easy way out: siphon them off into separate clusters and you should get some headroom. But if one database is 99% of the load, it won't buy much.
Otherwise, you've got to find better hardware or partition the database somehow. The good news: while I don't think Azure has a 416-core server, they do have 416-vCPU servers[1]. At 2 vCPUs per core, 208 cores is a lot, but (a) these are Skylake cores, and (b) you can get a similar core count in a dual-socket Epyc board these days, with cores that are much newer. Not sure if you can get one of those in a cloud, though.
Edit to add: there's also a lot of potential to move compute out of the database. Without knowing anything about their queries, my experience has been that the most expensive queries are either unnecessary table scans (which can often be fixed) or joins. For joins, sometimes you can fix them to run better, and sometimes it's better to do a 'client-assisted join': first do an indexed query to get the ids of the things you want, then do a big union of queries to get the details. You can tell me how disgusting that is, but it can turn something that takes one round trip and hard processing on the server into something that takes two round trips and is pretty easy for the server. Maybe SQL Server is better at joins than MySQL, though? Sometimes it might not be OK for data-integrity/transactional reasons, but usually it is. Joining might be hard on the clients too, but it's usually easier to add more database clients than to scale the database server.
[1] https://learn.microsoft.com/en-us/azure/virtual-machines/mv2...
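The two-round-trip pattern described above can be sketched like this (schema, data, and threshold are invented for illustration; sqlite3 stands in for the real server):

```python
import sqlite3

# Sketch of a 'client-assisted join': two cheap round trips instead of
# one server-side JOIN. All tables and values here are made up.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_orders_total ON orders(total);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 500.0), (11, 2, 50.0), (12, 1, 900.0);
""")

# Round trip 1: an indexed query that only returns the ids we care about.
ids = [row[0] for row in db.execute(
    "SELECT customer_id FROM orders WHERE total > ?", (100,))]

# Round trip 2: fetch the details with an IN list instead of a JOIN,
# shifting the join work from the server to the client.
placeholders = ",".join("?" * len(ids))
names = {row[0] for row in db.execute(
    f"SELECT name FROM customers WHERE id IN ({placeholders})", ids)}
print(sorted(names))  # -> ['Acme']
```

The server does two simple indexed lookups; deduplicating ids and stitching rows together happens on the client, which is the part that's easy to scale out.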
It might run on ARM. IIRC, Ampere has some large ones with lots of memory bandwidth. Maybe CXL memory can also help mitigate any disk I/O.
What I did was make a table of all the queries being run on my backend, ordered by the number of times they were called and the cost of calling them (I honestly can't remember the measure I used, but it was something like cputime × memory). I then did two things for the top queries.
1) Optimised them where I could.
2) Looked for where they were being used and tried to stop it.
(2) was very successful.
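A minimal sketch of that ranking, assuming query-log rows of (query, cpu_ms, mem_mb); the cost metric here is invented, since the original measure wasn't recalled:

```python
from collections import defaultdict

# Hypothetical query-log rows: (query text, cpu time in ms, memory in MB).
# In practice these would come from the DB's query log or stats tables.
logs = [
    ("SELECT * FROM orders WHERE ...", 5.0, 12.0),
    ("SELECT name FROM customers WHERE id = ?", 0.2, 1.0),
    ("SELECT * FROM orders WHERE ...", 5.0, 12.0),
]

# Aggregate per query: call count and total cost (cpu_ms * mem_mb, invented).
totals = defaultdict(lambda: [0, 0.0])
for query, cpu_ms, mem_mb in logs:
    totals[query][0] += 1
    totals[query][1] += cpu_ms * mem_mb

# Highest total cost first: the queries worth optimising, or removing outright.
ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
for query, (count, cost) in ranked:
    print(f"{count:>5}  {cost:>10.1f}  {query}")
```

Sorting by total cost (frequency × unit cost) rather than unit cost alone is what surfaces the cheap-but-constant queries that option (2) then eliminates at the call site.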
To figure that out, we'd need to look deep into what's happening in the machine, down to counting cache misses, memory bandwidth usage (per channel), QPI link usage (because NUMA), and, maybe, even go down to the usage stats of the CPU execution units.
When they mention that a lot of what used to be stored procedures has been moved to external web services, I get concerned they replaced memory and CPU occupancy with time spent waiting on network I/O.
Assuming the poster Aussiepete80 is Australian, I should point out that the much higher salaries in the US and the favorable E-3 visa have largely brain-drained Australia of its best and brightest. This guy's army (dozens) of DBAs is likely the residue.