> The whole thing is about 10k APIs that all share the same cluster of 10 databases on the backend, which was never designed to scale like this. This company did $500 million in revenue in 2010 and $15 billion this year, all running on this fking SQL back end. They have a team of 500 devs writing for these apps; the complexity is unbelievable. No one knows how to untangle it and scale out to microservices.
That said, depending on your application and how it's split up, there are quite often easy wins. An example I've seen: a company had huge batch reporting jobs that ran at fixed intervals and hammered the hell out of production, so they were moved to pull data from a read-only replica.
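A minimal sketch of that kind of routing, with hypothetical DSNs and workload names, deciding per workload whether a query goes to the primary or the read-only replica:

```python
# Hypothetical sketch: send heavy reporting/batch reads to a replica DSN,
# everything else to the primary. DSNs and workload names are invented.
PRIMARY_DSN = "postgresql://primary.internal/app"
REPLICA_DSN = "postgresql://replica.internal/app"

# Workloads that tolerate slightly stale reads and would hammer the primary.
REPLICA_WORKLOADS = {"report", "batch", "export"}

def dsn_for(workload: str) -> str:
    """Pick a connection string: replica for reporting, primary for OLTP."""
    return REPLICA_DSN if workload in REPLICA_WORKLOADS else PRIMARY_DSN

print(dsn_for("report"))    # reporting job reads from the replica
print(dsn_for("checkout"))  # transactional work stays on the primary
```

The catch is replication lag: anything reading from the replica has to tolerate data that is seconds (or more) behind the primary.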
"micro-services" is the worst solution to ANY problem. Is not a solution, only another way to write a problematic app.
The problem here is complexity and "micro-services" is the MOST efficient way to add complexity.
And the real problem here is complexity X scalability, which requires simplicity and tunning, stuff that "micro-services" IS NOT mean to solve.
---
I work in the enterprise space, and you bet you can cut 70%* of the code (every time you find stuff like this) and stay single-master-DB even for some very large companies, if you have a half-decent architecture.
*P.S.: note that cutting the code doesn't mean the app ends up at just 30%; it's that 70% is trash to be redone.
HPE has single-image machines that can take up to 16 4th-gen Xeons, which gives a top limit of 960 cores. IBM has POWER10 boxes that go up to 240 cores (but they are POWER10 cores that can run, IIRC, up to 8 threads per core, increasing cache misses but reducing unused execution units).
I'd say one of the only options is an HPE Superdome Flex machine, but as you said, they might run into other bottlenecks at this scale.
I can't fathom what a database is doing that burns so much CPU; usually I run out of I/O (both disk and network) on 128-core machines before maxing the CPUs. Also, the post says they have 4 machines and 10 databases, which is very strange.
If the 10 databases are independent, that seems like the easy way out: siphon them off into separate clusters and you should get some headroom. But if one database is 99% of the load, it won't buy much.
Otherwise, you've got to find better hardware or partition the database somehow. The good news: while I don't think Azure has a 416-core server, they do have 416-vCPU servers[1]. At 2 vCPUs per core, 208 cores is a lot, but (a) these are Skylake cores, and (b) you can get a similar core count in a dual-socket Epyc board these days, with cores that are much newer. Not sure if you can get one of those in a cloud, though.
Edit to add: there's also a lot of potential to move compute out of the database. Without knowing anything about their queries, my experience has been that the most expensive queries are either unnecessary table scans (which can often be fixed) or joins. For joins, sometimes you can fix them to run better, and sometimes it's better to do a 'client-assisted join': first do an indexed query to get the ids of the things you want, then do a big union of queries to get the details. You can tell me how disgusting that is, but it can turn something that takes one round trip and hard processing on the server into something that takes two round trips and is pretty easy for the server. Maybe SQL Server is better at joins than MySQL, though? Sometimes it might not be OK for data-integrity/transactional reasons, but usually it is. Joining might be hard on the clients too, but it's usually easier to add more database clients than to scale the database server.
[1] https://learn.microsoft.com/en-us/azure/virtual-machines/mv2...
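The two-round-trip pattern described above can be sketched like this (schema, data, and threshold are invented for illustration; sqlite3 stands in for the real server):

```python
import sqlite3

# Sketch of a 'client-assisted join': two cheap round trips instead of
# one server-side JOIN. All tables and values here are made up.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_orders_total ON orders(total);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 500.0), (11, 2, 50.0), (12, 1, 900.0);
""")

# Round trip 1: an indexed query that only returns the ids we care about.
ids = [row[0] for row in db.execute(
    "SELECT customer_id FROM orders WHERE total > ?", (100,))]

# Round trip 2: fetch the details with an IN list instead of a JOIN,
# shifting the join work from the server to the client.
placeholders = ",".join("?" * len(ids))
names = {row[0] for row in db.execute(
    f"SELECT name FROM customers WHERE id IN ({placeholders})", ids)}
print(sorted(names))  # -> ['Acme']
```

The server does two simple indexed lookups; deduplicating ids and stitching rows together happens on the client, which is the part that's easy to scale out.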
It might run on ARM. IIRC, Ampere has some large ones with lots of memory bandwidth. Maybe CXL memory can also help mitigate any disk I/O.
What I did was make a table of all the queries being run on my backend, ordered by the number of times they were called and the cost of calling them (I honestly can't remember the measure I used, but it was something like cputime × memory). I then did two things for the top queries.
1) Optimised them where I could.
2) Looked for where they were being used and tried to stop it.
(2) was very successful.
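A minimal sketch of that ranking, assuming query-log rows of (query, cpu_ms, mem_mb); the cost metric here is invented, since the original measure wasn't recalled:

```python
from collections import defaultdict

# Hypothetical query-log rows: (query text, cpu time in ms, memory in MB).
# In practice these would come from the DB's query log or stats tables.
logs = [
    ("SELECT * FROM orders WHERE ...", 5.0, 12.0),
    ("SELECT name FROM customers WHERE id = ?", 0.2, 1.0),
    ("SELECT * FROM orders WHERE ...", 5.0, 12.0),
]

# Aggregate per query: call count and total cost (cpu_ms * mem_mb, invented).
totals = defaultdict(lambda: [0, 0.0])
for query, cpu_ms, mem_mb in logs:
    totals[query][0] += 1
    totals[query][1] += cpu_ms * mem_mb

# Highest total cost first: the queries worth optimising, or removing outright.
ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
for query, (count, cost) in ranked:
    print(f"{count:>5}  {cost:>10.1f}  {query}")
```

Sorting by total cost (frequency × unit cost) rather than unit cost alone is what surfaces the cheap-but-constant queries that option (2) then eliminates at the call site.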
To figure that out, we'd need to look deep into what's happening in the machine, down to counting cache misses, memory bandwidth usage (per channel), QPI link usage (because NUMA), and, maybe, even go down to the usage stats of the CPU execution units.
When they mention that a lot of what used to be stored procedures has been moved to external web services, I get concerned they replaced memory and CPU occupancy with time spent waiting on network I/O.
Assuming the poster Aussiepete80 is Australian, I should point out that the much higher salaries in the US and the favorable E-3 visa have largely brain-drained Australia of its best and brightest. This guy's army (dozens) of DBAs is likely the residue.