undefined | Better HN

0 pointsjeffffff1mo ago0 comments

> None of the things people care about really get much out of "unified memory". GPUs need a lot of memory bandwidth, but CPUs generally don't and it's rare to find something which is memory bandwidth bound on a CPU that doesn't run better on a GPU to begin with. Not having to copy data between the CPU and GPU is nice on paper but again there isn't much in the way of workloads where that was a significant bottleneck.

the bottleneck in lots of database workloads is memory bandwidth. for example, hash join performance with a build side table that doesn't fit in L2 cache. if you analyze this workload with perf, assuming you have a well written hash join implementation, you will see something like 0.1 instructions per cycle, and the memory bandwidth will be completely maxed out.

similarly, while there have been some attempts at GPU accelerated databases, they have mostly failed exactly because the cost of moving data from the CPU to the GPU is too high to be worth it.

i wish aws and the other cloud providers would offer arm servers with apple m-series levels of memory bandwidth per core, it would be a game changer for analytical databases. i also wish they would offer local NVMe drives with reasonable bandwidth - the current offerings are terrible (https://databasearchitects.blogspot.com/2024/02/ssds-have-be...)

0 comments

AnthonyMouse1mo ago

> the bottleneck in lots of database workloads is memory bandwidth.

It can be depending on the operation and the system, but database workloads also tend to run on servers that have significantly more memory bandwidth:

> i wish aws and the other cloud providers would offer arm servers with apple m-series levels of memory bandwidth per core, it would be a game changer for analytical databases.

There are x64 systems with that. Socket SP5 (Epyc) has ~600GB/s per socket and allows two-socket systems, Intel has systems with up to 8 sockets. Apple Silicon maxes out at ~800GB/s (M3 Ultra) with 28-32 cores (20-24 P-cores) and one "socket". If you drop a pair of 8-core CPUs in a dual socket x64 system you would have ~1200GB/s and 16 cores (if you're trying to maximize memory bandwidth per core).

The "problem" is that system would take up the same amount of rack space as the same system configured with 128-core CPUs or similar, so most of the cloud providers will use the higher core count systems for virtual servers, and then they have the same memory bandwidth per socket and correspondingly less per core. You could probably find one that offers the thing you want if you look around (maybe Hetzner dedicated servers?) but you can expect it to be more expensive per core for the same reason.