undefined | Better HN

0 pointsfulafel1y ago0 comments

40 bits per clock in a 8-wide core gets you 5 bits per instruction, and we have AVX512 instructions to feed, with operand sizes 100x that (and there are multiple operands).

Modern chips do face the memory wall. See eg here (though about Zen 5) where they in the same vein conclude "A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth."

0 comments

2 comments · 2 top-level

adrian_b1y ago

The throughput of the AVX-512 computation instructions is matched to the throughput of loads from the L1 cache memory, on all CPUs.

Therefore to reach the maximum throughput, you must have the data in the L1 cache memory. Because L1 is not shared, the throughput of the transfers from L1 scales proportionally with the number of cores, so it can never become a bottleneck.

So the most important optimization target for the programs that use AVX-512 is to ensure that the data is already located in L1 whenever it is needed. To achieve this, one of the most important things is to use memory access patterns that will trigger the hardware prefetchers, so that they will fill the L1 cache ahead of time.

The main memory throughput is not much lower than that of the L1 cache, but the main memory is shared by all cores, so if all cores want data from the main memory at the same time, the performance can drop dramatically.

sudosysgen1y ago

The processors that hit this wall have many many cores per memory lane. It's just not realistic for this to be a problem with 2 lanes of DDR5 feeding 4 cores.

These cores cannot process 8 AVX512 instructions at once, in fact they can't do it at all, as it's disabled on consumer Intel chips.

Also, AVX instructions operate on registers, not on memory, so you cannot have more than one register being loaded at once.

If you are running at ~4 instruction per clock, to actually go ahead and saturate 40 bits per clock on 64 bit loads, you'd need 1/6 of instructions to hit main memory (not cache)!

j / k navigate · click thread line to collapse