
vardump · 6y ago

> I think memory-layout is the #1 issue these days.

Absolutely agree. Cache lines should be packed with data that is useful together. Memory streaming access patterns should be favored.
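To make the cache-line point concrete, here is a minimal sketch in C comparing an array-of-structs layout with a struct-of-arrays layout for a loop that only needs one field. The `ParticleAoS` type, its fields, and both function names are hypothetical, made up for illustration; the 32-byte struct size assumes 4-byte `float` and `int`.

```c
#include <stddef.h>

/* Array-of-structs (hypothetical type): the hot field x shares each
   32-byte record with fields the loop never reads, so most of every
   fetched cache line is wasted on this access pattern. */
typedef struct {
    float x;        /* hot: the only field the loops below read */
    float y, z;     /* cold for this loop */
    float mass;     /* cold */
    int   id;       /* cold */
    int   flags;    /* cold */
    float pad[2];   /* cold; rounds the record up to 32 bytes */
} ParticleAoS;

/* Strided access: one useful float per 32 bytes touched. */
float sum_x_aos(const ParticleAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Struct-of-arrays: the x values are contiguous, so every byte of a
   fetched cache line is useful and the access is a linear stream. */
float sum_x_soa(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

Both functions compute the same sum; the difference is only in how many bytes have to move through the cache hierarchy to get there, which is what matters once the loop is memory bound.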

> CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter"...

Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high — talking about microseconds. Of course GPUs mask the latency with a ton of hardware threads. CPUs are memory access latency kings by far. GPUs do have amazing memory controllers when you have gather/scatter access patterns, as long as high latency is acceptable.

> Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement.

Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for the memory. Or in other words, memory stalls.


dragontamer · 6y ago

> Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high — talking about microseconds. Of course GPUs mask the latency with a ton of hardware threads. CPUs are memory access latency kings by far. GPUs do have amazing memory controllers when you have gather/scatter access patterns, as long as high latency is acceptable.

Oh, I mean gather/scatter to shared/local memory. General-purpose gather/scatter is very high latency, as you say (I think reads/writes were something like 500 nanoseconds to L1 cache, and far slower to L2 and VRAM), but gather/scatter to shared/local memory is basically limited by bank conflicts (~32 cycles worst case, ~2 cycles best case).
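As a rough illustration of why bank conflicts set the cost, here is a small host-side model in C (not GPU code). It assumes 32 banks of 4-byte words and 32 lanes per access, with lanes that hit the same word served by a single broadcast; those parameters and the function names are assumptions for the sketch, and real hardware differs in the details. The result is a serialization factor, not a cycle count.

```c
#include <stdint.h>

#define NUM_BANKS 32   /* assumption: 32 banks, one 4-byte word wide */
#define NUM_LANES 32   /* assumption: 32 lanes issue one access together */

/* Number of serialized passes needed for one access: the largest count of
   distinct words that map to the same bank. Lanes reading the same word
   are counted once (modelled as a broadcast). */
int conflict_degree(const uint32_t word_index[NUM_LANES]) {
    int worst = 1;
    for (uint32_t bank = 0; bank < NUM_BANKS; bank++) {
        uint32_t seen[NUM_LANES];
        int distinct = 0;
        for (int lane = 0; lane < NUM_LANES; lane++) {
            if (word_index[lane] % NUM_BANKS != bank)
                continue;
            int dup = 0;
            for (int k = 0; k < distinct; k++) {
                if (seen[k] == word_index[lane]) { dup = 1; break; }
            }
            if (!dup)
                seen[distinct++] = word_index[lane];
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}

/* Convenience wrapper: lane i accesses word i * stride. */
int strided_conflict_degree(uint32_t stride) {
    uint32_t idx[NUM_LANES];
    for (int lane = 0; lane < NUM_LANES; lane++)
        idx[lane] = (uint32_t)lane * stride;
    return conflict_degree(idx);
}
```

Under this model a stride of 1 is conflict-free, a stride of 32 puts every lane in the same bank and serializes 32 ways, and a stride of 33 is conflict-free again, which is why padding a shared array by one word is a common fix.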

I'm pretty sure AVX512 gather/scatter to L1 cache is still dozens of cycles for just 16 SIMD-lanes.

> Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for the memory. Or in other words, memory stalls.

Fair point. I presume you mean that you can shuffle data to L1 cache while waiting for L3 or DDR4 RAM instead.

What I really want is "shared memory" to be implemented on CPUs, and for AVX-lanes to be able to shuffle data to and from there independently of the L1 / L2 / L3 / DDR4 memory system.

vardump (OP) · 6y ago

> I'm pretty sure AVX512 gather/scatter to L1 cache is still dozens of cycles for just 16 SIMD-lanes.

Yeah, last I checked, it performed like scalar loads and stores. I presume Intel intends to eventually optimize for buffered/L1 hit cases. I mean, why would those instructions even exist otherwise?
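For reference, "performs like scalar loads" means the work is roughly equivalent to the loop below: a minimal C sketch of what a 16-lane masked gather computes, written as one independent load per lane. The function name and signature are hypothetical; this shows the semantics only and says nothing about how any particular CPU implements the instruction.

```c
#include <stdint.h>

#define LANES 16   /* 16 x 32-bit lanes in a 512-bit register */

/* Scalar equivalent of a masked gather: for each lane whose mask bit is
   set, load base[idx[lane]] into dst[lane]; other lanes keep their value. */
void gather16_scalar(float dst[LANES], const float *base,
                     const int32_t idx[LANES], uint16_t mask) {
    for (int lane = 0; lane < LANES; lane++) {
        if (mask & (1u << lane))
            dst[lane] = base[idx[lane]];
    }
}
```

If the hardware gather costs about as much as these 16 separate loads, the instruction saves code size and register shuffling but not memory-access time, even when every index hits L1.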
