Const-me · 7y ago

> while VPGATHERDD ZMM-register is 9-clock cycles.

Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.

The reason why it’s OK on GPU is massive parallelism hiding latency, i.e. the hardware runs other threads instead of just waiting for data to arrive. But even on CPUs, if the requested data is not too far away (in L1 or L2), hyperthreading in most modern CPUs does an acceptable job in this situation.

> Intel CPUs do NOT have this feature, outside of vpshufb

Yes they have. If you prefer assembly, look at the vpermps instruction. It can permute lanes arbitrarily, even across 128-bit lanes (unlike vpshufb/vshufps/etc.), with the permutation indices taken from another vector register. It's quite fast: on Haswell/Broadwell/Skylake it has 3 cycles of latency and 1 cycle throughput, and it is a single micro-op.
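To make the "gather over register lanes" behaviour concrete, here is a minimal scalar sketch of what an 8-lane cross-lane permute such as vpermps computes. It models the semantics only and uses no intrinsics; the function name `permute_model` is made up for illustration.

```cpp
#include <array>
#include <cstdint>

// Scalar model of an 8 x 32-bit cross-lane permute (vpermps-style).
// Output lane i READS the source lane named by idx[i], so this is a
// gather over register lanes. Only the low 3 bits of each index are used.
std::array<float, 8> permute_model(const std::array<float, 8>& src,
                                   const std::array<std::uint32_t, 8>& idx) {
    std::array<float, 8> out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = src[idx[i] & 7u];
    }
    return out;
}
```

Because each output lane chooses its own source, two output lanes may read the same source lane, and any lane may read from either 128-bit half.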

dragontamer · 7y ago

> > Intel CPUs do NOT have this feature, outside of vpshufb

> Yes they have.

No they don't, but it's very tricky to see why. You're blinded by vpshufb, and can't see how it can fail to solve some problems.

Let's take stream compaction as an example problem.

http://www.cse.chalmers.se/%7Euffe/streamcompaction.pdf

Stream Compaction can be used to remove redundant whitespace from strings, or to "compress" the raytracer rays so that they are all able to be read by a simple load (as opposed to a gather/scatter operation). How do you perform stream compaction?

Well, it's a straightforward scatter operation: https://i.imgur.com/aIoO8dm.png

Now, how do you do this using vpshufb? You can't. vpshufb is backwards, and won't efficiently solve this stream compaction problem. GPUs can solve this problem very efficiently, but Intel's current implementation of AVX512 is missing the "backwards" vpshufb command, to perform this operation.

Or as I've been trying to say: vpshufb is equivalent to a "gather" over SIMD Registers. But Intel is MISSING a scatter over SIMD Registers. The instruction is just... not there. I've looked for it, and it doesn't exist. As such, GPU-code (such as the stream compaction algorithm) CANNOT be implemented efficiently on a CPU.

I mean, you can use vpscatterdd to implement it, but again... vpscatterdd is 17-clock cycles. That's way too slow.
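For reference, the scatter formulation of stream compaction discussed above can be sketched in scalar form as follows. This is an illustrative model, not code from the linked paper; `compact_by_scatter` is a hypothetical name, and a real GPU version would compute the prefix sum in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter-style stream compaction. Each kept element computes its own
// destination slot (an exclusive prefix sum over the keep flags) and
// WRITES itself there. The write index depends on the data, which is
// what makes this a scatter rather than a gather.
std::vector<int> compact_by_scatter(const std::vector<int>& in,
                                    const std::vector<bool>& keep) {
    assert(in.size() == keep.size());
    std::vector<std::size_t> dst(in.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        dst[i] = count;            // exclusive prefix sum of keep[0..i)
        if (keep[i]) ++count;
    }
    std::vector<int> out(count);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (keep[i]) out[dst[i]] = in[i];   // the scatter step
    }
    return out;
}
```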

> Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.

The L1 cache has the bandwidth to do it, it just isn't wired up correctly for this mechanism. Intel's load/store units can read/write 32-bytes at a time, in parallel across 2xload units + 1x store unit.

But the thing is: reading or writing many small 4-byte "chunks" here and there (i.e. a vpgatherdd / vpscatterdd operation) does not touch one contiguous group of 32 bytes. Therefore, Intel's cores lose a LOT of bandwidth in this case.

GPUs, on the other hand, have a great many load/store units. Effectively one per SIMD unit. As such, reading / writing to LDS "shared" memory on AMD GPUs can be done 32-at-a-time.

So the equivalent "vpgatherdd" over LDS cache will execute in something like 2-clock ticks on AMD GPUs (assuming no bank conflicts), while it'd take 9-clock ticks on Intel cores.

Again, LDS cache is so fast on GPUs that any LDS load/store is effectively as fast as a vpshufb instruction. (Not quite: vpshufb is 1-clock tick, and doesn't have to worry about bank conflicts. So vpshufb is still faster... but GPUs' gather/scatter capabilities are downright incredible)
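The "assuming no bank conflicts" caveat can be illustrated with a simplified model. The sketch below assumes 32 banks of 4 bytes each; those numbers and the name `worst_bank_load` are illustrative assumptions rather than figures from this thread, and the model ignores hardware details such as broadcast of identical addresses.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// ASSUMED layout for illustration only: 32 banks, each 4 bytes wide.
constexpr std::uint32_t kBanks = 32;
constexpr std::uint32_t kBankWidthBytes = 4;

// Returns the largest number of lanes that map to the same bank.
// 1 means conflict-free; N suggests roughly N serialized passes in
// this simplified model. Returns 0 for an empty access list.
std::uint32_t worst_bank_load(const std::vector<std::uint32_t>& byte_addrs) {
    std::array<std::uint32_t, kBanks> hits{};
    for (std::uint32_t addr : byte_addrs) {
        ++hits[(addr / kBankWidthBytes) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, 32 lanes reading consecutive 4-byte words land in 32 different banks, while 32 lanes reading with a 128-byte stride all land in the same bank.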

How long do you think it will be before Intel implements a true crossbar so that the vpgatherdd and vpscatterdd instructions can actually execute quickly on a CPU?

-----------

GPUs actually implement a richer language for data-movement. Intel could very easily fix this problem by writing a "backwards vpshufb" instruction, but I'm not aware of anything that exists like that... or any plans to implement something like that.

Const-me (OP) · 7y ago

> Now, how do you do this using vpshufb? You can't.

You keep mentioning vpshufb even though it can't move data across 128-bit lanes.

Here's how to do that with vpermps, there's some overhead but not much, very likely much faster than RAM access even when in L1: https://stackoverflow.com/a/36951611/126995
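The general idea of compacting with a permute is to turn the keep-mask into gather indices first, then apply one register permute. Below is a scalar sketch of that idea under stated assumptions: it is not the code from the linked answer, `Packed8` and `left_pack_model` are hypothetical names, and a real implementation would compute the indices with a lookup table or bit-manipulation instructions rather than a loop.

```cpp
#include <array>
#include <cstdint>

// Result of packing: the kept values sit in lanes [0, count);
// the remaining lanes are don't-care in this model.
struct Packed8 {
    std::array<float, 8> lanes;
    int count;
};

// Left-pack the lanes of src whose bit is set in mask (bit i = lane i).
Packed8 left_pack_model(const std::array<float, 8>& src, std::uint8_t mask) {
    // Step 1: mask -> gather indices. idx[j] is the source lane of the
    // j-th kept element. This inverts the scatter permutation.
    std::array<std::uint32_t, 8> idx{};
    int count = 0;
    for (std::uint32_t i = 0; i < 8; ++i) {
        if (mask & (1u << i)) idx[count++] = i;
    }
    // Step 2: one permute (gather over lanes), as vpermps would do.
    Packed8 result{};
    for (int j = 0; j < 8; ++j) {
        result.lanes[j] = src[idx[j] & 7u];
    }
    result.count = count;
    return result;
}
```

The point of the sketch is that a data-dependent scatter within a register can be rewritten as a gather once the indices are inverted, at the cost of the extra index computation.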

Besides, new CPUs have AVX-512, which has the vcompressps instruction for exactly that use case.

> The L1 cache has the bandwidth to do it

It has bandwidth, but there's extra latency involved. Registers are faster.

It's the same on GPU, that's why nVidia recommends the CUDA lane-permute intrinsics over scatter/gather RAM access.

Here's a recent article about people actually using these permute intrinsics achieving quite good results: https://news.ycombinator.com/item?id=19018240

dragontamer · 7y ago

Hmm, I'll have to study AVX512 more then.

Thanks for the discussion!
