bloomer 7y ago

No, he's not getting hung up on the term "vector". He is saying that SIMD is fixed-width, just wider, and that this is not equivalent to arbitrary width. For example, SIMD shuffle-type instructions have no arbitrary-length equivalent.

lallysingh 7y ago

Why not? You take the new ordering as an input vector.

dragontamer 7y ago

How would you make an arbitrary shuffle between arbitrary-length vectors faster than an arbitrary gather/scatter to and from L1 cache?

It's possible for a chip to hard-code specialized transforms or to have a many-to-many crossbar when N is small (e.g. 16), but when N is large (e.g. 1024 elements), it's no longer easy to see how to build a high-speed permute operator.

-----------

That's the thing. People only use vpshufb because it's WAY faster than going through L1 cache. If Intel made a faster gather/scatter, there wouldn't be much point to the vpshufb instruction. But vpshufb is so fast because it's so specialized and small. It only has to worry about a 16-byte permute.

In short: we ALREADY have an arbitrary permute instruction. It's called gather/scatter. That's not what programmers want, however. (I mean, programmers want a faster gather/scatter... but vpshufb programmers use that operator only because it's faster than L1 cache.)
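To make the comparison concrete, here is a rough Python model of the two operations. It is only a sketch of the semantics: `pshufb128` follows the documented behavior of the 16-byte byte shuffle (low 4 bits of each selector pick a source byte, a set high bit zeroes the output), while `gather` is a hypothetical stand-in for "one independent load per index" and says nothing about how real hardware implements or prices either one.

```python
from typing import Sequence


def pshufb128(src: bytes, idx: bytes) -> bytes:
    """Model of a 16-byte byte shuffle: idx[i] selects the source byte for out[i]."""
    assert len(src) == 16 and len(idx) == 16
    out = bytearray(16)
    for i, sel in enumerate(idx):
        # A selector with its high bit set zeroes the output byte;
        # otherwise only the low 4 bits are used as the source position.
        out[i] = 0 if sel & 0x80 else src[sel & 0x0F]
    return bytes(out)


def gather(memory: bytes, indices: Sequence[int]) -> bytes:
    """Model of a gather: one independent load per index, any length."""
    return bytes(memory[j] for j in indices)


src = bytes(range(0x10, 0x20))
reverse = bytes(range(15, -1, -1))

# On 16 bytes the two produce the same result; the difference is cost, not meaning.
assert pshufb128(src, reverse) == gather(src, list(reverse))
```

Functionally the shuffle is just a gather restricted to a 16-entry table held in a register, which is why it can be so much cheaper than a general one that has to reach memory.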

lallysingh 7y ago

Specialize for the case of small vectors.

There's a front-end operating cost to the bookkeeping instructions, and a complexity cost to all the variants of the same instructions. For short vectors, the CPU can use the same hardware it uses today for SIMD, just with the SIMD work done in microcode instead of asm. The cost of fetching the permutation vector argument out of L1 isn't terribly high compared to the cost of fetch/decode on the bookkeeping instructions. And the cost of supporting all those instruction variants could be replaced with more functionality in the front-end.

glangdale 7y ago

Seriously. Permutes get harder as you scale. VBMI on CNL (Cannon Lake) is an indicator that 64-way is pretty good, but it's still considerably more expensive than four 16-way permutes on the same architecture.
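A small Python sketch of what the extra cost buys, for anyone who hasn't hit this: four independent 16-way permutes can only move bytes within their own 16-byte lane, while a single 64-way permute can move any byte anywhere. The function names here are made up for illustration and the model is purely functional; it says nothing about latency or throughput.

```python
def permute(src, idx):
    """Full-width permute: any output position can read any input position."""
    return [src[j] for j in idx]


def lanewise_permute(src, idx, lane=16):
    """Independent permutes per lane: indices are taken modulo the lane size,
    so an output can only read from its own lane."""
    out = []
    for base in range(0, len(src), lane):
        chunk = src[base:base + lane]
        out.extend(chunk[j % lane] for j in idx[base:base + lane])
    return out


src = list(range(64))
reverse = list(range(63, -1, -1))

full = permute(src, reverse)            # whole vector reversed
laned = lanewise_permute(src, reverse)  # each 16-byte lane reversed in place

assert full != laned
```

With the same index vector, the 64-way version reverses the whole vector, while the lane-wise version only reverses each 16-element block where it sits. Getting data across lanes is exactly the part that gets expensive as the width grows.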

There's a reason that gather is hard to do; I think if you rocked up and asked the architecture guys for a gather that was competitive with small-scale permute they would reply with the time-honored Intel putdown ("You are overpaid for whatever it is you do").
