How would you make an arbitrary shuffle between arbitrary-lengthed vectors faster than an arbitrary gather/scatter to and from L1 Cache?
Its possible to have a chip code specialized transforms or have a many-to-many crossbar when N is small (ex: 16), but when N is large (ex: 1024 elements), its no longer easy to see how to build a high-speed permute operator.
-----------
That's the thing. People only use vpshufb because its WAY faster than L1 cache. If Intel made a faster gather/scatter, there wouldn't be much point to the vpshufb instruction. But the vpshufb instruction is so fast, because its so specialized and small. It only has to worry about a 16-byte permute.
In short: we ALREADY have an arbitrary permute instruction. Its called gather/scatter. That's not what programmers want however. (I mean, programmers want a faster gather/scatter... but... vpshufb programmers use that operator only because its faster than L1 cache)