There is also VPERMI2B [0] from AVX-512 VBMI, which in its 512-bit form operates on a 128-byte LUT held in two 64-byte registers.
[0] https://en.wikichip.org/wiki/x86/avx512_vbmi
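
To make the 128-byte LUT concrete, here is a rough scalar model of what the 512-bit form does per byte lane, as I understand it. The function name `vpermi2b_model` and the parameter names are made up for illustration. It is plain C rather than the actual instruction, so it runs anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of 512-bit VPERMI2B semantics (assumed ZMM operands).
 * table_lo and table_hi stand in for the two 64-byte table registers,
 * which together act as one 128-byte LUT. For each index byte:
 *   bits 5:0 select the byte within a table,
 *   bit 6 selects the table (0 = table_lo, 1 = table_hi),
 *   bit 7 is ignored.
 * In the real instruction the index register is also the destination;
 * here the output is a separate array for clarity. */
static void vpermi2b_model(uint8_t out[64],
                           const uint8_t idx[64],
                           const uint8_t table_lo[64],
                           const uint8_t table_hi[64])
{
    for (size_t i = 0; i < 64; i++) {
        uint8_t sel = idx[i] & 0x7F;            /* 7-bit index into the LUT */
        out[i] = (sel & 0x40) ? table_hi[sel & 0x3F]
                              : table_lo[sel & 0x3F];
    }
}
```

With intrinsics the same operation should be reachable through `_mm512_permutex2var_epi8(a, idx, b)`, given a CPU and compiler flags that support AVX-512 VBMI.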