> BTW, I haven't tried to implement a hash function in a while
Multiply RAX, CONST1 / bswap RAX / XOR RAX, CONST2 / Multiply RAX, CONST3.
That's 12 cycles of latency. CONST1 and CONST3 must be odd (bottom bit is 1), so that each multiply is invertible mod 2^64 and no input bits are lost. Pick CONST1, CONST2, and CONST3 out of /dev/urandom.
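A rough C sketch of that multiply / bswap / xor / multiply sequence, in case it helps. The function names and the three constant values are placeholders I made up for illustration (the real ones should come from /dev/urandom, with CONST1 and CONST3 forced odd). The byte swap is written out portably; a decent compiler should turn it into a single bswap.

```c
#include <stdint.h>

/* Placeholder constants, NOT recommended values: draw your own from
   /dev/urandom. CONST1 and CONST3 must be odd. */
static const uint64_t CONST1 = 0x9E3779B97F4A7C15ULL; /* odd */
static const uint64_t CONST2 = 0xD6E8FEB86659FD93ULL; /* any value */
static const uint64_t CONST3 = 0xBF58476D1CE4E5B9ULL; /* odd */

/* Portable 64-bit byte reversal (what the bswap instruction does). */
static inline uint64_t bswap64(uint64_t x) {
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* multiply / bswap / xor / multiply, all mod 2^64. */
static inline uint64_t mix64(uint64_t x) {
    x *= CONST1;     /* pushes low input bits up toward the high bits */
    x = bswap64(x);  /* brings the well-mixed high bytes down low */
    x ^= CONST2;
    x *= CONST3;     /* second multiply mixes again */
    return x;
}
```

Every step is a bijection on 64-bit values, so two different inputs never collide before truncation.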
--------
BTW: This is exactly why latency didn't matter: the 12 cycles of latency here are basically independent between loop iterations. The next iteration cuts the dependency on RAX, so register renaming gives that iteration's "RAX" a fresh physical register and it executes independently of the previous one.
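To make the independence point concrete, here is a hedged C sketch of a loop that hashes an array of keys. The names (`swap_bytes`, `mix_one`, `mix_many`) and the constants are hypothetical placeholders of mine. The thing to notice is that `out[i]` depends only on `in[i]`, so there is no chain from one iteration to the next and an out-of-order core can have several of these hashes in flight at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable 64-bit byte reversal (compiles down to bswap on x86-64). */
static inline uint64_t swap_bytes(uint64_t x) {
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Same multiply / bswap / xor / multiply shape; the constants are
   placeholders (the two multipliers are odd), not recommended values. */
static inline uint64_t mix_one(uint64_t x) {
    x *= 0x9E3779B97F4A7C15ULL;
    x = swap_bytes(x);
    x ^= 0xD6E8FEB86659FD93ULL;
    x *= 0xBF58476D1CE4E5B9ULL;
    return x;
}

/* Hash n keys. No value is carried from iteration i to iteration i+1,
   so the per-hash latency overlaps across iterations and throughput
   is what limits the loop. */
static void mix_many(const uint64_t *in, uint64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = mix_one(in[i]);
    }
}
```

If each iteration instead fed its result into the next one (say, `h = mix_one(h ^ in[i])`), the full latency would sit on the critical path of every iteration.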
--------
AESEnc is a good baseline, but you need 2 or 3 iterations of it to work well. AESEnc also works on 128-bit vector registers, but most people want something that works on the 64-bit registers.
If your data is already in XMM registers, AESEnc / AESDec will be great. Otherwise, a 64-bit multiply is really good at shuffling those bits around. Then take RAX (64-bit result), EAX (32-bit result), AX (16-bit result), or AL (8-bit result) as needed.