
pbsd 12y ago

Oh, I missed that. That makes things trickier, but I think we can still get away with something like

  vmovdqu xmm7, [rdi + rax + 5 - 1]
  vpinsrb xmm7, xmm7, [rdi + rax + 0], 0

without too much of a performance penalty. The offset adjustments can then be folded into the shuffle tables, so there should be no further significant performance loss.
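To make the byte layout concrete, here is a scalar C model of what those two instructions leave in xmm7. It is only a sketch: `load_patched` is a made-up name, `p` stands in for `rdi + rax`, and the buffer is assumed to have at least 20 readable bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model (hypothetical helper, not from the thread) of:
 *     vmovdqu xmm7, [p + 5 - 1]
 *     vpinsrb xmm7, xmm7, [p + 0], 0
 * Assumes p points at 20 or more readable bytes. */
void load_patched(const uint8_t *p, uint8_t out[16])
{
    assert(p != NULL);
    memcpy(out, p + 5 - 1, 16); /* vmovdqu: out[0..15] = p[4..19]  */
    out[0] = p[0];              /* vpinsrb: overwrite lane 0 with p[0] */
}
```

The result holds p[0] in lane 0 followed by p[5..19] in lanes 1..15, so the bytes p[1..4] are skipped with one unaligned load and one byte insert.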



nkurz 12y ago

Yes, I think that should work to guarantee two per vector. I hadn't previously considered trying to do that, and appreciate the suggestion and the sketch. I think I have a slightly faster (7 cycle) approach doing one at a time using a 64-bit register as a lookup for the sum of the middle two fields, but this has good promise, especially if we can get one step farther ahead: instead of having the vector reload on the critical path, the unused portion of the current vector and a preload can be 'slid' into place. Do you know if there is a good way to simulate a PALIGNR but with a non-immediate operand? This might get down to 9-10 cycles for two keys.

pbsd (OP) 12y ago

I have no idea how to simulate a variable PALIGNR on Intel chips without making the loop extremely slow.

On AMD (with XOP), it can be done using VPPERM, which can shuffle from 2 sources. We can do variable alignment like this:

  vpperm xmm0, xmm1, xmm2, [[0..31] + offset]

On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.
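A rough sketch of how the two-shuffle idea could work, written as a portable scalar C model of the PSHUFB byte semantics rather than real SIMD code. The names `pshufb_model` and `var_alignr_model` are made up for illustration, and the final combine uses an OR in place of a blend, which gives the same result here because each shuffle zeroes the lanes that belong to the other source.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFB: a control byte with its high bit set produces 0,
 * otherwise its low 4 bits select a byte of src. */
void pshufb_model(const uint8_t src[16], const uint8_t ctrl[16],
                  uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
}

/* Variable byte alignment (hypothetical helper):
 * out[i] = concat(lo, hi)[i + n], for 0 <= n <= 16. */
void var_alignr_model(const uint8_t lo[16], const uint8_t hi[16],
                      unsigned n, uint8_t out[16])
{
    uint8_t ctrl_lo[16], ctrl_hi[16], a[16], b[16];

    assert(n <= 16);
    for (int i = 0; i < 16; i++) {
        /* i + n + 0x70: high bit becomes set once i + n >= 16,
         * so lanes that belong to hi are zeroed in the lo shuffle. */
        ctrl_lo[i] = (uint8_t)(i + n + 0x70);
        /* i + n - 16 (mod 256): high bit stays set while i + n < 16,
         * so lanes that belong to lo are zeroed in the hi shuffle. */
        ctrl_hi[i] = (uint8_t)(i + n + 0xF0);
    }
    pshufb_model(lo, ctrl_lo, a);
    pshufb_model(hi, ctrl_hi, b);
    for (int i = 0; i < 16; i++)
        out[i] = a[i] | b[i]; /* disjoint lanes, so OR acts as the blend */
}
```

In vector form both control operands would be a broadcast of n added to a constant, so no table load is needed, but the broadcast and the two byte adds still sit in front of the shuffles. Whether that beats a load in this loop is untested.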

nkurz 12y ago

> On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.

I tried for a bit, but haven't figured out how to make that work. PSHUFB needs a different XMM operand for each 'rotate'. Loading this operand would take 6 cycles, and I haven't thought of a clever way of generating it in less.

I do greatly appreciate your help, though. Thanks!
