You can fit twenty-one 3-bit entries in a 64-bit int, with one bit to spare.
So naively you get:
(table[nlz / 21] >> (3 * (nlz % 21))) & 7
Which involves a division and a modulo. And that's assuming 63 entries in total; I'm not even trying to handle fitting the 64th entry into those three leftover bits somehow, or spreading it across the three ints (again, I don't see how to do that without division).

Alternatively, each integer could contain one bit of the output, so:
((table[0] >> nlz) << 2) + ((table[1] >> nlz) << 1) + (table[2] >> nlz)
Which is five shifts and two additions, so more work, plus a lookup. If you see another method I overlooked, please tell me.
> Which is five shifts and two additions, so more work plus lookup.
Something like that, perhaps with more bitmasking and less addition. There is no lookup, just three constants loaded into different registers and handled by different out-of-order execution units, then combined in a final addition or OR. So what you will "see" on the critical path is two bitshifts and three bitwise ORs/ANDs.
It's not faster; the point is that the speedup of "binary division" is probably not worth the day or so spent developing it except in extreme cases.
But yes, extremely unlikely to be a hot path in most situations.
Timing of instructions for Intel can be found here: www.agner.org/optimize/instruction_tables.pdf
Best I can come up with is using 32 8-bit integers (so 64 bits more than 3 times 64 bits in size) and doing this:
(table[nlz >> 1] >> ((nlz & 1) << 2)) & 7
… which is three shifts, two ANDs and one lookup. I'm fairly certain that two shifts and two additions beat that. Well, would beat that if everything else in the code wasn't likely to be a much bigger bottleneck that dwarfs the impact of this.