However, it's now falling out of favor (it's mostly gone from AArch64), apparently due to the relative cost of putting conditional execution on the die versus relying on smarter compilers.
Incidentally, available register read ports are the reason why cmov takes 2 µops on Intel architectures but only 1 on AMD: Intel's µops can only read from two sources, while cmov has 3 (both operands plus the flags).
An A32 instruction could then have up to 4 sources, with a conditional op plus a register-shifted register source.
I wrote a little C program for you instead: [1]
[0] https://github.com/triplefox/three-packer/blob/master/packer...
if(LocalVariable & 0x00001000)
return 1;
else
return 0;
mov eax, [ebp - 10]   ; load LocalVariable
and eax, 0x00001000   ; isolate bit 12
neg eax               ; sets CF iff eax was nonzero
sbb eax, eax          ; eax = eax - eax - CF = -CF (0 or -1)
neg eax               ; 0 -> 0, -1 -> 1
ret
Hmm... wouldn't this be faster? Two instructions less:

mov eax, [ebp - 10]
and eax, 0x00001000
shr eax, 12
ret
Well, who knows. Didn't bother to analyze this case. Maybe the article's example is faster somehow?

You're right that yours should work too, because after the and the value can only have one bit set. But it only works for this particular and mask; the article's sequence works for any mask.
I'm thinking that this combined with some CAS (CMPXCHG8B) could achieve the same, right?
Something like (pseudo):
Comparewith(4)
Ifequalstore(54)
Ifnotequalstore(2)
Return
If you have a lot of data to process, using SSE/AVX is a huge win. Conditional masking and min/max instructions for example.
SIMD is a huge win especially in sorting: you can get a 10-40x speed-up by using a bitonic sorting network.
I suppose these instructions don't cause the pipeline to be flushed the way an incorrectly predicted jump does, but they still stall until all of their inputs (including the flags) are ready.
jmp < setcc/cmov* < branchless conditionals
Actually, I've always wondered how __builtin_expect translates to something the CPU's branch prediction engine can use...
There are ways of doing it in hardware, I remember a supervisor discussing it with respect to MIPS. I also remember them saying they went through the entire code generation stage of GCC and found that every single point at which GCC would try to use it was somewhere where it would be actively unhelpful.
But its code generation is better for a 99%/1% case than for a 60%/40% one, because Intel no longer listens to branch hints and doesn't really give advice on how to tune for them.
As an example of another (arguably more sane) architecture: on Alpha, all branch instructions (including unconditional ones) have bits reserved in their encoding for hints to the branch predictor.
Branch prediction on a search in a hash table will fail about 50% of the time, however it's done. Branch prediction on a long for loop was already more than 99% correct by the end of the '90s.
Intel claims their prediction algorithms are over 96% correct on an average program, whatever that beast is. (To be fair, you'll find a definition for it in their papers. That's a perfectly legit claim; it just does not mean what you think it means.)
Getting rid of branching in the unimportant parts of your program is good too; it saves space in the predictor's tables for the branches that matter.
This is the kind of thing where it's not worth doing by hand, but I don't think compilers really do it either…