However, it's now falling out of favor (it's mostly gone from AArch64), apparently due to the relative cost of putting conditional execution on the die versus relying on smarter compilers.
Incidentally, available register read ports are the reason why cmov takes 2 µops on Intel architectures but only 1 on AMD: Intel's µops can only read from two sources, while cmov has 3 (both operands plus the flags).
An A32 instruction could then have up to 4 sources, with a conditional op plus a register-shifted register source.
I wrote a little C program for you instead: [1]
[0] https://github.com/triplefox/three-packer/blob/master/packer...
if(LocalVariable & 0x00001000)
return 1;
else
return 0;
mov eax, [ebp - 10]   ; load LocalVariable
and eax, 0x00001000   ; isolate bit 12
neg eax               ; sets CF iff eax was nonzero
sbb eax, eax          ; eax = eax - eax - CF = -CF (0 or -1)
neg eax               ; 0 -> 0, -1 -> 1
ret
Hmm... wouldn't this be faster? Two instructions less:

mov eax, [ebp - 10]
and eax, 0x00001000
shr eax, 12
ret
Well, who knows. Didn't bother to analyze this case. Maybe the article's example is faster somehow?

You're right that yours should work too, because after the and the value can only have one bit set. But it only works for this particular and mask; the article's sequence works for any mask.
I'm thinking that this combined with some CAS (CMPXCHG8B) could achieve the same, right?
Something like (pseudo):
Comparewith(4)
Ifequalstore(54)
Ifnotequalstore(2)
Return
If you have a lot of data to process, using SSE/AVX is a huge win. Conditional masking and min/max instructions for example.
SIMD is a huge win especially in sorting: you can get a 10-40x speed-up by using a bitonic sorting network.
I suppose these instructions don't cause the pipeline to be flushed the way an incorrectly predicted jump does, but they still stall until all of their inputs (including the flags) are ready.
jmp < setcc/cmov* < branchless conditionals
Actually, I've always wondered how __builtin_expect translates to something the CPU's branch prediction engine can use...
There are ways of doing it in hardware, I remember a supervisor discussing it with respect to MIPS. I also remember them saying they went through the entire code generation stage of GCC and found that every single point at which GCC would try to use it was somewhere where it would be actively unhelpful.
But its code generation is better for a 99%/1% case than for a 60%/40% one, because Intel no longer listens to branch hints and doesn't really give advice on how to tune for them.
As an example of another (arguably more sane) architecture: on Alpha, all branch instructions (including unconditional ones) have bits reserved in their encoding for hints to the branch predictor.
Branch prediction on a search in a hash table will fail about 50% of the time, however it's done. Branch prediction on a long for loop was already more than 99% correct by the end of the '90s.
Intel claims their prediction algorithms are over 96% correct on an average program, whatever that beast is. (To be fair, you'll find a definition for it in their papers. That's a perfectly legit claim; it just does not mean what you think it means.)
Getting rid of branching in the unimportant parts of your program is good too; it saves space in the predictor's tables for the branches that matter.
This is the kind of thing where it's not worth doing by hand, but I don't think compilers really do it either…