Reversing Bits in C

(corner.squareup.com)

216 points · pvilchez · 12y ago · 89 comments

57 comments · 19 top-level

applecore · 12y ago · 8 in thread

Interesting. What's the purpose of reversing the bits in a byte?

RodgerTheGreat · 12y ago

A common approach for performing a Fast Fourier Transform involves reversing the bits in time-domain samples.

stephencanon · 12y ago

Expanding slightly, there’s a permutation that needs to happen in order to efficiently perform a DFT in-place (and the same approach is often used even when the transform is out-of-place). For power of two sizes (one of the most common cases), that permutation is precisely the same as a bit reversal of the indices.
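To make the permutation concrete, here is a minimal sketch in C, assuming a power-of-two length passed as its base-2 logarithm. The names `reverse_index` and `bit_reverse_permute` are made up for illustration and are not taken from any particular FFT library.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the low `bits` bits of index i. */
static uint32_t reverse_index(uint32_t i, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned k = 0; k < bits; k++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

/* Permute data[0 .. 2^log2n - 1] in place into bit-reversed index order. */
static void bit_reverse_permute(float *data, unsigned log2n)
{
    size_t n = (size_t)1 << log2n;
    for (size_t i = 0; i < n; i++) {
        size_t j = reverse_index((uint32_t)i, log2n);
        if (i < j) {            /* swap each pair exactly once */
            float t = data[i];
            data[i] = data[j];
            data[j] = t;
        }
    }
}
```

For eight samples (log2n = 3), index 1 (001) trades places with index 4 (100) and index 3 (011) with index 6 (110), while 0, 2, 5 and 7 stay put.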

ajb · 12y ago

Reversing the bits in the index to the samples :-)

markrages · 12y ago

Besides the FFT addressing, I have encountered SPI devices that wanted to be addressed in LSb-first manner. Some micros let you choose which end goes out first, some don't. In the latter case you find yourself swapping bits in software.

groby_b · 12y ago

My guess would be FFT. IIRC, a fast implementation of that requires lots of bit reversal. My memory is rusty, though :)

quasive · 12y ago

Passwords for the RFB protocol (VNC) are used to DES-encrypt a challenge token sent by the server. However, each byte in the key is reversed before it is used for encryption.

Of course, this happens only once per connection so there's generally not much of a need for it to be particularly fast.

kbojody · 12y ago

Endian is probably the most common.

unoti · 12y ago

Endian-ness would be reversing bytes, not bits within bytes, like 0x1234 -> 0x3412. What we're talking about here would be more along the lines of: 0b00010001 -> 0b10001000

The most obvious application I can think of for reversing bits within a byte would be for image processing applications, such as mirroring an image horizontally, or making kaleidoscopes. There are probably signal processing applications, too...
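The distinction can be sketched in a few lines of C. Both helper names below are hypothetical, and the bit reversal is the plain loop rather than anything tuned for speed.

```c
#include <stdint.h>

/* Endianness change: reorder the bytes, leave the bits inside each byte alone. */
static uint16_t swap_bytes16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Bit reversal within one byte: bit 0 becomes bit 7, bit 1 becomes bit 6, ... */
static uint8_t reverse_bits8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}
```

With these, `swap_bytes16(0x1234)` gives 0x3412, while `reverse_bits8(0x11)` gives 0x88.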

jandrewrogers · 12y ago · 7 in thread

This article overlooks a major factor in bit-twiddling performance on modern CPUs: saturation of the execution ports in a CPU core.

An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations every clock cycle mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may be sitting idle for a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.)

Consequently, algorithms with a few extra operations but more operation parallelism will frequently be faster than an equivalent algorithm where the operations are necessarily serialized in the CPU.

Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as splitting the work across two accumulator variables and combining them at the end where any normal programmer would just use one variable, can have a large impact on performance. I once doubled the performance of a bit-twiddling algorithm simply by taking the algorithm and using three variables instead of one. The algorithm was identical but the use of three registers exposed the available parallelism to the CPU.
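A rough sketch of the accumulator idea in C, using a plain sum as a stand-in since the original algorithm is not shown. The function names are invented for illustration, and whether the second form is actually faster depends on the compiler and the CPU, so it needs measuring.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition depends on the result of the previous one. */
static uint64_t sum_one(const uint64_t *v, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Three accumulators: three independent dependency chains, merged at the end. */
static uint64_t sum_three(const uint64_t *v, size_t n)
{
    uint64_t a = 0, b = 0, c = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 3 */
        a += v[i];
    return a + b + c;
}
```

Unsigned addition wraps and is associative, so both functions return the same value for any input; only the shape of the dependency chain differs.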

stephencanon · 12y ago

This is an excellent point; however, there are a few things to keep in mind. First, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating-point, since we’re talking about bit twiddling).

Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper-threading will take advantage of them too.)

Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.

All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.

[1] http://www.anandtech.com/show/6355/intels-haswell-architectu...

jandrewrogers · 12y ago

Both GCC and Clang are surprisingly mediocre at this kind of optimization. I write a lot of extreme performance integer algorithms and those compilers only seem to find "obvious" parallel instruction schedules about half the time even in isolated contexts.

Fortunately, it is pretty simple to induce the desired optimization from the C code without resorting to much cleverness. The compilers miss these optimizations often enough that I frequently double check if I care. Still, it requires fairly detailed knowledge of the microarchitecture.

I do not do microarchitecture optimization work very often. The last time I did, it was to design a faster, better hash function to replace Google's CityHash (and the result was faster and stronger). For most codes, memory behaviors dominate with respect to performance.

nitrogen · 12y ago

For people who know what they're doing (e.g. DarkShikari), the compiler doesn't stand a chance: http://www.scribd.com/doc/137419114/Introduction-to-AVX2-opt... (via https://news.ycombinator.com/item?id=5598010)

P.S. Scribd stinks. The important numbers are on the hidden second and third pages. Do people know that Scribd is doing this to their documents? That document is CC-BY-NC -- charging money to read pages 2 and 3 or download the original is not NC.

Found the original here: http://mailman.videolan.org/pipermail/x264-devel/attachments...

onan_barbarian · 12y ago

Absolutely true.

Note that micro-fusion can allow you to push the number of uops retired per cycle to 5-6 (SNB would be 3 ALU, 2 loads, Haswell could be 4 ALU, 2 loads), as the issue/retire rates are on the fused domain.

It's a far cry from a RISC-like load/store machine.

All that being said, micro-fusion only works with an ALU op and a load from the same instruction, and you obviously have to be careful to ensure that the load applies to the appropriate operand (tricky given the non-orthogonal, frequently 2-address form of the instructions).

carterschonwald · 12y ago

Very very good points! Relatedly: for any performance sensitive code, reading the relevant version of the Intel Optimization manual + a book like Hacker's Delight will lead to a lot of good understanding of these tricks.

(admission: i'm spending a lot of my time staring at ways to make it really really easy to write fast numerical codes, so thinking about the ports on modern CPUs is very very helpful)

grn · 12y ago

Could you recommend some good reading (a book preferably) for learning about CPU architecture? I'm aware of the Intel Software Developer's Manual. What other resources are worth reading?

comatose_kid · 12y ago

Computer Architecture: A Quantitative Approach by Hennessy and Patterson is the definitive text, and it is good (at least the 3rd edition I read back in '02).

One way to put the ideas you learn into practice is to try to write an optimized large NxN matrix multiplication routine. Start in C, then convert the kernel to x86 assembly. Also disassemble the compiled C kernel to see what the compiler is doing. See how close you can get to the theoretical peak CPU performance.

It is fun stuff. While this kind of optimization is rarely needed, doing so (like learning lisp) will make you a better programmer.

robomartin · 12y ago · 5 in thread

If you've ever dealt with graphics file manipulation code chances are you've suffered the pain of changing the endian-ness of an image file. I never understood why some of these operations are not implemented as machine instructions that can run in one instruction cycle flat. There's nothing to them; I've done exactly that on FPGAs. Yes, they can be a little resource/routing intensive but not that bad.

picomancer · 12y ago

> changing the endian-ness

x86 has had the BSWAP instruction since the 486.

gcc has __builtin_bswap16, __builtin_bswap32, and __builtin_bswap64, which will presumably take advantage of these built-in instructions on x86 and any other gcc-supported architectures where similar instructions exist (and fall back to a reasonably fast and well-tested multi-instruction implementation where they don't).

You should really RTFM every couple years, just to know what your processor [1] and compiler [2] can do.

[1] http://www.intel.com/content/www/us/en/processors/architectu...

[2] http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
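A small sketch of how the builtin might be wrapped, assuming GCC or Clang for the builtin path. The wrapper name `bswap32` and the fallback `bswap32_portable` are hypothetical; only `__builtin_bswap32` itself comes from the compiler.

```c
#include <stdint.h>

/* Portable fallback: reorder the four bytes with shifts and masks. */
static uint32_t bswap32_portable(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Use the compiler builtin where it exists, otherwise the fallback above. */
static uint32_t bswap32(uint32_t x)
{
#if defined(__GNUC__)
    return __builtin_bswap32(x);
#else
    return bswap32_portable(x);
#endif
}
```

Either path turns 0x12345678 into 0x78563412. Note this swaps bytes, not the bits inside them.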

robomartin · 12y ago

Oh, I RTFM. Not always working on Intel platforms though. And still:

http://hardwarebug.org/2010/01/14/beware-the-builtins/

mrb · 12y ago

Note that this post is about changing the order of bits within a byte (e.g. the x86-64 XOP instruction VPPERM with its bit-reversal option), whereas you are talking about changing the order of bytes within a word (the x86 instruction BSWAP).

zhemao · 12y ago

Wait, why is it resource intensive? If all you need to do is reverse a fixed-size integer, wouldn't you just wire the inputs to the outputs backwards?

robomartin · 12y ago

Depending on timing requirements, device type, operating speed and word width you have to add one or more layers of flip-flops to facilitate timing closure and avoid potential metastability issues.

KaeseEs · 12y ago · 4 in thread

Great analysis, although I'm curious how the idea of doing a bunch of 64-bit ops to accomplish byte arithmetic came about to begin with. Was the function in question not written by a firmware guy?

zwieback · 12y ago

I think this one goes back to PDP days and wasn't necessarily written to be the fastest possible implementation. The PDP could do 36*36 multiply into 72 bits. Not sure how the modulo instruction performed but there was a DIV instruction.

binarymax · 12y ago

Down the rabbit hole says this came from HAKMEM No. 239 in 1972!

http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167
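For reference, the widely circulated 64-bit form of this multiply-and-modulus trick (the one listed on the Stanford Bit Twiddling Hacks page, and assumed here to be the one the article examines) can be checked against a plain loop. The function names are made up for this sketch.

```c
#include <stdint.h>

/* Multiply spreads copies of the byte, the mask picks one bit from each copy,
 * and the modulus by 1023 (2^10 - 1) folds the selected bits back into 8 bits. */
static uint8_t reverse_mulmod(uint8_t b)
{
    return (uint8_t)(((b * 0x0202020202ULL) & 0x010884422010ULL) % 1023);
}

/* Reference: move one bit per iteration. */
static uint8_t reverse_loop(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}
```

The two agree for all 256 byte values, which is cheap to verify exhaustively.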

groby_b · 12y ago

Or the programmer was a firmware hacker, and she knew that RBIT is ARMv6T2, which IIRC wasn't available until the iPhone 3GS. (Not 100% certain, and I don't have my manuals handy)

stephencanon · 12y ago

... in which case she would have used a lookup table, unless she was a really old-school firmware hacker and still believed that you couldn’t justify 256B for the table. (FWIW, you’re right about ARMv6T2).
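The 256-byte table can be sketched as follows. This version fills the table at startup rather than hard-coding it, and the names `rev_table`, `rev_table_init` and `reverse_lut` are invented for illustration.

```c
#include <stdint.h>

static uint8_t rev_table[256];   /* 256 bytes: one entry per possible byte value */

/* Fill the table once, using the slow one-bit-at-a-time loop. */
static void rev_table_init(void)
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned r = 0, v = i;
        for (int k = 0; k < 8; k++) {
            r = (r << 1) | (v & 1u);
            v >>= 1;
        }
        rev_table[i] = (uint8_t)r;
    }
}

/* After initialisation, a reversal is a single indexed load. */
static inline uint8_t reverse_lut(uint8_t b)
{
    return rev_table[b];
}
```

On firmware targets the table would more likely be a `const` array placed in flash, so it costs ROM rather than RAM.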

rainforest · 12y ago · 2 in thread

The multiplication trick reminds me of this StackOverflow answer[1] where an SMT solver (z3) is used to derive a mask and multiplier to extract chosen bits from a byte.

[1] : http://stackoverflow.com/questions/14547087/extracting-bits-...

nkurz · 12y ago

That's really interesting, and an approach to such problems that I'd never considered. I was excited that a "Code generator for bit permutations" (http://programming.sirrida.de/calcperm.php) exists, but using a theorem prover is really another level of possibility. Now I need to figure out how to apply it to the problem I'm currently thinking about: http://stackoverflow.com/questions/17880178/how-do-i-sum-the...

pbsd · 12y ago

We can use the exact same approach used in the bit reversal trick of the article:

  ((x * 0x01010101) & 0xC0300C03) % 1023

This is probably not gonna be faster than the naive approach, though.
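Reading the constants, the expression appears to add up the four 2-bit fields of a byte: the multiply copies the byte into all four byte lanes, the mask keeps one field per lane at bit offsets 0, 10, 20 and 30, and the modulus by 1023 sums those base-1024 digits. A sketch that checks this reading against the naive version, with invented function names and an explicit unsigned 32-bit multiply to avoid signed overflow in C:

```c
#include <stdint.h>

/* The expression above, with the multiply done in uint32_t. */
static unsigned sum_fields_mulmod(uint8_t x)
{
    return (((uint32_t)x * 0x01010101u) & 0xC0300C03u) % 1023u;
}

/* Naive reference: extract and add the four 2-bit fields one at a time. */
static unsigned sum_fields_naive(uint8_t x)
{
    return (x & 3u) + ((x >> 2) & 3u) + ((x >> 4) & 3u) + ((x >> 6) & 3u);
}
```

The largest possible sum is 12, well below 1023, so the modulus never wraps.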

cnvogel · 12y ago · 2 in thread

Interestingly, while x86-64 does not seem to have a single opcode for reversing bits in a byte, it has an instruction to arbitrarily shuffle around the 16 bytes in a 128-bit SSE register [PSHUFB]. It just blows my mind how much data those SIMD instructions process or move around in relatively few clock cycles.

http://stackoverflow.com/a/9040426

http://www.intel.com/content/www/us/en/processors/architectu... (it's on page 1256 of 3251).

stephencanon · 12y ago

It’s actually shocking how long it took Intel to add PSHUFB to SSE. Altivec (PPC) had the even-more-powerful vperm (arbitrary shuffle mapping 32B to 16B) way back in 1999.

chacham15 · 12y ago

The VAX (circa 1977) had polynomial evaluation as an instruction[1]. What is your point?

[1] http://en.wikipedia.org/wiki/VAX

daniel-cussen · 12y ago · 2 in thread

In the GA144, lookup tables are pretty painful, so the way I implement reverse there is:

reverse: a! 16 push . 2 dup . . begin +x 2* 2* unext +x 2* a . + nip ;

In Intel x86/64, the fastest way I know of is to use SIMD instructions, and break the 64-bit word into 16 nibbles (4-bit pieces), and use PSHUFB to perform a parallel lookup against another 128-bit xmm register. Then you aggregate the nibbles in reverse order, using inclusive or and variants of the shuffle instruction.
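The nibble-lookup idea can be shown in scalar C, as a stand-in for the SIMD version: PSHUFB would do the sixteen 4-bit lookups in one instruction, whereas this sketch does them one per loop iteration. The table and function names are hypothetical, and this is not meant to be fast.

```c
#include <stdint.h>

/* nibble_rev[n] is the 4-bit value n with its bit order reversed. */
static const uint8_t nibble_rev[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a 64-bit word: look up each nibble, and emit the nibbles in
 * reverse order (the lowest input nibble ends up in the highest position). */
static uint64_t reverse64_nibbles(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (r << 4) | nibble_rev[x & 0xF];
        x >>= 4;
    }
    return r;
}
```

The 16-entry table is what would live in the second xmm register in the PSHUFB formulation.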

keenerd · 12y ago

This does an 18 bit word, right?

daniel-cussen · 12y ago

Yep. I thought this would be a huge issue when using this, but first, it's really necessary for the instruction set, and second, a lot of hardware uses 18-bit, including FPGAs (often packed with 18x18 multipliers and 18-bit SRAMs, in order to support 8b/10b SERDES) and 72-bit DDR3.

kibwen · 12y ago · 2 in thread

"Intel x86/x64 processors don’t have this instruction, so this is definitely not a portable solution."

This stuck out to me. I know that RISC vs CISC is basically a meaningless distinction nowadays, but I still naively expected that x86 would be more-or-less a strict superset of ARM.

pbsd · 12y ago

Strictly speaking, AMD's XOP extensions do have an instruction that is close enough: VPPERM. It can not only shuffle bytes, like the already mentioned PSHUFB, but also reverse the bits within each byte. Therefore, a single VPPERM instruction can reverse up to 128 bits at a time.

stephencanon · 12y ago

Modern ARM has lots of instructions that don’t have direct x86 equivalents. Most are in the vector domain, but there are plenty of non-vector examples too: BFI, BFC, BIC, ORN, RSB, saturating arithmetic, numerous multiply-add variants, etc.

Symmetry · 12y ago · 2 in thread

Very interesting, though you shouldn't be surprised by small differences between O(1) and O(N) algorithms when N is only 8.

stephencanon · 12y ago

If N is 8, then O(N) is O(1). For that matter, so is O(f(N)), for any function f.

MichaelBurge · 12y ago

Is that true? I would agree that the time is bounded by a constant, but Big O only makes sense at all as the size of the input increases without bound.
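To put the N = 8 case in concrete terms, here is a sketch of two byte reversals with different step counts: one iteration per bit, and three swap steps (log2 of 8). The names are made up, and for a fixed 8-bit input both are constant-time in the sense discussed above.

```c
#include <stdint.h>

/* N steps: move one bit per iteration. */
static uint8_t reverse_linear(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}

/* log2(N) steps: swap nibbles, then bit pairs, then adjacent bits. */
static uint8_t reverse_logsteps(uint8_t b)
{
    b = (uint8_t)((b & 0xF0) >> 4 | (b & 0x0F) << 4);
    b = (uint8_t)((b & 0xCC) >> 2 | (b & 0x33) << 2);
    b = (uint8_t)((b & 0xAA) >> 1 | (b & 0x55) << 1);
    return b;
}
```

The step counts only start to matter as a complexity class if the word width is allowed to grow.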

munificent · 12y ago · 2 in thread

> That’s one mathematical operation, but a large number of CPU instructions. CPU instructions are what matter here, though, as we see, not as much as cache coherency.

I thought it was also a single CPU instruction, but multiple clock cycles.

wnissen · 12y ago

It's not a single hardware instruction on a 32-bit CPU, I believe is the point.

gsg · 12y ago

A lot of CPUs don't have a division instruction at all, let alone 64 bit division.

twoodfin · 12y ago · 1 in thread

The article makes the point that the lookup table version is fast because the table fits in D$, and that if the table were evicted it would be slower. This is true, but the more interesting point is that by loading this table into D$, you're potentially slowing down other operations.

It's an important conundrum of optimization that if you had 20 similarly complex functions in a critical path, implementing and benchmarking each individually with a lookup table could show excellent performance while globally performance is terrible. And worse, it's uniformly terrible, with no particular function seeming to be consuming an inordinate amount of the runtime or, for that matter, D$ misses.

stephencanon · 12y ago

If you had 20 similar functions, the tables would occupy 5k in total, using only 1/6th of the L1 D$ on a typical "big" CPU. In actuality, temporal locality is such that you don't often stride through all table entries uniformly, so the actual cache pressure is even less.

The point that you're going after is a good one, but it's important to keep in mind how enormous modern memory hierarchies are. It often is very reasonable to trade memory and cache pressure for speed.

chacham15 · 12y ago · 1 in thread

The difference here between the obvious method and the best method is 55ns. Is there a reason that this problem deserves this much attention for so small a difference in time? (I realize that it is 6.5x more, but if it isn't at the center of some core loop, the multiplicative factor doesn't really matter.) What use cases are there for this?

Renaud · 12y ago

I suppose the point was to show that it's pretty bad to resort to copy/pasting clever bit hacks into libraries without taking care of how they work.

The fact that the code isn't necessarily obvious makes me think that whoever used it was hoping for an optimisation of sorts.

Terseness can lead to obfuscation, and that's the wrong sort of optimisation. So we can hope that the developer was going for speed instead, but the results show that was a huge failure.

Maybe this won't affect performance in this particular library, maybe it's called once or twice and it doesn't matter, but if this is part of the innards of a game or a cryptographic function or some low-level network stack, it could have very detrimental consequences on performance.

mgraczyk · 12y ago

The author seems to misunderstand the idea of asymptotic complexity. All of the reversal operations are O(1) because the number of bits being flipped is a constant. If he were concerned with flipping the bits in an arbitrary precision number, then his different solutions might deserve "Big-O" classifications.

Second point: the original solution is slow because a mod operation by a number that is not a power of two involves a floating point divide, or several multiply-accumulates at extended precision. Either of those two operations is slower than any of the other methods.

Scaevolus · 12y ago

I'm glad they noted that the lookup table's speed relies on it being in cache, which most "benchmark magic bit-fiddling operations" posts ignore. (Although it's temporal locality, not cache coherence, that's important for this.)

_ihaque · 12y ago

In the same vein, Andrew Dalke wrote up an interesting series of blog posts benchmarking different implementations of population count (counting the number of set bits in a word):

http://dalkescientific.com/writings/diary/archive/2008/07/03...

http://dalkescientific.com/writings/diary/archive/2008/07/05...

http://dalkescientific.com/writings/diary/archive/2011/11/02...

The Stanford Bit Hacks page linked in the original article is also very interesting reading for folks into this sort of stuff.
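For readers who have not met population count before, two simple variants are sketched below. These are not taken from the linked posts, and the function names are invented for illustration.

```c
#include <stdint.h>

/* Naive: test each bit in turn until no set bits remain. */
static unsigned popcount_naive(uint64_t x)
{
    unsigned n = 0;
    for (; x; x >>= 1)
        n += (unsigned)(x & 1u);
    return n;
}

/* Kernighan's method: x & (x - 1) clears the lowest set bit,
 * so the loop runs once per set bit rather than once per bit position. */
static unsigned popcount_kernighan(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}
```

As with bit reversal, table lookups and dedicated instructions are the other usual contenders, and which wins depends on the surrounding code.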

fjarlq · 12y ago

A great companion to this sort of thing is the book Hacker's Delight by Henry S. Warren, Jr:

http://www.hackersdelight.org/

http://www.amazon.com/Hackers-Delight-2nd-Edition-ebook/dp/B...

mzs · 12y ago

Oh man this is one of my nits. I've written code like this. For example in some bit counting code I have a block comment in front of all that with 57 lines that are not blank. I have a copy of Hacker's Delight on my bookshelf, but will the person after me know what that code does and how it works? I really hope that there was a comment before that pointing to one of the hack web pages at least.

barbs · 12y ago

Ack! Light grey on white background! My eyes!! Seriously, that's really annoying.

duedl0r · 12y ago

Why on earth does this article have so many upvotes? Running time analysis is completely wrong... O(n) vs O(1) and such...tss..don't get me started..
