Of course, this happens only once per connection so there's generally not much of a need for it to be particularly fast.
The most obvious application I can think of for reversing bits within a byte would be for image processing applications, such as mirroring an image horizontally, or making kaleidoscopes. There are probably signal processing applications, too...
An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations every clock cycle mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may be sitting idle for a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.)
Consequently, algorithms with a few extra operations but more operation parallelism will frequently be faster than an equivalent algorithm where the operations are necessarily serialized in the CPU.
Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as using splitting the algorithm across two accumulator variables and combining them at the end when any normal programmer would just use one variable to achieve the same thing can have a large impact on performance. I once doubled the performance of a bit-twiddling algorithm simply by taking the algorithm and using three variables instead of one. The algorithm was identical but the use of three registers exposed the available parallelism to the CPU.
Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper threading will take advantage of them too).
Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.
All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.
[1] http://www.anandtech.com/show/6355/intels-haswell-architectu...
Fortunately, it is pretty simple to induce the desired optimization from the C code without resorting to much cleverness. The compilers miss these optimizations often enough that I frequently double check if I care. Still, it requires fairly detailed knowledge of the microarchitecture.
I do not do microarchitecture optimization work very often. The last time I did, it was to design a faster, better hash function to replace Google's CityHash (and the result was faster and stronger). For most codes, memory behaviors dominate with respect to performance.
For people who know what they're doing (e.g. DarkShikari), the compiler doesn't stand a chance: http://www.scribd.com/doc/137419114/Introduction-to-AVX2-opt... (via https://news.ycombinator.com/item?id=5598010)
P.S. Scribd stinks. The important numbers are on the hidden second and third pages. Do people know that Scribd is doing this to their documents? That document is CC-BY-NC -- charging money to read pages 2 and 3 or download the original is not NC.
Found the original here: http://mailman.videolan.org/pipermail/x264-devel/attachments...
Note that micro-fusion can allow you to push the number of uops retired per cycle to 5-6 (SNB would be 3 ALU, 2 loads, Haswell could be 4 ALU, 2 loads), as the issue/retire rates are on the fused domain.
It's a far cry from RISC - like a load/store machine.
All that being said micro-fusion only works with an ALU op and load from the same instruction and you obviously have to be careful to ensure that the load applies to the appropriate operand (tricky given the non-orthogonal, frequently 2-address form of the instructions).
(admission: i'm spending a lot of my time staring at ways to make it really really easy to write fast numerical codes, so thinking about the ports on modern CPUs is very very helpful)
One way to put the ideas you learn into practice is to try to write an optimized large NxN matrix multiplication routine. Start in C, converting the kernel to x86 code. Also disassemble the generated C code in the kernel to see what the compiler is doing. See how close you can get to the theoretical peak CPU performance.
It is fun stuff. While this kind of optimization is rarely needed, doing so (like learning lisp) will make you a better programmer.
x86 has had the BSWAP instruction since the 486.
gcc has a __builtin_bswap16, __builtin_bswap32, and __builtin_bswap64 which will presumably take advantage of these built-in instructions on x86 and any other gcc-supported architectures where similar instructions exist (and fall back to a reasonably fast and well-tested multi-instruction implementation where they don't).
You should really RTFM every couple years, just to know what your processor [1] and compiler [2] can do.
[1] http://www.intel.com/content/www/us/en/processors/architectu...
[1] : http://stackoverflow.com/questions/14547087/extracting-bits-...
((x * 0x01010101) & 0xC0300C03) % 1023
This is probably not gonna be faster than the naive approach, though.http://stackoverflow.com/a/9040426
http://www.intel.com/content/www/us/en/processors/architectu... (it's on page 1256 of 3251).
reverse: a! 16 push . 2 dup . . begin +x 2* 2* unext +x 2* a . + nip ;
In Intel x86/64, the fastest way I know of is to use SIMD instructions, and break the 64-bit word into 16 nibbles (4-bit pieces), and use PSHUFB to perform a parallel lookup against another 128-bit xmm register. Then you aggregate the nibbles in reverse order, using inclusive or and variants of the shuffle instruction.
This stuck out to me. I know that RISC vs CISC is basically a meaningless distinction nowadays, but I still naively expected that x86 would be more-or-less a strict superset of ARM.
I thought it was also a single CPU instruction, but multiple clock cycles.
It's an important conundrum of optimization that if you had 20 similarly complex functions in a critical path, implementing and benchmarking each individually with a lookup table could show excellent performance while globally performance is terrible. And worse, it's uniformly terrible, with no particular function seeming to be consuming an inordinate amount of the runtime or, for that matter, D$ misses.
The point that you're going after is a good one, but its important to keep in mind how enormous modern memory hierarchies are. It often is very reasonable to trade memory and cache pressure for speed.
The fact that the code isn't necessarily obvious makes me think that whoever used it was hoping for an optimisation of sorts.
Terseness can lead to obfuscation, and that's the wrong sort of optimisation. So we can hope that the developer was going for speed instead, but the results show that was a huge failure.
Maybe this won't affect performance in this particular library, maybe it's called once or twice and it doesn't matter, but if this is part of the innards of a game or a cryptographic function or some low-level network stack, it could have very detrimental consequences on performance.
Second point: The reason that the original solution is slow is because a mod operation by a number that is not a power of two involves a floating point divide, or several multiply accumulates at extended precision. Either of those two operations are slower than any of the other methods.
http://dalkescientific.com/writings/diary/archive/2008/07/03...
http://dalkescientific.com/writings/diary/archive/2008/07/05...
http://dalkescientific.com/writings/diary/archive/2011/11/02...
The Stanford Bit Hacks page linked in the original article is also very interesting reading for folks into this sort of stuff.
http://www.hackersdelight.org/
http://www.amazon.com/Hackers-Delight-2nd-Edition-ebook/dp/B...