Nice try. You can’t escape being known as a verb now.
Everyone knows the tool as godbolt.
Most important: this optimization enables pipelined execution.
When people talk about a CPU executing an integer add instruction in ~1 cycle, what they actually mean is that the add has this latency when the CPU pipelines are full.
If you have an 11-stage pipeline... the add can often have a latency of ~11 cycles... if you write the _right_ code for it.
Then, looking at the code, it's not obvious where the infinite loop occurs.
It's amazing how many more code generation questions occur to me now that there's so much less friction in getting the answers.
#pragma omp simd reduction(+:res)
as a more precise way to achieve vectorization in the reduction (compile with -fopenmp-simd to only use it for SIMD without linking an OpenMP library): https://godbolt.org/z/17oTz1
Unfortunately, the pragma is not supported with the new-style class iterators in a released compiler, though it works in clang-trunk: https://godbolt.org/z/hbP11W
Note that Clang disables floating point contraction by default (so no vfmadd instructions), even though fused multiply-adds are more accurate. One usually wants contraction enabled globally (-ffp-contract=fast), except when trying to reproduce bit-for-bit the results of software compiled for pre-Haswell CPUs.
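For anyone who doesn't want to click through, here is a minimal sketch of how the pragma sits on a plain index-based reduction loop. This is not the code behind the links above: the function name `dot` and the `std::vector<double>` inputs are my own assumptions for illustration. Only the pragma and the -fopenmp-simd flag come from the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: dot product with an explicit SIMD reduction hint.
// Build with -fopenmp-simd so the pragma is honoured without linking an
// OpenMP runtime. Without the flag the pragma is ignored and the loop is
// still correct, just not explicitly marked for vectorization.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const double* pa = a.data();
    const double* pb = b.data();
    const std::size_t n = a.size();
    double res = 0.0;
    // The reduction clause allows the additions into `res` to be
    // reassociated across SIMD lanes.
#pragma omp simd reduction(+:res)
    for (std::size_t i = 0; i < n; ++i) {
        res += pa[i] * pb[i];
    }
    return res;
}
```

Because the reduction clause permits reordering of the additions, results for general floating point inputs may differ in the last bits from a strictly sequential sum. Whether the multiply and add are fused into vfmadd depends on the -ffp-contract setting discussed above.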
This was my key takeaway from this article. Writing clear code that is easier to maintain will have good enough performance most of the time. I was particularly impressed with the devirtualization optimizations and will be less likely to shy away from using polymorphism in future due to performance concerns.
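To make the devirtualization point concrete, here is a small sketch of the kind of code where a compiler can often turn a virtual call into a direct one. The `Shape` and `Square` types and the helper functions are hypothetical names, not taken from the article. Whether a given compiler actually devirtualizes a call depends on the optimization level and what it can prove, so it is worth checking the output on godbolt.

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

// `final` promises that no class derives from Square, so a call through a
// Square reference can only ever reach Square::sides.
struct Square final : Shape {
    int sides() const override { return 4; }
};

// Static type is the final class: the compiler may resolve this call at
// compile time and inline it.
int sides_direct(const Square& s) { return s.sides(); }

// Static type is the base class: in general this needs a vtable lookup,
// unless the compiler can prove the dynamic type (for example after inlining).
int sides_virtual(const Shape& s) { return s.sides(); }

// Both paths must return the same value; only the generated code differs.
inline void check_same(const Square& s) {
    assert(sides_direct(s) == sides_virtual(s));
}
```

The source stays ordinary polymorphic code. `final` is just extra information the optimizer can use, which fits the "write clear code first" takeaway.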