The post compares (1) loop code that can be vectorized, because its iterations are independent of one another, and (2) an "optimization" that shortens the calculation but makes each loop iteration depend on the result of the previous one, so it cannot be vectorized.
The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.
(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)
Complexity isn’t always obvious. It’s common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of them in terms of geometry (memcpy = copying rectangles onto other, larger rectangles), whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.
We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)
MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.
I had hoped that functional programming languages would lead us to automatic high-level parallelization. That hasn't happened much in the real world. But what we got is "write code like a functional programmer would even in a low-level language and the compiler and the ISA will parallelize it for you." That surprises me, but I'll take it.
How is MLIR different from LLVM IR?
And tile/block processing is not that novel for generic compute either; graphics uses it for other reasons - mostly how the pixel data is laid out in memory and how the pipeline scales with tile size.
Reminded me of when I tried to optimize some code using assembly on x86, maybe 8-10 years ago. The compiler spit out a DIV [EDI] instruction inside a hot loop. I saw how I could easily rewrite it to use a register instead of the indirect memory access.
So I did, and it was significantly slower, like 10-15%... I even took care to align the loop target address and so on. If I changed my code to use the memory indirection it was measurably faster.
And with that I decided hand-optimizing x86 assembly was definitely a skill to leave behind.
x86 is fundamentally a CISC; if you treat it like a RISC, it will definitely disappoint.
(Submitted title was "Multiplications and 2 additions are faster than 2 additions")
For the first version w/ AVX I get:
$ perf stat ./a.out [-] Took: 225634 ns.
Performance counter stats for './a.out':
247.34 msec task-clock:u # 0.998 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,009 page-faults:u # 8.122 K/sec
960,495,151 cycles:u # 3.883 GHz
2,125,347,630 instructions:u # 2.21 insn per cycle
62,572,806 branches:u # 252.982 M/sec
3,072 branch-misses:u # 0.00% of all branches
4,764,794,900 slots:u # 19.264 G/sec
2,298,312,834 topdown-retiring:u # 48.2% retiring
37,370,940 topdown-bad-spec:u # 0.8% bad speculation
186,854,701 topdown-fe-bound:u # 3.9% frontend bound
2,242,256,423 topdown-be-bound:u # 47.1% backend bound
0.247734256 seconds time elapsed
0.241338000 seconds user
0.004943000 seconds sys
For the second version with SSE and the data dependency I get:
$ perf stat ./a.out [-] Took: 955104 ns.
Performance counter stats for './a.out':
975.30 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,010 page-faults:u # 2.061 K/sec
4,031,519,362 cycles:u # 4.134 GHz
3,400,341,362 instructions:u # 0.84 insn per cycle
200,073,542 branches:u # 205.140 M/sec
3,192 branch-misses:u # 0.00% of all branches
20,091,613,665 slots:u # 20.600 G/sec
3,110,995,283 topdown-retiring:u # 15.5% retiring
236,371,925 topdown-bad-spec:u # 1.2% bad speculation
236,371,925 topdown-fe-bound:u # 1.2% frontend bound
16,546,034,782 topdown-be-bound:u # 82.2% backend bound
0.975762759 seconds time elapsed
0.967603000 seconds user
0.004937000 seconds sys
As you can see the first version gets nearly 3x better IPC (2.21 vs 0.84) and spends half as much time being backend bound.

No, just because it's using SSE doesn't mean it's vectorized. SSE has both scalar and vector instructions.
SIMD is very powerful, and modern compilers can sometimes simd-ify your code automatically.
-------
The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorize it in that form. I kinda want to give it a go myself but don't have time today...
Autovectorizers are mysterious, and often harder to reason about and use than explicitly SIMD code (like OpenCL or CUDA).
EDIT: Solvable by a human. I dunno much about compilers though, so I dunno if there's some kind of language issue that would prevent the unrolling of this dependency.
Usually, for floating point operations the compiler simply has no chance to do anything clever. NaNs, infinities and signed zero mean that even the most "obvious" identities don't actually hold. For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.
But when operating on floating point numbers, == is not bitwise, and -0 == +0.
Check out data oriented design if you aren’t already familiar.
If the compiler says autovectorization failed, you rewrite the code until the autovectorizer works.
https://docs.microsoft.com/en-us/cpp/build/reference/qvec-re...
You can write code using explicit vectors and it’s not too bad, though less cross platform than you’d like.
An actual answer involves things like restrict and alignment keywords.
Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.
Or in other words: i+7 can depend on i-1, no problem.
There is also "parallel" adds (addpd).
Carefully look at the assembly language: the 1st version uses parallel adds (addpd) and parallel multiplies. The 2nd version uses scalar adds (addsd).
The other major point is that the 2nd version uses a single qword (64-bit) move per loop iteration, while the 1st version uses a full 128-bit move per loop iteration.
---------
SSE is used for scalar double-precision these days, because scalar-SSE is faster than x87 instructions... and better matches the standards (x87 had "higher precision" than the IEEE specs, so it has different results compared to other computers. SSE is closer to the specs)
That's not the tldr. It's actually more wrong than right.
The actual TLDR is: loop dependencies prohibit the compiler _and_ your CPU from parallelizing. The SIMD part is the compiler, the more important part here is the CPU though.
Long explanation:
Let's start with the compiler. Yes, the first version can be vectorized, and you can see that in the included screenshots of the assembly output: the use of addpd [1] (add 2 pairs of 64-bit doubles simultaneously) instead of addsd [2] (add 1 pair of 64-bit doubles). Both operations have identical throughput/latency on any architecture that isn't too ancient (e.g. check the information for the corresponding intrinsic here [3]; unfortunately Agner's instruction tables don't include ADDSD for most architectures). So while you can crunch 2 operations using SIMD at the same time, you are still doing 3 multiplications and 2 additions instead of 2 additions. Multiplication and addition have the same throughput, at least on Skylake. So we can use simple math, and 5/2 is still more than 2!
Now to the CPU part. The loop dependency prevents the CPU from properly utilizing instruction pipelining [4], and now differences in throughput vs latency come into play. In a pipelined scenario the CPU can work on multiple steps of the instruction cycle in parallel (from fetching the instruction, to decoding it, to executing the operation, to eventually writing the result back to the register) and thus achieve higher throughput. However, if the next instruction depends on the one before it, the CPU has to finish all the steps of the pipeline, from start to finish, before the next instruction can begin. This is the latency of an instruction. For example, mulsd/mulpd have 8 times higher latency than (inverse) throughput on Intel Skylake.
So while SIMD comes in at a factor of 2, the loop dependency stops the CPU from realizing a potential factor of 8.
Peter Cordes also correctly points this out in his comments and answer to the question on stackoverflow.
[1] https://www.felixcloutier.com/x86/addpd
[2] https://www.felixcloutier.com/x86/addsd
[3] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
I feel like for the last 10-15 years (the majority of) people have stopped thinking about the specific CPU and only think about the ISA. That works for a lot of workloads, but in recent years I have observed more and more interest in how a specific CPU can execute code as efficiently as possible.
If you're interested in the kind of optimizations performed in the example you should check out polyhedral compilation (https://polyhedral.info/) and halide (https://halide-lang.org/). Both can be used to speed up certain workloads significantly.
double A8 = 8*A;
double Z1 = 4*A + 2*B; // step for the even-index stream
double Z2 = 8*A + 2*B; // step for the odd-index stream
double Y1 = C;
double Y2 = A + B + C;
int i;
// ... handle the leftover element when LEN is odd ...
for (i = 0; i + 1 < LEN; i += 2) {
    data[i] = Y1;
    data[i + 1] = Y2;
    Y1 += Z1; Z1 += A8; // two independent dependency chains
    Y2 += Z2; Z2 += A8;
}
Probably not entirely functional as written, but you get the idea: unroll the loop so that the data-dependent paths can each be done in parallel. For the machine being considered, a 4-step unroll should achieve maximum performance, but of course you get all the fun things that come with hard-coding the architecture in your software.

Just because the algebra works, it doesn't guarantee that the code does, too.
E.g. if you use doubles for the range of 32-bit integers, even adding 1.0 two billion times to 2 billion still ends up at exactly 4 billion. Even adding 1.0 twenty billion times to 20 billion ends up at 40 billion. But adding 0.1 twenty billion times to 2 billion ends up 1908 short on my CPU, i.e. about 19080 absorptions/rounding errors occurred. You need some serious differences in magnitude and number of operations to actually trigger errors.
As a performance rule on modern processors: avoid using the result of a calculation as long as you reasonably can (in tight loops... You don't want to be out of cache.).
Have fun threading the needle!
The power consumption is a good question.
You'd need to parallelize it explicitly (which can be done by just unrolling the loop).
Second: why are so many people insisting that the loop is auto-vectorised? Is there any evidence to that? Data dependencies alone explain the observed performance delta. Auto-vectorization would have resulted in a higher speedup.
The suggested "slower" optimisation fundamentally uses fewer instructions. Throwing more hardware at parallelisable problems makes them run faster but does not necessarily reduce the power requirement much, because there are fundamentally the same number of instructions; it's just gobbling up the same power over a shorter period of time. The "slower" serial algorithm uses fewer instructions and, in theory, less power in total. Disclaimer: I mean power vs speed fundamentally, in theory; in practice there may be no measurable difference depending on the particular CPU and code.
While it might sound intuitive that SIMD instructions consume more power, I don't think that's necessarily true to a relevant degree in practice. My understanding is that CPU power consumption is mostly tied to inefficiencies that cause energy loss via heat, while the actual computation doesn't consume any energy per se. So electrons traveling a more complex path probably cause somewhat more energy loss, as there is more wire/transistor area to pass through. But most of the total loss doesn't actually occur in the ALU. Empirically, from what you can see operating systems do, the most effective way of consuming less power is running at a slower clock, and the most effective way to achieve that is getting work done faster, which is not tied to the number of instructions.
The Stack Overflow question here [1] seems to suggest that SIMD vs no SIMD has a negligible overhead compared to entering a lower power state sooner.
[1] https://stackoverflow.com/questions/19722950/do-sse-instruct...
> SIMD vs no SIMD has a negligible overhead compared to entering a lower power state sooner.
Yes, on the same CPU, as I suggested, the real-world difference may be unmeasurable. However, note that this particular case is interesting because it's not comparing fewer serial multiplies to more SIMD multiplies; it's comparing SIMD multiplies to no multiplies, but with a serial constraint due to variable dependence. I.e. it's SIMD vs no-multiply without any other difference in the number or type of ALU ops, which again could make no difference on a big x86 in practice, but it would be interesting to know.
All of this changes if you are coding for a lower power device and have choice of hardware.
Even in this example, which apparently has 4x vector instructions vs 1x scalar, AVX-512 (and probably even AVX2) would reduce energy usage because they can do 8 (or 4 for AVX2) 64-bit elements per cycle.
Good point, decoding and scheduling are expensive and SIMD certainly eliminates a lot of that. However the alternative algorithm has even less decoding and scheduling to do, since it completely eliminates the multiplication operations without increasing the number of additions. But even then I wouldn't be surprised that it makes no difference to energy on any x86, as I said it was more of a fundamental observation - for a different application this is useful when selecting the actual hardware if you are not already constrained to a particular chip.
The whole point of this thread is the realization of the opposite. The slower algo executes only 2 instructions in the loop, but the second directly depends on the result of the first (inducing a pipeline stall), while the fast version brute-forces a ton of instructions at full CPU IPC.
I.e. the CPU doesn't use much less power if most ALUs are idle, so you might as well use all of them to compute redundant results -> and then turn ~everything off sooner.
And how far would an optimiser typically take its analysis? For example, if B was defined inside the loop as (non-const) A * 2? Or as A * f(), where f() always returns 2, maybe using a constant or maybe a calculation, etc.?
Seems like a very hard problem.
Is anyone aware of good tooling for automatically catching this sort of thing even some of the time?
And... the two versions run at exactly the same speed, on an M1 Mac, with Go.
edit: of course, with go, you can manually parallelize the faster option with goroutines... but that does something else, doesn't it. (and it's 500x faster.)
(Super low energy processors, battery powered, etc.?)
This isn't actually taking advantage of speculative execution that much. The only speculation here would be in predicting that the loop repeats, which loop unrolling would mostly negate for CPUs that don't do speculative execution.
The data dependency issue, however, would still be a punishing factor. You'd need a CPU that isn't superscalar, which does exist but is increasingly uncommon (even 2014's Cortex-M7 was superscalar, although it kind of sounds like ARM backed off on that for later Cortex-Ms?)
Also many low-end / embedded CPUs that are in-order will still do branch prediction.
The general advice when I worked with CG was to use deltas in iterations.
I'm taking the definition of simple and complex (or complected) from this talk[0]. In short: Simple means individual pieces are standing on their own, not necessarily easy or minimal. Complex means individual pieces are braided together into a whole.
The first code is simpler than the second. Just have a look at the two solutions and you'll see what I mean: the individual instructions in the first stand on their own and clearly declare what they mean as well. The second example is more complex, because each line/subexpression cannot be understood in isolation; there is an implicit coupling that requires the programmer _and_ the machine code interpreter to understand the whole thing for it to make sense. The CPU apparently fails at that and cannot optimize the machine instructions into more efficient microcode.
The example illustrates performance benefits of simpler code, but there are others too:
For example a type checker might be able to infer things from simpler code, but not from complex code. Again, complexity is about coupling, which can for example arise from making assumptions about things that are out of bounds or logic that happens in some other part of your program, which _this_ part relies on.
Things like these can either be outright rejected by a type checker or simply ignored and sometimes we can provide additional ceremony to make it happy. But there is always an underlying question when these issues arise: Can I express this in a simpler way? It's sometimes possible to describe rich semantics with simpler, more primitive types and explicit coordination.
Another thing that comes to mind are rich ORMs. They can be very convenient for common cases. But they have footguns for the general case, because they _complect_ so many things, such as validation, storage, domain rules, caching etc. And they are leaky abstractions because we have to do certain things in certain ways to avoid bloated SQL, N+1 etc.
They are not simple (and certainly not declarative) so we program against a black box. There are simpler "ORMs" that I very much like, but they typically only provide a minimal set of orthogonal features, such as query builders, and mapping results into plain data structures. It is simpler to use these kind of orthogonal tools.
Last but not least: Simple code can be pulled apart and changed more easily without having to worry about introducing problems. The substitution rule or referential transparency enables this by guaranteeing that each expression can be replaced by the value it will evaluate to. This also implies that other expressions that evaluate to the same value can be substituted freely.
[0] Simple Made Easy - Rich Hickey at Strangeloop 2011 https://www.youtube.com/watch?v=LKtk3HCgTa8
>> this is a perfect example of how simpler code should be preferred _by default_ over complex code.
Extra emphasis on *by default*.
You are both on the same side of the coin. This is the essence of "premature optimization is the root of all evil". Simplicity should be the default. Once you know your implementation is solid, then you can go back and optimize.
TFA is quite contrived, but imagine a much more complex process. You see some code:
for i in loop:
    foo[i] = myfunc(foo[i], bar[i])

You have to go in and dig into why exactly foo is being fed back in at every step. But if instead you saw:

for i in loop:
    foo[i] = myfunc(bar[i])

You immediately know you have more options at your disposal to optimize that loop. Further, the compiler immediately knows it has more options. Maybe this is a good option for automatic unrolling or Duff's device. Maybe you wanna send this to a GPU shader, or use a macro to parallelize the loop. But it's a lot harder to get to that point if your starting point is the complected, stateful one.

In the 2-additions version, computation of the next iteration depends on the results of the preceding iteration. In the multiplication version, the computations are independent for each iteration, enabling parallel execution (by SIMD and/or pipelined/superscalar execution).
but I'm too lazy to test that