x86-64 uses SSE registers for all floating-point operations. I'm not sure the author realized they were looking at an -O0 binary; -O0 does not do vectorization (or much of anything else, for that matter).
mulss: multiplication of a single single-precision floating point value.
mulsd: multiplication of a single double-precision floating point value.
mulps: multiplication of a packed group of single-precision floating point values.
mulpd: multiplication of a packed group of double-precision floating point values.
If you're mostly seeing -ps suffixes only on moves and shuffles, you're looking at code that is not being vectorized. (And if you're seeing a lot of shuffles, that's also a good sign it's not well vectorized.)
Incidentally, if you're seeing unexpected -sd suffixes, those are often due to unintended conversions between float and double. They can have a noticeable effect on performance, especially if you end up calling the double versions of math functions (as they often use iterative algorithms that need more iterations to achieve double-precision).
I'm linking GCC output, because it's simpler to follow, but you see more or less the same struggle with Clang.
The code generated by Rust from the naive solution mostly uses ss instructions, whereas my two tries using `mm_dp_ps`, and `mm_mul_ps` plus `mm_hadd_ps`, were both significantly slower even though they result in fewer instructions. I suspect that for a single dot product, the overhead of moving data into and out of the 128-bit registers costs more than it's worth.
Naive Rust version output
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovss (%rdi), %xmm0
vmulss (%rsi), %xmm0, %xmm0
vmovsd 4(%rdi), %xmm1
vmovsd 4(%rsi), %xmm2
vmulps %xmm2, %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
vmovshdup %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
popq %rbp
retq
My handwritten version with `mm_mul_ps` and `mm_hadd_ps`
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %xmm0
vmulps (%rsi), %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
popq %rbp
retq
Intuitively it feels like my version should be faster, but it isn't. In this code I changed the struct from 3 f32 components to an array of 4 f32 elements to avoid having to create the array during the computation itself; the code also requires specific alignment to avoid segfaulting, which I guess might also have affected performance.
0: https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-...
At the micro-op level, you can generally have multiple instructions of the same kind running "in parallel" if they are independent.
For example, look at 256 bit vmulps here: https://software.intel.com/sites/landingpage/IntrinsicsGuide...
On Ivy Bridge, you can start one vmulps per cycle, but it takes 5 cycles before you get a result. If you do several vmulps (and similar) in one long dependency chain, you will only progress by one instruction every 5 cycles!
Another point to consider is that multithreading in a hyperthreading environment could change these results: if I'm not mistaken, the two hyperthreads sharing a core compete for the same execution ports but have separate instruction scheduling. What this means is that, in the above scenario, you could theoretically have two hyperthreads each executing one vmulps every five cycles on the same core, so you actually get double the speed from two threads over one. However, less dependency-laden code (the scalar version?) could fully saturate the floating-point execution ports with just one thread, in which case you might not see any speed benefit at all from a second thread.
This of course strongly depends on the hardware and on how the code is structured. I'm also not confident that either of these effects is necessarily at play, or the prime influence here. But if you are interested in writing well-performing code at this level, these are topics you should look into!
I think the "leverage" sentence you quoted and the "with SIMD taken care of" one shortly after are maybe a bit misleading, since the asm snippets there don't really demonstrate SIMD.
No, it’s still there. What’s actually going on is that all x86-64 CPUs support SSE2, so there is little reason to use x87 in 64-bit code.
(You can use it for 80-bit precision. OTOH, for most purposes, 80-bit precision is actively harmful, and x87 is an incredible mess, so almost no one wants it.)
card.cpp:16:2: error: ‘g’ was not declared in this scope
16 | <g;p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
| ^
Edit: Yes there is. The '<g;' seems like it should have been the single character '<', perhaps a corrupted HTML escape, making the line: <p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
[0] http://www.cs.utah.edu/~aek/code/card.cpp
I also tried to optimise the code, and got great speed increases just from making the vector methods constexpr; I could quickly see that rand was problematic. And then Fabien releases this post with nvcc, which is on another level. Really great blog post!
I don’t know K, but it looks like it uses semicolons to end statements. It’s “cheating” on line count to compress statements by just removing \n.
After all, how many “lines” are in the business card raytracer? 4.
also been experimenting with pure html with an itsy-bitsy amount of css. for months now i wondered how to display code without involving javascript.
that textarea is so perfect! and i bet you when you copy and paste into word or your todo list application they won't even try to be "smart" about knowing what "rich text" is...
that's very cool.
thank you
But the reality is that more websites than not these days will send you many megabytes of JS, mainly for the purpose of tracking you and extracting money/time from you, under the guise of “user experience”.
So when I see some of those rare people who still actually care about quality, speed, performance, accessibility, etc I make sure to appreciate their work.
There is no textarea on that page. The code sections use a <pre> tag.
> This is correlated with the warning nvcc issued. Because the raytracer uses recursion, it uses a lot of stacks. So much actually that the SM cannot keep more than a few alive.
Stack frame size / "local memory" size doesn't actually directly limit occupancy. There's a list of the limiters here: https://docs.nvidia.com/gameworks/content/developertools/des.... I'm not sure why the achieved occupancy went up after removing the recursion, but I'd guess it was something like the compiler was able to reduce register usage.
And no I haven't tried to compile it. https://pastebin.com/LDRd6U4e
typedef float F;typedef int I;
// saves 2 lines (but zero bytes)
#define R return
// OR
#define O(S,A,R) operator S(A){return R;}
#define E(F){E_((F||cudaPeekAtLastError)(),__FILE__,__LINE__);}
and so on - although you'd probably have to make semantic changes to get it onto a single card. (Or use a smaller font, but that's presumably cheating.)
The initial time is not 101.8 seconds, it's 11.6 seconds.
-march=native may also be useful, as it would allow the compiler to use newer CPU instructions, and tune the generated code to your hardware. That would make the program less portable, but it's not like CUDA is portable either.
My machine matches those numbers surprisingly closely. With -O0 it took 89.6s. With -O3, it took 11.7s. With -Ofast (which combines -O3 and -ffast-math), it took 10.6s. With -Ofast -march=native, it took 8.9s. I would expect those gains to extrapolate to the multi-threaded version, maybe pushing it down to 1 second without any further work. (Note: I'm using GCC on Ubuntu 18.04 with a Haswell i7. Your mileage may vary.)