x86-64 uses SSE registers for all floating-point operations. I'm not sure the author realized they were looking at an -O0 binary; -O0 does not do vectorization (or much of anything else, for that matter).
mulss: multiplication of a single single-precision floating point value.
mulsd: multiplication of a single double-precision floating point value.
mulps: multiplication of a packed group of single-precision floating point values.
mulpd: multiplication of a packed group of double-precision floating point values.
If you're mostly seeing -ps suffixes only on moves and shuffles, you're looking at code that is not being vectorized. (And if you're seeing a lot of shuffles, that's also a good sign it's not well vectorized.)
Incidentally, if you're seeing unexpected -sd suffixes, those are often due to unintended conversions between float and double. They can have a noticeable effect on performance, especially if you end up calling the double versions of math functions (as they often use iterative algorithms that need more iterations to achieve double-precision).
I'm linking GCC output, because it's simpler to follow, but you see more or less the same struggle with Clang.
The code generated by Rust from the naive solution mostly uses ss instructions, whereas my two tries using `mm_dp_ps`, and `mm_mul_ps` plus `mm_hadd_ps`, were both significantly slower even though they result in fewer instructions. I suspect that for a single dot product, the overhead of moving data into and out of the 128-bit registers costs more than it's worth.
Naive Rust version output
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovss (%rdi), %xmm0
vmulss (%rsi), %xmm0, %xmm0
vmovsd 4(%rdi), %xmm1
vmovsd 4(%rsi), %xmm2
vmulps %xmm2, %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
vmovshdup %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
popq %rbp
retq
My handwritten version with `mm_mul_ps` and `mm_hadd_ps`
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
vmovaps (%rdi), %xmm0
vmulps (%rsi), %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
vhaddps %xmm0, %xmm0, %xmm0
popq %rbp
retq
Intuitively it feels like my version should be faster, but it isn't. In this code I changed the struct from 3 f32 components to an array of 4 f32 elements to avoid having to create the array during the computation itself; the code also requires specific alignment to avoid segfaulting, which I guess might also have affected performance.
0: https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-...
At the micro-op level, you can generally have multiple instructions of the same kind running "in parallel" if they are independent.
For example, look at 256 bit vmulps here: https://software.intel.com/sites/landingpage/IntrinsicsGuide...
On Ivy Bridge, you can start one vmulps per cycle, but it takes 5 cycles before you get a result. If you do several vmulps (and similar) in one long dependency chain, you will only progress by one instruction every 5 cycles!
Another point to consider is that multithreading in a hyperthreading environment could change these results: if I'm not mistaken, the two hyperthreads sharing a core compete for the same execution ports but have separate instruction scheduling. What this means is that, in the above scenario, you could theoretically have two hyperthreads each executing one vmulps every five cycles on the same core, so you actually get double the speed from two threads over one. However, less dependency-laden code (the scalar version?) could fully saturate the floating-point execution ports with just one thread, in which case you might not see any speed benefit at all from a second thread.
This of course strongly depends on the hardware and on how the code is structured. I'm also not confident that either of these effects is necessarily at play, or the prime influence here. But if you are interested in writing well-performing code at this level, these are topics you should look into!
I think the "leverage" sentence you quoted and the "with SIMD taken care of" one shortly after are maybe a bit misleading, since the asm snippets there don't really demonstrate SIMD.
No, it’s still there. What’s actually going on is that all x86-64 CPUs support SSE2, so there is little reason to use x87 in 64-bit code.
(You can use it for 80-bit precision. OTOH, for most purposes, 80-bit precision is actively harmful, and x87 is an incredible mess, so almost no one wants it.)
card.cpp:16:2: error: ‘g’ was not declared in this scope
16 | <g;p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
| ^
Edit: Yes there is. The '<g;' seems like it should have been the single character '<', perhaps a corrupted HTML escape, making the line: <p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
[0] http://www.cs.utah.edu/~aek/code/card.cpp
I also tried to optimise the code, and got great speed increases just from making the vector methods constexpr; I could quickly see that rand was problematic. And then Fabien releases this post with nvcc, which is on another level. Really great blog post!
I don’t know K, but it looks like it uses semicolons to end statements. It’s “cheating” on line count to compress statements by just removing \n.
After all, how many “lines” are in the business card raytracer? 4.
also been experimenting with pure html with an itsy-bitsy amount of css. for months now i wondered how to display code without involving javascript.
that textarea is so perfect! and i bet you when you copy and paste into word or your todo list application they won't even try to be "smart" about knowing what "rich text" is...
that's very cool.
thank you
But the reality is that more websites than not these days will send you many megabytes of JS, mainly for the purpose of tracking you and extracting money/time from you, under the guise of “user experience”.
So when I see some of those rare people who still actually care about quality, speed, performance, accessibility, etc I make sure to appreciate their work.
There is no textarea on that page. The code sections use a <pre> tag.
> This is correlated with the warning nvcc issued. Because the raytracer uses recursion, it uses a lot of stacks. So much actually that the SM cannot keep more than a few alive.
Stack frame size / "local memory" size doesn't actually directly limit occupancy. There's a list of the limiters here: https://docs.nvidia.com/gameworks/content/developertools/des.... I'm not sure why the achieved occupancy went up after removing the recursion, but I'd guess it was something like the compiler was able to reduce register usage.
And no I haven't tried to compile it. https://pastebin.com/LDRd6U4e
typedef float F;typedef int I;
// saves 2 lines (but zero bytes)
#define R return
// OR
#define O(S,A,R) operator S(A){return R;}
#define E(F){E_((F||cudaPeekAtLastError)(),__FILE__,__LINE__);}
and so on - although you'd probably have to make semantic changes to get it onto a single card. (Or use a smaller font, but that's presumably cheating.)
The initial time is not 101.8 seconds, it's 11.6 seconds.
-march=native may also be useful, as it would allow the compiler to use newer CPU instructions, and tune the generated code to your hardware. That would make the program less portable, but it's not like CUDA is portable either.
My machine matches those numbers surprisingly closely. With -O0 it took 89.6s. With -O3, it took 11.7s. With -Ofast (which combines -O3 and -ffast-math), it took 10.6s. With -Ofast -march=native, it took 8.9s. I would expect those gains to extrapolate to the multi-threaded version, maybe pushing it down to 1 second without any further work. (Note: I'm using GCC on Ubuntu 18.04 with a Haswell i7. Your mileage may vary.)