There is nothing standard about the optimizations -- direct AVX use is enormously uncommon, even in extremely high-performance code.
But I wouldn't say writing media and signal processing inner loops with SIMD intrinsics is uncommon. The style of optimization I illustrate is pretty common, perhaps minus the AVX code path. Most widely used video/image processing, ray tracing, and other compute-bound libraries will probably be SIMD-optimized in some fashion (likely with different code paths for different processors). You gain 1-8x in performance, which is pretty significant -- the same order of magnitude of speedup as threading your program.
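For concreteness, here is a minimal sketch (mine, not from the article) of the kind of hand-vectorized inner loop this refers to -- a scale-and-add pass over float buffers, using baseline SSE intrinsics with a scalar tail for the remainder; the function name is hypothetical:

```c
#include <emmintrin.h>  /* SSE/SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>

/* y[i] += a * x[i], four floats per iteration.
   Illustrative only -- real libraries would add AVX (and other)
   code paths selected at runtime by CPU feature detection. */
void saxpy_sse(float a, const float *x, float *y, size_t n) {
    __m128 va = _mm_set1_ps(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
    }
    for (; i < n; ++i)  /* scalar tail for the last n % 4 elements */
        y[i] += a * x[i];
}
```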
I have yet to see anyone truly and systematically trusting automatic vectorization, but perhaps there are libs out there I've missed. Anyone know of some?
I made many of the same transformations as you did: switching the object data to a structure of arrays, splitting the computation of the normal out of the loop, etc. (even the int hit = -1). My goal was to coax Intel's compiler into autovectorizing that loop without directly using vector intrinsics. I succeeded, but the result turned out to be noticeably slower than just compiling kid0man's version with -fast. Part of that, I suspect, is that it generated suboptimal code for the reduction over the minimum, where a human programmer would have used a movemask as you did.
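Roughly, the loop shape I was aiming for looks like this (a hypothetical sketch with names of my own, not the actual code): fields in a structure of arrays so vector loads pick up the same field for consecutive objects, no early exit, and the hit index carried as a plain int sentinel:

```c
#include <stddef.h>

/* Structure-of-arrays layout: one contiguous array per field,
   so the compiler can vectorize across objects. (Hypothetical
   names; r2 holds squared radii.) */
struct SpheresSoA { const float *cx, *cy, *cz, *r2; size_t n; };

/* Straight-line loop in the shape autovectorizers like:
   no break, hit index carried as int (-1 means miss). */
int nearest_hit(const struct SpheresSoA *s, float px, float py, float pz) {
    int hit = -1;
    float best = 1e30f;
    for (size_t i = 0; i < s->n; ++i) {
        float dx = s->cx[i] - px, dy = s->cy[i] - py, dz = s->cz[i] - pz;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 <= s->r2[i] && d2 < best) { best = d2; hit = (int)i; }
    }
    return hit;
}
```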
That said, I'm fairly curious to see how it would perform with the kernel compiled via ispc [1].
Regarding direct use of SIMD intrinsics in inner loops, I have to agree that it's still reasonably common for this type of thing. I've certainly done it before in ray tracing contexts [2], and I've seen many others do it as well, e.g. [3] and [4]. Autovectorization and things like Intel's array notation extension [5] seem to be getting better all the time, but I don't think they're generally yet as performant as direct use of intrinsics. In the cases where they are, it usually seems to have taken a fair amount of coaxing and prodding.
[2] http://www.cs.utah.edu/~aek/research/triangle.pdf
[3] https://github.com/embree/embree
[4] http://visual-computing.intel-research.net/publications/pape...
[5] http://software.intel.com/en-us/blogs/2010/09/03/simd-parall...
I actually did a version that did the reduction over the minimum purely in SIMD, with a post step that reduced the per-lane minima to a single scalar. It was somewhat tricky to get the index right, and in the end it turned out not to be faster (at least not for the little example of 32 objects; I imagine you would gain something in a more complex scene).
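A minimal sketch of that scheme (mine, not the actual code; assumes SSE2 and, for brevity, n a multiple of 4): four running minima and their indices live in vector registers, and the post step reduces the four lanes to one scalar index.

```c
#include <emmintrin.h>  /* SSE/SSE2 */

/* Index of the minimum element of t (n > 0, n % 4 == 0 assumed). */
int argmin_sse(const float *t, int n) {
    __m128 vmin  = _mm_loadu_ps(t);          /* running minima, 4 lanes */
    __m128i vidx = _mm_setr_epi32(0, 1, 2, 3); /* their indices */
    __m128i icur = vidx;
    const __m128i four = _mm_set1_epi32(4);
    for (int i = 4; i < n; i += 4) {
        icur = _mm_add_epi32(icur, four);
        __m128 v  = _mm_loadu_ps(t + i);
        __m128i lt = _mm_castps_si128(_mm_cmplt_ps(v, vmin)); /* lanes that improved */
        vmin = _mm_min_ps(vmin, v);
        /* blend indices by mask (SSE2 has no blend, so and/andnot/or) */
        vidx = _mm_or_si128(_mm_and_si128(lt, icur),
                            _mm_andnot_si128(lt, vidx));
    }
    /* post step: reduce the four lane minima to a single scalar */
    float mins[4]; int idxs[4];
    _mm_storeu_ps(mins, vmin);
    _mm_storeu_si128((__m128i *)idxs, vidx);
    int best = 0;
    for (int k = 1; k < 4; ++k)
        if (mins[k] < mins[best]) best = k;
    return idxs[best];
}
```

This is where the index bookkeeping gets fiddly, as noted above: the comparison mask that updates the minima must also select between the old and new index vectors, lane by lane.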
Anyway, it was a fun little exercise and it has sparked some interesting discussion. Thanks for posting the original.
But at that point this has nothing to do with Go or C++, and I find this whole discussion rather disingenuous. (At first I thought you were detailing the maturity of C/C++ compilers and their superior support for auto-vectorization, which would be a reasonable angle.) You can import the Intel math libraries and call them from Go -- I know, because I do it regularly; see my submissions.
I still think a systems programming language needs to offer escape hatches while striving for ease of use in the common case. C++ has plenty of hatches, at the cost of horrific complexity.
But suppose I'm willing to pay the cost of writing my code in 5 different code paths for different processors to get that extra 2-4x of performance. Very few languages offer that possibility, and most of those that do only offer calling out to a C library. I'm the guy stuck writing the Intel math libraries of the world, and I want something more reasonable to write them in.