There is nothing standard about the optimizations -- direct AVX use is enormously uncommon, even in extremely high-performance code.
But I wouldn't say writing media and signal processing inner loops with SIMD intrinsics is uncommon. The style of optimization I illustrate is pretty common, perhaps minus the AVX code path. Most widely used video/image processing, ray tracing, and other compute-bound libraries will probably be SIMD-optimized in some fashion (likely with different code paths for different processors). You gain 1-8x in performance, which is pretty significant -- the same order of magnitude of speedup as threading your program.
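For concreteness, here is a minimal sketch (mine, not from the article) of the kind of hand-vectorized inner loop this refers to -- a scale-and-add pass over float buffers, using baseline SSE intrinsics with a scalar tail for the remainder; the function name is hypothetical:

```c
#include <emmintrin.h>  /* SSE/SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>

/* y[i] += a * x[i], four floats per iteration.
   Illustrative only -- real libraries would add AVX (and other)
   code paths selected at runtime by CPU feature detection. */
void saxpy_sse(float a, const float *x, float *y, size_t n) {
    __m128 va = _mm_set1_ps(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
    }
    for (; i < n; ++i)  /* scalar tail for the last n % 4 elements */
        y[i] += a * x[i];
}
```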
I have yet to see anyone truly and systematically trusting automatic vectorization, but perhaps there are libs out there I've missed. Anyone know of some?
I made many of the same transformations as you did: switching the object data to a structure of arrays, splitting the computation of the normal out of the loop, etc. (even the int hit = -1). My goal was to coax Intel's compiler into autovectorizing that loop without directly using vector intrinsics. I succeeded, but the result turned out to be noticeably slower than just compiling kid0man's version with -fast. Part of that, I suspect, is that it generated suboptimal code for the reduction over the minimum, where a human programmer would have used a movemask as you did.
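Roughly, the loop shape I was aiming for looks like this (a hypothetical sketch with names of my own, not the actual code): fields in a structure of arrays so vector loads pick up the same field for consecutive objects, no early exit, and the hit index carried as a plain int sentinel:

```c
#include <stddef.h>

/* Structure-of-arrays layout: one contiguous array per field,
   so the compiler can vectorize across objects. (Hypothetical
   names; r2 holds squared radii.) */
struct SpheresSoA { const float *cx, *cy, *cz, *r2; size_t n; };

/* Straight-line loop in the shape autovectorizers like:
   no break, hit index carried as int (-1 means miss). */
int nearest_hit(const struct SpheresSoA *s, float px, float py, float pz) {
    int hit = -1;
    float best = 1e30f;
    for (size_t i = 0; i < s->n; ++i) {
        float dx = s->cx[i] - px, dy = s->cy[i] - py, dz = s->cz[i] - pz;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 <= s->r2[i] && d2 < best) { best = d2; hit = (int)i; }
    }
    return hit;
}
```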
That said, I'm fairly curious to see how it would perform with the kernel compiled via ispc [1].
Regarding direct use of SIMD intrinsics in inner loops, I have to agree that it's still reasonably common for this type of thing. I've certainly done it before in ray tracing contexts [2], and I've seen many others do it as well, e.g. [3] and [4]. Autovectorization and things like Intel's array notation extension [5] seem to be getting better all the time, but I don't think they're generally yet as performant as direct use of intrinsics. In the cases where they are, it usually seems to have taken a fair amount of coaxing and prodding.
[2] http://www.cs.utah.edu/~aek/research/triangle.pdf
[3] https://github.com/embree/embree
[4] http://visual-computing.intel-research.net/publications/pape...
[5] http://software.intel.com/en-us/blogs/2010/09/03/simd-parall...
I actually did a version that did the reduction over the minimum purely in SIMD, with a post step that reduced the per-lane minima to a single scalar. It was somewhat tricky to get the index right, and in the end it turned out not to be faster (at least not for the little example of 32 objects; I imagine you would gain something in a more complex scene).
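A minimal sketch of that scheme (mine, not the actual code; assumes SSE2 and, for brevity, n a multiple of 4): four running minima and their indices live in vector registers, and the post step reduces the four lanes to one scalar index.

```c
#include <emmintrin.h>  /* SSE/SSE2 */

/* Index of the minimum element of t (n > 0, n % 4 == 0 assumed). */
int argmin_sse(const float *t, int n) {
    __m128 vmin  = _mm_loadu_ps(t);          /* running minima, 4 lanes */
    __m128i vidx = _mm_setr_epi32(0, 1, 2, 3); /* their indices */
    __m128i icur = vidx;
    const __m128i four = _mm_set1_epi32(4);
    for (int i = 4; i < n; i += 4) {
        icur = _mm_add_epi32(icur, four);
        __m128 v  = _mm_loadu_ps(t + i);
        __m128i lt = _mm_castps_si128(_mm_cmplt_ps(v, vmin)); /* lanes that improved */
        vmin = _mm_min_ps(vmin, v);
        /* blend indices by mask (SSE2 has no blend, so and/andnot/or) */
        vidx = _mm_or_si128(_mm_and_si128(lt, icur),
                            _mm_andnot_si128(lt, vidx));
    }
    /* post step: reduce the four lane minima to a single scalar */
    float mins[4]; int idxs[4];
    _mm_storeu_ps(mins, vmin);
    _mm_storeu_si128((__m128i *)idxs, vidx);
    int best = 0;
    for (int k = 1; k < 4; ++k)
        if (mins[k] < mins[best]) best = k;
    return idxs[best];
}
```

This is where the index bookkeeping gets fiddly, as noted above: the comparison mask that updates the minima must also select between the old and new index vectors, lane by lane.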
Anyway, it was a fun little exercise and it has sparked some interesting discussion. Thanks for posting the original.
But at that point this has nothing to do with Go or C++, and I find this whole discussion rather disingenuous. (At first I thought you were detailing the maturity of C/C++ compilers and their superior support for auto-vectorization, which would be a reasonable angle.) You can import the Intel math libraries and call them from Go -- I know, because I do it regularly; see my submissions.
I still think a systems programming language needs to offer escape hatches while striving for ease of use in the common case. C++ has plenty of hatches, at the cost of horrific complexity.
But suppose I'm willing to pay the cost of writing my code in 5 different code paths for different processors to get that extra 2-4x of performance. Very few languages offer that possibility, and most of those that do only offer calling out to a C library. I'm the guy stuck writing the Intel math libraries of the world, and I want something more reasonable to write them in.