I'm not sure renaming the primitive operations provided by intel/amd to something "nicer" would help much here. Using plain SIMD will always be ugly and at least you can Google the names and get back the Intel documentation without first translating from a different name.
I think OpenMP / Intel Autovectorizers / etc. etc. are all taking the wrong approach. The "graphics guys" have figured out a better model for thinking in SIMD.
With that being said, normal code has major issues before it can be converted into "Graphics-SIMD" form. Most importantly: data-layout is straight up "wrong", with most data-layouts in AOS (array of structs) instead of SOA form.
Writing code that interfaces between AOS and SOA is tedious, and I'm unconvinced that any general solution can be done automatically. (Remember: the key is to convert between the forms "efficiently", because the only reason we're putting up with SIMD at all is due to performance reasons).
Because it is low level, it won't be fancy, but you can find several library that wrap those low-level ops in more fancy APIs.
Comparing it with CUDA is hardly a good comparison. Even if the GPU is basically a bunch of SIMD unit, GPGPU programming is still very different than adding SIMD capability to an x86 program.
With that being said, there's good reason to use raw intrinsics in modern code. But ipsc / CUDA model is superior for most uses in my experience. Its just easier to think about.
The main issue with IPSC is that you're innately SOA, and the data-layout is just different compared to how people normally organize their data. Data-layout issues (AOS vs SOA) are probably one of the most tedious issues to deal with when using SIMD.
For the "interface", where you're converting AOS to SOA, manual intrinsics can help.
Of course, which is why the language already allows you to do that if you want to, and it often performs better than ISPC, while being memory and thread safe.
However, because Rust is a low-level language, it also allows you to write low-level code that uses assembly-like intrinsics for specific instructions manually, which is what this blog post shows.
I personally think that if your goal is to teach a new programming paradigm, like data-parallel programing (SIMD/SIMT/..), using assembly is a pretty inefficient way to do that. If you already know a data-parallel programming language like ISPC, then there is a lot of value on learning to which assembly instructions your code should lower to on each hardware and getting an intuition for that.
This also is a nice way of handling a limited subset of other assembly instructions for systems programming while they figure out how to have inline assembly without coupling the language to its implementation.
I write a lot of Rust code at work, and I admit that it can sometimes be pretty noisy. There are several major contributors to this:
1. Rust offers fine-grained control over pass-by-value, pass-by-reference, and pass-by-mutable reference. This is great for performance. But it also adds a lot of "&" and "&mut" and "x.to_owned()" clutter everywhere.
2. Rust provides support for generics (aka parameterized types). Once again, this is great for performance, and it also allows better compile-time error detection. But again, you wind up adding a lot of "<T>" and "where T:" clutter everywhere.
3. Usually, Rust can automatically infer lifetimes. But every once in a while, you want to do something messy, and you end up needing to write out the lifetimes manually. This is when you end up seeing weird things like "'a". But in my experience, this is pretty rare unless I'm doing something hairy. And if I'm doing something hairy, I'm just as happy to have more explicit documentation in the source code.
Really, the underlying problem here is that (a) Rust fills the same high-control, high-performance niche as C++, but (b) Rust prefers explicit control where C++ sometimes offers magic, invisible conversions. (Yes, I declare all my C++ constructors "explicit" and avoid conversion operators.)
Syntax is a hard problem, and I've struggled to get syntax right for even tiny languages. But syntax for languages with low-level control is an even harder problem. At some point, you just need to make a decision and get used to it.
In practice, I really enjoy writing Rust. It's definitely not as simple as Ruby, Python or Go. But it fills a very different ecological niche, with finer-grained control over memory representations, and support for generics.
Pretty weak speedup, maybe a straight up n-body implementation would see closer to the 8x theoretical speedup.
That might be part of the reason. Even with experience it's really hard to optimize code without detailed profiling. Either with a profiler that shows clock-ticks per instruction or by making very small changes to your code and keep a log of the total running time after each change.
> That might be part of the reason.
Yes, a cosine calculation should dominate all the rest of the computation. Grepping through https://www.agner.org/optimize/instruction_tables.pdf, the latency of FCOS is listed as at least 10x the latency of a floating-point add or multiply across pretty much all microarchitectures.
I'm also unsure about re-packing the results of the cosine just to allow a single multiply, the results of which are then unpacked again. It might be faster to just do that multiply in scalar code, though that's exactly the thing that would need to be measured.
I don't know anything about Rust, but a nicer word is probably "intrinsics". They usually compile to a single instruction.
This is a low-level lib. They don't want to hide anything. If you see _mm* you know you are using AVX and which version (which is important to know which CPU is supported).
High level lib do use more natural names.
https://www.youtube.com/watch?v=4Gs_CA_vm3o
I also use Rust but its perfectly fine for learning about intrinsics in C/C++ or .NET as well. I cover some of the fundamental strategies for using them well, how to lay out data in memory, how to deal with branches, etc.