Learning SIMD with Rust by Finding Planets (2018)

(medium.com)

112 points · btashton · 6y ago · 29 comments

22 comments · 4 top-level

krapht · 6y ago · 14 in thread

This looks kinda gross to me. Do the Rust developers not want to emulate what ISPC and CUDA do? Writing intrinsics by hand is not what I expect from a 2019 language.

sgift · 6y ago

There are libraries abstracting the SIMD calls if that's what you want, e.g. faster (https://github.com/AdamNiederer/faster) or simdeez (https://github.com/jackmott/simdeez), but these have to use basic operations at the end of the day too, and this post shows how to do that.

I'm not sure renaming the primitive operations provided by Intel/AMD to something "nicer" would help much here. Using plain SIMD will always be ugly, and at least you can Google the names and get back the Intel documentation without first translating from a different name.

dragontamer · 6y ago

I disagree with the parent poster's phrasing, but they have a point: CUDA (for GPUs) and ISPC (for CPUs) make SIMD code far easier to write.

I think OpenMP, Intel's autovectorizers, etc. are all taking the wrong approach. The "graphics guys" have figured out a better model for thinking in SIMD.

With that being said, normal code has major issues before it can be converted into "Graphics-SIMD" form. Most importantly: data layout is straight up "wrong", with most data laid out in AOS (array of structs) form instead of SOA (struct of arrays) form.

Writing code that interfaces between AOS and SOA is tedious, and I'm unconvinced that any general solution can be done automatically. (Remember: the key is to convert between the forms "efficiently", because the only reason we're putting up with SIMD at all is performance.)
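A minimal sketch of the AOS/SOA interface described above, assuming a hypothetical `Particle` type with three `f64` fields. None of these names come from the article; they only illustrate the two layouts and the conversion between them.

```rust
/// Array-of-structs layout: one struct per particle (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Particle {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

/// Struct-of-arrays layout: one contiguous array per field.
#[derive(Debug, Default, PartialEq)]
pub struct Particles {
    pub x: Vec<f64>,
    pub y: Vec<f64>,
    pub z: Vec<f64>,
}

/// Convert AOS to SOA by scattering each field into its own vector.
pub fn aos_to_soa(aos: &[Particle]) -> Particles {
    let mut soa = Particles {
        x: Vec::with_capacity(aos.len()),
        y: Vec::with_capacity(aos.len()),
        z: Vec::with_capacity(aos.len()),
    };
    for p in aos {
        soa.x.push(p.x);
        soa.y.push(p.y);
        soa.z.push(p.z);
    }
    soa
}

/// Convert SOA back to AOS by zipping the three field vectors.
pub fn soa_to_aos(soa: &Particles) -> Vec<Particle> {
    soa.x
        .iter()
        .zip(&soa.y)
        .zip(&soa.z)
        .map(|((&x, &y), &z)| Particle { x, y, z })
        .collect()
}

/// With SOA, a per-field loop walks contiguous memory, which is the
/// access pattern that SIMD loads (manual or auto-vectorized) prefer.
pub fn scale_x(soa: &mut Particles, k: f64) {
    for x in &mut soa.x {
        *x *= k;
    }
}
```

The conversion itself is a plain copy in both directions, which is why it is only worth paying for when enough SIMD-friendly work happens in between.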

maeln · 6y ago

This is an example of doing manual SIMD. In some cases, it's still needed. Manual SIMD tends to be faster than what even the most modern compiler can provide.

Because it is low level, it won't be fancy, but you can find several libraries that wrap those low-level ops in fancier APIs.

Comparing it with CUDA is hardly a good comparison. Even if the GPU is basically a bunch of SIMD units, GPGPU programming is still very different from adding SIMD capability to an x86 program.
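As a rough illustration of what manual SIMD looks like in Rust, here is a sketch that adds two `f64` slices four lanes at a time with AVX intrinsics from `std::arch`, falling back to scalar code when AVX is unavailable. The function names `add`, `add_avx`, and `add_scalar` are made up for this example and do not come from the article.

```rust
/// Scalar reference: add two slices element by element.
pub fn add_scalar(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// AVX version: processes four f64 lanes per iteration.
/// Callers must ensure the CPU supports AVX before calling this.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f64], b: &[f64], out: &mut [f64]) {
    use std::arch::x86_64::{_mm256_add_pd, _mm256_loadu_pd, _mm256_storeu_pd};

    let n = a.len().min(b.len()).min(out.len());
    let full = n - n % 4; // largest multiple of 4 that fits
    let mut i = 0;
    while i < full {
        // SAFETY: i + 4 <= n, so every load and store stays in bounds.
        unsafe {
            let va = _mm256_loadu_pd(a.as_ptr().add(i));
            let vb = _mm256_loadu_pd(b.as_ptr().add(i));
            _mm256_storeu_pd(out.as_mut_ptr().add(i), _mm256_add_pd(va, vb));
        }
        i += 4;
    }
    // Handle the remaining 0..=3 elements with scalar code.
    for j in full..n {
        out[j] = a[j] + b[j];
    }
}

/// Dispatch at runtime: use AVX when the CPU has it, scalar otherwise.
pub fn add(a: &[f64], b: &[f64], out: &mut [f64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just verified at runtime.
            unsafe { add_avx(a, b, out) };
            return;
        }
    }
    add_scalar(a, b, out);
}
```

A loop this simple is something a compiler can often vectorize on its own; the intrinsics version mainly shows the shape of the code (feature check, unaligned loads, remainder loop) that the wrapper libraries hide.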

dragontamer · 6y ago

You really should check out ISPC, which is a CPU-SIMD compiler similar to CUDA.

With that being said, there's good reason to use raw intrinsics in modern code. But the ISPC / CUDA model is superior for most uses in my experience. It's just easier to think about.

The main issue with ISPC is that you're innately SOA, and the data layout is just different compared to how people normally organize their data. Data-layout issues (AOS vs SOA) are probably one of the most tedious issues to deal with when using SIMD.

For the "interface", where you're converting AOS to SOA, manual intrinsics can help.

fluffything · 6y ago

> Do the Rust developers not want to emulate what ISPC and CUDA do?

Of course, which is why the language already allows you to do that if you want to, and it often performs better than ISPC, while being memory and thread safe.

However, because Rust is a low-level language, it also allows you to write low-level code that uses assembly-like intrinsics for specific instructions manually, which is what this blog post shows.

I personally think that if your goal is to teach a new programming paradigm, like data-parallel programming (SIMD/SIMT/...), using assembly is a pretty inefficient way to do that. If you already know a data-parallel programming language like ISPC, then there is a lot of value in learning which assembly instructions your code should lower to on each hardware target, and in getting an intuition for that.

BubRoss · 6y ago

Performing better than ISPC is a pretty bold claim; you are definitely going to need to provide a source for that. Rust developers have had people telling them about ISPC for years and have waved off any need to look at it and understand why it works so well.

vardump · 6y ago

Intrinsics are still what modern high-performance "C++" code uses. Auto-vectorizers are pretty fragile and require too much babysitting to be worth it.

The_rationalist · 6y ago

Semi-automatic vectorizers are the best compromise (cf. OpenMP); they also let you multithread your code or offload it to the GPU.

epage · 6y ago

Some further background: vendor intrinsics are well documented and unchanging. They are a lot easier for a language to standardize on. As others have pointed out, there are higher-level libraries being built on top of them. This gives the Rust community the freedom to experiment with how the intrinsics should be implemented, with fewer concerns over compatibility.

This also is a nice way of handling a limited subset of other assembly instructions for systems programming while they figure out how to have inline assembly without coupling the language to its implementation.

The_rationalist · 6y ago

Rust is lagging behind in the performance world, especially because it lacks OpenMP/OpenACC support.

gameswithgo · 6y ago

The rayon library provides a lot of the same features as those.

psv1 · 6y ago

To be honest, everything in Rust looks a bit ugly to me. I really tried to like the language but the syntax, everything being overly annotated, and the number of features that you need to understand to do simple tasks - all of these make it not really worth it to pick Rust for new projects. There are other problems like the ecosystem of crates and the lack of learning resources but at least they aren't intrinsic to the language itself.

ekidd · 6y ago

> To be honest, everything in Rust looks a bit ugly to me.

I write a lot of Rust code at work, and I admit that it can sometimes be pretty noisy. There are several major contributors to this:

1. Rust offers fine-grained control over pass-by-value, pass-by-reference, and pass-by-mutable reference. This is great for performance. But it also adds a lot of "&" and "&mut" and "x.to_owned()" clutter everywhere.

2. Rust provides support for generics (aka parameterized types). Once again, this is great for performance, and it also allows better compile-time error detection. But again, you wind up adding a lot of "<T>" and "where T:" clutter everywhere.

3. Usually, Rust can automatically infer lifetimes. But every once in a while, you want to do something messy, and you end up needing to write out the lifetimes manually. This is when you end up seeing weird things like "'a". But in my experience, this is pretty rare unless I'm doing something hairy. And if I'm doing something hairy, I'm just as happy to have more explicit documentation in the source code.
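To make the three sources of noise above concrete, here is a small sketch with one function per point. The function names are invented for illustration and are not taken from the article or from any library.

```rust
use std::fmt::Display;

// 1. Pass-by-value, pass-by-reference, and pass-by-mutable-reference.

/// Takes ownership: the caller can no longer use `v` afterwards.
pub fn consume(v: Vec<i32>) -> usize {
    v.len()
}

/// Borrows immutably: the caller keeps ownership and `v` is read-only.
pub fn total(v: &[i32]) -> i32 {
    v.iter().sum()
}

/// Borrows mutably: the caller keeps ownership and sees the changes.
pub fn push_twice(v: &mut Vec<i32>, x: i32) {
    v.push(x);
    v.push(x);
}

// 2. Generics: the `<T>` and `where T:` clutter buys a function that is
//    compiled separately for each concrete type it is used with.
pub fn join_all<T>(items: &[T], sep: &str) -> String
where
    T: Display,
{
    items
        .iter()
        .map(|item| item.to_string())
        .collect::<Vec<_>>()
        .join(sep)
}

// 3. Explicit lifetimes: the result may borrow from either argument, so
//    the compiler cannot infer it and `'a` has to be written out.
pub fn longer<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() {
        a
    } else {
        b
    }
}
```

At the call site the same distinction shows up as `consume(v)`, `total(&v)`, and `push_twice(&mut v, 4)`, which is where the "&" and "&mut" noise comes from.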

Really, the underlying problem here is that (a) Rust fills the same high-control, high-performance niche as C++, but (b) Rust prefers explicit control where C++ sometimes offers magic, invisible conversions. (Yes, I declare all my C++ constructors "explicit" and avoid conversion operators.)

Syntax is a hard problem, and I've struggled to get syntax right for even tiny languages. But syntax for languages with low-level control is an even harder problem. At some point, you just need to make a decision and get used to it.

In practice, I really enjoy writing Rust. It's definitely not as simple as Ruby, Python or Go. But it fills a very different ecological niche, with finer-grained control over memory representations, and support for generics.

gameswithgo · 6y ago

This is exactly how SIMD intrinsics look in C; it's not a Rust thing, it's an intrinsics thing.

pixelpoet · 6y ago · 2 in thread

> After running benchmarks with all the variants and planets, the improvement is about 9% to 12%.

Pretty weak speedup, maybe a straight up n-body implementation would see closer to the 8x theoretical speedup.

Fronzie · 6y ago

> but Rust does not provide the Intel _mm256_cos_pd() instruction yet.

That might be part of the reason. Even with experience it's really hard to optimize code without detailed profiling, either with a profiler that shows clock ticks per instruction or by making very small changes to your code and keeping a log of the total running time after each change.
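The "small change, then log the running time" approach can be sketched with nothing but the standard library. The helper `best_of` and the workload `sum_cos` below are hypothetical names for this example; a real project would more likely reach for a benchmarking harness or a profiler.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Run `f` `runs` times and return the fastest wall-clock time observed.
/// Taking the minimum reduces noise from other processes on the machine.
pub fn best_of<F: FnMut()>(runs: u32, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

/// Stand-in workload dominated by cosine calls (an assumption made for
/// this sketch, loosely mirroring the cosine discussion in the thread).
pub fn sum_cos(n: u32) -> f64 {
    (0..n).map(|i| (i as f64).cos()).sum()
}

/// Time the workload and append one line to an in-memory log, so that
/// each code change can be compared against the previous entries.
pub fn log_run(log: &mut Vec<String>, label: &str, n: u32) {
    let elapsed = best_of(5, || {
        // black_box keeps the optimizer from deleting the computation.
        black_box(sum_cos(black_box(n)));
    });
    log.push(format!("{label}: {elapsed:?}"));
}
```

Calling `log_run(&mut log, "baseline", 1_000_000)` before a change and `log_run(&mut log, "after change", 1_000_000)` afterwards gives the kind of running log the comment describes.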

tom_mellior · 6y ago

> > but Rust does not provide the Intel _mm256_cos_pd() instruction yet.

> That might be part of the reason.

Yes, a cosine calculation should dominate all the rest of the computation. Grepping through https://www.agner.org/optimize/instruction_tables.pdf, the latency of FCOS is listed as at least 10x the latency of a floating-point add or multiply across pretty much all microarchitectures.

I'm also unsure about re-packing the results of the cosine just to allow a single multiply, the results of which are then unpacked again. It might be faster to just do that multiply in scalar code, though that's exactly the thing that would need to be measured.

qiqitori · 6y ago · 2 in thread

> AVX functions start with _mm256_

I don't know anything about Rust, but a nicer word is probably "intrinsics". They usually compile to a single instruction.

maeln · 6y ago

It's because they just use the name of the actual AVX ops (https://software.intel.com/sites/landingpage/IntrinsicsGuide...).

This is a low-level lib. They don't want to hide anything. If you see _mm* you know you are using AVX and which version (which is important for knowing which CPUs are supported).

High-level libs do use more natural names.

xiphias2 · 6y ago

The commenter was pointing out that intrinsic functions shouldn't just be called functions (I have no strong opinion on that comment). He wasn't commenting about the names of the functions themselves.

gameswithgo · 6y ago

You may enjoy my video tutorial on SIMD intrinsics as well:

https://www.youtube.com/watch?v=4Gs_CA_vm3o

I also use Rust, but it's perfectly fine for learning about intrinsics in C/C++ or .NET as well. I cover some of the fundamental strategies for using them well: how to lay out data in memory, how to deal with branches, etc.
