RISC-V Vector Primer

(github.com)

69 points · oxxoxoxooo · 3mo ago · 22 comments

timhh · 2mo ago

If you get lost in the SEW, LMUL, VLMAX, etc. stuff, I wrote a brief explanation here:

https://blog.timhutt.co.uk/riscv-vector/

It has a visualisation of the element selection stuff at the end.
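
For reference, the three parameters are tied together by VLMAX = LMUL × VLEN / SEW. A minimal sketch of that arithmetic follows. The function name `vlmax` and the VLEN of 128 bits are assumptions chosen for illustration and are not taken from the linked post:

```python
from fractions import Fraction


def vlmax(vlen_bits: int, sew_bits: int, lmul: Fraction) -> int:
    """Elements per register group: VLMAX = LMUL * VLEN / SEW."""
    # Fraction keeps fractional LMUL (1/2, 1/4, 1/8) exact.
    return (lmul * vlen_bits) // sew_bits


# VLEN = 128 is only an assumed example width; real hardware varies.
print(vlmax(128, 32, Fraction(1)))     # 4 elements of 32 bits in one register
print(vlmax(128, 8, Fraction(8)))      # 128 bytes across a group of 8 registers
print(vlmax(128, 16, Fraction(1, 2)))  # 4 halfwords in half a register
```

Raising LMUL groups registers together so one instruction covers more elements; raising SEW makes each element wider, so fewer fit.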

camel-cdr · 2mo ago

I like this document, but it seems to be written with a very specific implementation in mind.

You can implement both regular SIMD ISAs and scalable SIMD/Vector ISAs in a "Vector processor" style and both in a regular SIMD style.

shash · 2mo ago

It _is_ about the RISC-V Vector extension, so it has a very specific ISA in mind at the very least. There's another extension (not ratified, I think) called Packed SIMD for RISC-V, but this isn't about that.

camel-cdr · 2mo ago

The name, yes, but going by name is a bad idea as the V in AVX also stands for Vector. BTW, you'll be disappointed if you think of the P extension as something like SSE/AVX. The target for it is way lower power/perf, like a stripped-down MMX.

My point was about the underlying hardware implementation, specifically:

> "As shown in Figure 1-3, array processors scale performance spatially by replicating processing elements, while vector processors scale performance temporally by streaming data through pipelined functional units"

That applies to the hardware implementation, not the ISA, which the text does not make clear.

You can implement AVX-512 with a smaller data path than the register width and still "scale performance temporally by streaming data through pipelined functional units". Zen4 is a simple example of this, but there is nothing stopping you from implementing AVX-512 on top of heavily temporally pipelined 64-bit-wide execution units.

Similarly, you can implement RVV with a smaller data path than VLEN, but you can also implement it as a bog-standard SIMD processor. The only thing that slightly complicates the comparison is LMUL, but it is fundamentally equivalent to unrolling.

The substantial difference between Vector and SIMD ISAs is, imo, only the existence of a vl-based predication mechanism. Whether or not a SIMD ISA has a fixed register width, and so whether it lets you write vector-length-agnostic code, is an independent dimension of the ISA design. E.g. the Cray-1 was without a doubt a Vector processor, but the vector registers on all compatible platforms had the exact same length. It did, however, have the mentioned vl-based predication mechanism. You could take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, and end up with a scalable SIMD ISA for which you can write vector-length-agnostic code, but that doesn't make it a Vector ISA any more than it was before.
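
The point above about data path width versus register width can be sketched in a few lines. This is a toy model, not a description of any real core: the function names and the `lanes_per_pass` parameter are hypothetical, and a Python list stands in for a vector register.

```python
def vadd_full_width(a, b):
    """Spatial style: every lane of the register is added in one step."""
    return [x + y for x, y in zip(a, b)]


def vadd_narrow_datapath(a, b, lanes_per_pass):
    """Temporal style: the same architectural add, spread over several passes."""
    out = []
    passes = 0
    for start in range(0, len(a), lanes_per_pass):
        chunk_a = a[start:start + lanes_per_pass]
        chunk_b = b[start:start + lanes_per_pass]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return out, passes


a = list(range(16))
b = [10] * 16
result, passes = vadd_narrow_datapath(a, b, lanes_per_pass=4)
print(result == vadd_full_width(a, b), passes)  # True 4
```

Software sees the same result either way; only the number of passes, and therefore the timing, differs. That is why the spatial/temporal distinction describes the implementation rather than the ISA.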

shash · 2mo ago

> The name, yes, but going by name is a bad idea as the V in AVX also stands for Vector.

Now I get your point after reading more of the linked page. Yes. It is very implementation specific.

One of the things about RVV (and in general any vector ISA) is that the data path can differ enough between implementations that specific rules of thumb for hand tuning most probably won’t carry over. As you say, this is true even of sufficiently advanced SIMD architectures like AVX.

actionfromafar · 2mo ago

Stripped down MMX? What's left then I wonder? :-D

gchadwick · 2mo ago

I've only taken a quick skim, but this looks like solid material!

RISC-V Vector is definitely tricky to get a handle on, especially if you just read the architecture documentation (which is to be expected really; a good specification for an architecture isn't compatible with a useful beginner's guide). I found I needed to look at some presentations given by various members of the vector working group to get a good grasp of the principles.

There's been precious little material beyond the specification and some now slightly old slide decks, so this is a great contribution.

veltas · 2mo ago

The problem RISC-V has is that there's no middle ground.

The specification for an architecture is meant to be useful to anyone writing assembly, not just to people implementing the spec. Case in point: the x86 manuals aren't meant for Intel, they're meant for Intel's customers.

There is a lot of cope re the fact RISC-V's spec is particularly hard to use for writing assembly or understanding the software model.

If the spec isn't a 'manual' then where's the manual? If there's just no manual then that's a deficiency. If we only have tutorials, that's bad as well: a manual is a good reference for an experienced user, and approachable to a slightly aware beginner (or a fresh beginner with experience in other architectures); a tutorial is too verbose to be useful as a regular reference.

Either the spec should have read (and still could read) more like a useful manual, or a useful manual needs to be provided.

geokon · 2mo ago

On a high level, do I understand correctly that SIMD is close to how the hardware works, while Vector Processor is more of an abstraction? The "Strip Mining" part looks like this translation to something SIMD-like. It seems like a good abstraction layer, but there is an implicit compilation step, right? (making the "assembly" more easily run on different actual hardware)

Someone · 2mo ago

> On a high level, do I understand correctly that SIMD is close to how the hardware works, while Vector Processor is more of an abstraction?

Not quite. It is still the same “process whatever number of items you can in parallel, decrease count by that, repeat if necessary” loop.

RISC-V decided to move the “decrease count by that, repeat if necessary” part into hardware, making the entire phrase “how the hardware works”.

Makes for shorter and nicer assembly. SIMD without it first has to query the CPU to find out how much parallelization it can handle (once) and do the “decrease count by that, repeat if necessary” part on the main CPU.

dzaima · 2mo ago

RVV still very much requires you to write a manual code/assembly loop doing the "compute how many elements can be handled, decrease count by that, repeat if necessary" thing. All it does is make it take slightly fewer instructions to do so (and also allows handling a loop's tail in the same loop while at it).
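
The loop described above can be sketched as follows. This is a simplified model, not real RVV code: `vsetvl` here is a hypothetical stand-in for the `vsetvli` instruction that simply grants min(requested, VLMAX) elements (the spec permits other choices of vl in some cases), and the inner Python loop stands in for the vector load, multiply-add and store that would each act on `vl` elements.

```python
def vsetvl(avl: int, vlmax: int) -> int:
    """Simplified model of vsetvli: grant min(requested, VLMAX) elements."""
    return min(avl, vlmax)


def saxpy_stripmined(a, x, y, vlmax=4):
    """y[i] += a * x[i], written as the loop that software still spells out."""
    n = len(x)
    i = 0
    iterations = 0
    while n > 0:
        vl = vsetvl(n, vlmax)      # how many elements this pass handles
        for j in range(vl):        # stands in for vector ops on vl elements
            y[i + j] += a * x[i + j]
        i += vl                    # advance the pointers
        n -= vl                    # decrease the remaining count
        iterations += 1            # then branch back while n != 0
    return iterations


x = list(range(10))
y = [1] * 10
print(saxpy_stripmined(2, x, y, vlmax=4))  # 3 passes: 4 + 4 + 2 elements
print(y)  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

The last pass gets `vl = 2`, so the tail is handled by the same loop body with no separate scalar cleanup loop.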

Joker_vD · 2mo ago

Yeah, except you don't need to rewrite that code every time a new AVX drops, and also don't need to bother to figure out what to do on older CPUs.

IIRC libc for x64 has several implementations of memcpy/memmove/strlen/etc. for different SSE/AVX extensions, which all get compiled in and shipped to your system; when libc is loaded for the first time, it figures out the latest extension that the CPU it's running on actually supports and then patches its exports to point to the fastest working implementations.
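
The selection step described above can be sketched roughly like this. It is a toy model of the idea only: the function names, the feature strings and the priority table are all hypothetical, and the "implementations" are identical stand-ins that just report which variant was picked.

```python
def memcpy_baseline(dst, src):
    dst[:len(src)] = src
    return "baseline"


def memcpy_sse2(dst, src):
    dst[:len(src)] = src
    return "sse2"


def memcpy_avx2(dst, src):
    dst[:len(src)] = src
    return "avx2"


def memcpy_avx512(dst, src):
    dst[:len(src)] = src
    return "avx512f"


# Hypothetical priority table: fastest variant first.
IMPLEMENTATIONS = [
    ("avx512f", memcpy_avx512),
    ("avx2", memcpy_avx2),
    ("sse2", memcpy_sse2),
]


def resolve_memcpy(cpu_features):
    """Pick the best variant the CPU supports; done once, at load time."""
    for feature, impl in IMPLEMENTATIONS:
        if feature in cpu_features:
            return impl
    return memcpy_baseline


memcpy = resolve_memcpy({"sse2", "avx2"})  # assumed feature set of the CPU
buf = [0] * 4
print(memcpy(buf, [1, 2, 3, 4]), buf)  # avx2 [1, 2, 3, 4]
```

Each new extension means another entry in the table and another hand-written variant, which is the maintenance cost a vector-length-agnostic loop avoids.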

geokon · 2mo ago

I mean, "move into hardware" is effectively more of a microcode translation/compilation step, right? How things ultimately get executed on the silicon is not fundamentally rearchitected, right?

I'm going to try to read through the full document carefully later :)) It's likely answered in there.

noodlesUK · 2mo ago

This is great!

I’d love a similar document for ARM NEON as well.

crest · 2mo ago

The Cray-1 frequency is wrong in the graphics. IIRC it had an 80 MHz clock (a 12.5 ns cycle time).

zozbot234 · 2mo ago

I wonder how this broadly compares with the new ARM64 SVE. Which is easier to adopt?

sylware · 3mo ago

owww! microsoft github is becoming a web app (aka only for the whatng cartel web engines); it is impossible to have a 'classic web' look at the repo. Must clone it now... thx microsoft, again.

Joker_vD · 2mo ago

The example in 1.13 probably would work better if the example with scalar instructions actually had, you know, more instructions than the one with the vector instructions. Otherwise, it's very taxing to read things like "static instruction count and dynamic instruction count both drop dramatically" when your eyes tell you that no, the static instruction count has actually increased.

Also, where does that 38-byte stride even come from? That number is not even divisible by 4, never mind by 8!
