Of course in modern architectures being able to do something in one instruction is only tenuously related to being able to do something quickly, but it was a super handy instruction back in the day.
Equally on such a system the only thing left for FENCE.I to do is to flush any (potentially now bogus) subsequent instructions that are in the execution pipe that might have been prefetched before the writes occurred. In such a system FENCE.I and IMPORT.I are identical.
Hopefully the people writing this spec are listening ... please make sure your spec understands high end systems like this and doesn't add stuff that requires special cases in systems that do ubiquitous coherency right.
Particular RISC-V Platform specs may end up requiring I/D coherency, like Arm is recommending in SBSA Level 6, but that's left for later, if ever.
To be very very clear: FJCVTZS does not do anything amazing, clever, or special. The problem it solves is very simple: the behaviour of double->int conversion in JS is the default x86 behaviour. Getting that behaviour on any non-x86 platform is expensive. So a more accurate name would be FXCVTZS. The implementation of FJCVTZS in a CPU is also not expensive, it simply requires passing a specific rounding mode to the FPU for the integer conversion (overriding the default/current global mode), and matching the x86 OOB result.
(Also I really wish people would stop posting to GitHub repos unless the repos have the actual readable spec available or linked, rather than the unbuilt markup version. It just makes reading them annoying.)
It seems like the objective of this is to implement different access privileges... but why do you need specialized instructions for this? This is typically done by the OS and memory protection. The pointer masking extension would be to have multiple levels of privilege within a single process? I'm assuming that this is to protect the JIT from a JITted program? Except it's not completely safe, because there might still be bugs in the JIT that could allow messing with the pointer tags. Struggling to think of a real use case.
It has already been used successfully for decades on Solaris SPARC; iOS/macOS and Android are increasingly pushing for it on ARM CPUs; Pluton uses it on Azure Sphere OS; ...
Seems to me this will have an execution overhead though, and that the best way to improve security would be to finally move beyond C. Most modern languages make buffer overflows impossible.
The basic idea is you often want finer than page-level granularity on memory access rights. An example ARM give in the documentation covering the ARM MTE is an allocator. With memory tagging you can make it so unallocated memory in the allocator is not accessible.
Essentially every piece of memory gets a tag, and you can only access a piece of memory through a pointer that has the matching tag. To illustrate, imagine an allocator (which is the example ARM have in the documentation for the ARM MTE).
Your allocator has a bunch of memory, and has all of it set to be tagless (uncolored in ARM terminology IIRC):
|bbbbbbbbbb|
When your allocator allocates a byte it does the following:
1. Find a free block
2. Choose a tag (randomly if it wants)
3. Set the tag on that memory to the tag selected in (2)
4. Return a pointer to that memory, tagged with the tag from (2)
So we get something like:
|1bbbbbbbbb|
p = (1,0) // pointer with a tag of 1 and the address 0
Now any access to the memory at address 0 must be via a pointer with the tag 1, and any memory accessed via that pointer must be tagged with 1.

So imagine you have a bunch of allocations:
|13251bbbbb|
You can see we've re-used a tag, because there is a finite amount of space for tags in a pointer. So while our original allocation was a 1-byte allocation at 0, we can do p[4] and the access will work. However, if we're choosing tags randomly, an attacker is in practice unlikely to luck out and guess the correct tag, so your process crashes instead (it's super important for these mechanisms that any failure results in an unstoppable crash, e.g. no signal handlers or anything). Another thing your allocator does is revert memory to being untagged (or, I guess, tagged distinctly) on free, so a use-after-free also cannot work.

In reality the tagging is not per byte, because that would be insane: MTE already means a significant increase in the physical RAM requirements for a system. If you have an N-bit tag, you need N extra bits of physical RAM for every granule. I don't know what sort of granule sizes people are looking at, but the overhead in physical RAM requirements is literally (granule size in bits + bits for tag)/(granule size in bits), so you can see how significant this is.
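A toy model of the scheme described above may help. This uses byte-granule tags purely for readability (real MTE uses a 4-bit tag per 16-byte granule), and every name here is invented for illustration, not an MTE API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Toy MTE-style tagging: one tag per byte of a tiny heap.
#define HEAP_SIZE 10
#define UNTAGGED  0

static uint8_t heap[HEAP_SIZE];
static uint8_t tags[HEAP_SIZE];   // shadow tag storage, one per granule

typedef struct { uint8_t tag; size_t addr; } tagged_ptr;

// Load through a tagged pointer: any tag mismatch must be an
// unstoppable crash, so abort() rather than anything catchable.
static uint8_t checked_load(tagged_ptr p, size_t i) {
    if (tags[p.addr + i] != p.tag) {
        fprintf(stderr, "tag check fault at %zu\n", p.addr + i);
        abort();
    }
    return heap[p.addr + i];
}

// Allocate: pick a random non-zero tag, colour the memory, and
// return a pointer coloured to match.
static tagged_ptr toy_alloc(size_t addr, size_t len) {
    uint8_t tag = 1 + (uint8_t)(rand() % 15);
    for (size_t i = 0; i < len; i++)
        tags[addr + i] = tag;
    return (tagged_ptr){ tag, addr };
}

// Free: recolour to untagged, so stale pointers now fault.
static void toy_free(tagged_ptr p, size_t len) {
    for (size_t i = 0; i < len; i++)
        tags[p.addr + i] = UNTAGGED;
}
```

The hardware does the checked_load comparison on every load and store, which is why the cost shows up in RAM (the tags array) rather than in extra instructions.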
Unlike PAC, my understanding is there is no cryptographic logic linking the tag to the pointer, so pointer arithmetic continues to work without overhead, whereas in a PAC model p += 1, say, would be: temp = AUTH(p); temp = temp + 1; p = SIGN(temp).
The purpose of PAC is not to protect the memory, but rather the pointer itself. For example imagine you have a C++ object, the basic layout is essentially:
struct {
void* vtable
data fields
}
For those unfamiliar, a vtable is essentially just a list of function pointers to support polymorphism. In this case the vtable pointer is tagged with the appropriate tag for wherever the vtable is. Because the vtable itself is stored in tagged memory it can't be modified by the attacker (in reality vtables are all in read-only memory, but pretend they're not for this example). But if the attacker can get some random, correctly tagged pointer, what they can do is build their own vtable in that memory, and then simply overwrite the vtable pointer with their correctly tagged pointer to the malicious vtable. Of course you can just have the memory holding the object itself also be tagged, so they need the correct pointer tagging for that :D

In the PAC model the pointer is signed with a secret key (it's literally inaccessible to the process) and a nonce (on Mac + iOS this nonce includes the address of the vtable pointer itself). For an attacker to create a valid pointer they need to be able to generate the correct signature over the bits in the pointer and the nonce. Because different nonces are used for pointers in different uses, they can't just get (for example) one object to overwrite another. If the nonce includes the address of the pointer, they can't even copy a validly signed pointer from another location in memory.
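A toy model of the sign/auth round trip makes the mechanics concrete. The architected version uses a proper keyed cipher (QARMA on typical ARM implementations) with a key software can't read; the mixing function and every name below are invented stand-ins:

```c
#include <assert.h>
#include <stdint.h>

// Toy PAC-style pointer signing: stash a keyed MAC of (address, nonce)
// in the unused top bits of a 64-bit pointer.
#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

static uint64_t secret_key = 0x5DEECE66DULL;   // stand-in for the HW key

// Cheap integer mixing, NOT a real cipher; returns a 16-bit-ish MAC
// that fits in the pointer's spare top bits.
static uint64_t toy_mac(uint64_t addr, uint64_t nonce) {
    uint64_t h = addr ^ nonce ^ secret_key;
    h *= 0x9E3779B97F4A7C15ULL;
    return h >> ADDR_BITS;
}

// SIGN: combine the address with its MAC.
static uint64_t toy_sign(uint64_t ptr, uint64_t nonce) {
    uint64_t addr = ptr & ADDR_MASK;
    return addr | (toy_mac(addr, nonce) << ADDR_BITS);
}

// AUTH: recompute and compare. A real CPU poisons the pointer or
// faults on mismatch; here we just return 0 so a caller can check.
static uint64_t toy_auth(uint64_t ptr, uint64_t nonce) {
    uint64_t addr = ptr & ADDR_MASK;
    if ((ptr >> ADDR_BITS) != toy_mac(addr, nonce))
        return 0;    // authentication failure
    return addr;
}
```

If the nonce includes the address where the pointer is stored, a validly signed pointer copied to a different slot authenticates against a different nonce and fails, which is the property that stops the copy-a-signed-pointer attack.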
I really do like the PAC model a lot, but to me the MTE mechanism seems to be a much stronger protection mechanism, albeit a very expensive one (PAC doesn't require additional ram for the signed pointers).
Also of note is that all the hardware vendors are adopting hardware memory tagging as the only way to fix C.
Intel messed up with MPX, but I definitely see them coming out with an alternative, as I bet they won't want to be seen as the only vendor left without such capabilities.
RISC-V basically eliminates a lot of microarchitectural state (flags), whereas AArch64 updates that state conditionally. We will find out which approach is superior soon.
FJCVTZS is an example of pragmatism: the JavaScript spec says float-to-int conversion should be done the way x86 does it; the original ARM FCVTZS (no J) didn't do it the same way, but JavaScript is so important that you have to add a special case.
I hope I'm not mischaracterising the RISC-V side, but I seem to recall their argument against things like FJCVTZS was that there should be some standard set of instructions that compilers should emit for that special case, and the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?). Which kinda felt like "we must keep the instruction set as simple as possible, even if it makes the implementation of high performance CPUs complex". See also the "compressed instructions" stuff, which feels again like passing the buck for complexity onto the CPU implementation side (unless it's just a Thumb-like 16 bit wide instruction set thing given a misleading name).
The compressed instructions are quite lightweight. It's generally an assembly level thing, and the decoder on the cpu side is apparently ~400 gates.
The compressed instructions are indeed a 16 bit wide thing, but fixing some of the flaws in Thumb. Generally they have more implicit operands or operands range over a subset of registers to fit in 16 bits.
But the hat trick is these two dovetail into each other, such that a sequence of compressed instructions can decompress into a fuse-able pair/tuple, which then decodes into a single internal micro op. This creates a way to handle common idioms and special cases without introducing an ever growing number of instructions. Or at least that's the basic claim by the RISC-V folks. I think they've done enough homework on this to not be trivially wrong, so it'll be interesting to see how things go.
I don't know what you think RISC-V "compressed instruction" means. It's precisely equivalent to ARM Thumb2 -- there are 16 bit opcodes and 32 bit opcodes, and you can tell which you have by looking at 2 bits (RISC-V) or 3 bits (Thumb2) in the first 16 bits of the instruction.
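That 2-bit test is small enough to state directly. A sketch in C of the length rule from the RISC-V base spec (the function name is invented): if the low two bits of the first 16-bit parcel are both 1, the instruction is at least 32 bits; any other value means a 16-bit compressed instruction.

```c
#include <assert.h>
#include <stdint.h>

// RISC-V instruction-length test: low two bits of the first parcel
// equal to 0b11 means a (>=) 32-bit instruction; 0b00, 0b01, or 0b10
// means a 16-bit compressed one.
static int insn_is_compressed(uint16_t first_parcel) {
    return (first_parcel & 0x3) != 0x3;
}
```

This is why the decode cost of the C extension is so small: length determination needs only two bits, before any opcode decoding happens.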
I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.
As for "making the implementation of high performance CPUs complex" … high end CPUs are unavoidably complex. A little bit more is not a big deal. On the other hand, adding complexity to low end CPUs can easily be a complete deal-killer. Splitting an instruction into µops might be a little simpler than combining instructions into macro-ops, but it's not as simple as not having to do it.
Ironically, the people who criticise RISC-V for talking about macro-op fusion seem to be ignorant of the fact that no currently shipping RISC-V SoC does macro-op fusion [1], while every current higher end ARM and X86 does do macro-op fusion of compare (and maybe other ALU) instructions with a following conditional branch instruction.
[1] SiFive U74 can tie together a forward conditional branch over a single integer ALU instruction with that following instruction. They pass down the two execution pipes in parallel (occupying both i.e. they are still two instructions, not a macro-op). The ALU instruction executes regardless, but the conditional branch controls whether the result is written back. i.e. it effectively converts a branch into predication
Detecting a long fixed sequence of instructions and "compressing" them into one internal operation seems like it would require a lot of fetch bandwidth and/or a really wide decoder. x86 has had macro-fusion since Core Solo/Duo.
There is nothing magic about it.
A more correct name for FJCVTZS would be FXCVTZS. What FJCVTZS does is override the default FPU rounding and signalling behaviour for double-to-integer conversion to match x86. There is no special logic needed in the FPU: instead of the instruction passing along the current thread's FPU rounding and clamping flags, it passes flags that exactly match the x86 behaviour.
That's it.
Because the JS label is inaccurate, everyone believes it to be useless outside of JS, when in reality it's useful to anything that needs x86 behavior for double->int conversion, so any x86 emulator on ARM (QEMU, presumably the translation runtimes, etc.).
God I hate that they named it that.
A good comparison is R7RS with Scheme. The vast majority of it is optional SRFIs that exist for the sake of consistency and aren't implemented by most Schemes. The "mandatory" parts are specified via R7RS-small, and work is being done on R7RS-large, though even that won't contain every SRFI.
I could see us ending up with an equivalent for RISC-V where a common group of extensions get grouped together as a standard (likely including stuff like virtualization support but excluding vector operations).