Of course in modern architectures being able to do something in one instruction is only tenuously related to being able to do something quickly, but it was a super handy instruction back in the day.
Equally on such a system the only thing left for FENCE.I to do is to flush any (potentially now bogus) subsequent instructions that are in the execution pipe that might have been prefetched before the writes occurred. In such a system FENCE.I and IMPORT.I are identical.
Hopefully the people writing this spec are listening ... please make sure your spec understands high end systems like this and doesn't add stuff that requires special cases in systems that do ubiquitous coherency right.
Particular RISC-V Platform specs may end up requiring I/D coherency, like Arm is recommending in SBSA Level 6, but that's left for later, if ever.
To be very very clear: FJCVTZS does not do anything amazing, clever, or special. The problem it solves is very simple: the behaviour of double->int conversion in JS is the default x86 behaviour. Getting that behaviour on any non-x86 platform is expensive. So a more accurate name would be FXCVTZS. The implementation of FJCVTZS in a CPU is also not expensive, it simply requires passing a specific rounding mode to the FPU for the integer conversion (overriding the default/current global mode), and matching the x86 OOB result.
(Also I really wish people would stop posting to GitHub repos unless the repos have the actual readable spec available or linked, rather than the unbuilt markup version. It just makes reading them annoying.)
It seems like the objective of this is to implement different access privileges... but why do you need specialized instructions for this? This is typically done by the OS and memory protection. The pointer masking extension would be to have multiple levels of privilege within a single process? I'm assuming that this is to protect the JIT from a JITted program? Except it's not completely safe, because there might still be bugs in the JIT that could allow messing with the pointer tags. Struggling to think of a real use case.
It has already been used successfully for decades on Solaris SPARC; iOS/macOS and Android are increasingly pushing for it on ARM CPUs; Pluton uses it on Azure Sphere OS; ...
Seems to me this will have an execution overhead though, and that the best way to improve security would be to finally move beyond C. Most modern languages make buffer overflows impossible.
The basic idea is you often want finer than page-level granularity on memory access rights. An example ARM give in the documentation covering the ARM MTE is an allocator. With memory tagging you can make it so unallocated memory in the allocator is not accessible.
Essentially every piece of memory gets a tag, and you can only access a piece of memory through a pointer that has the matching tag. To illustrate, imagine an allocator (which is the example ARM have in the documentation for the ARM MTE).
Your allocator has a bunch of memory, and has all of it set to be tagless (uncolored in ARM terminology IIRC):
|bbbbbbbbbb|
When your allocator allocates a byte it does the following:
1. Find a free block
2. Choose a tag (randomly if it wants)
3. Set the tag on that memory to the tag selected in (2)
4. Return a pointer to that memory, tagged with the tag from (2)
So we get something like:
|1bbbbbbbbb|
p = (1,0) // pointer with a tag of 1 and the address 0
Now any access to the memory at address 0 must be via a pointer with the tag 1, and any memory accessed via that pointer must be tagged with 1.

So imagine you have a bunch of allocations:
|13251bbbbb|
You can see we've re-used a tag, because there is a finite amount of space for tags in a pointer. So while our original allocation was a 1-byte allocation at 0, we can do p[4] and the access will work. However, if we're choosing tags randomly, an attacker is in practice unlikely to luck out and guess the correct tag, so your process crashes instead (it's super important for these mechanisms that any failure results in an unstoppable crash, e.g. no signal handlers or anything). Another thing your allocator does is revert memory to being untagged (or, I guess, tagged distinctly) on free, so a use-after-free also cannot work.

In reality the tagging is not per byte, because that would be insane: MTE already means a significant increase in the physical RAM requirements for a system. If you have an N-bit tag, you need N extra bits of physical RAM for every granule. I don't know what sort of granule sizes people are looking at, but the overhead in physical RAM requirements is literally (granule size in bits + bits for tag)/(granule size in bits), so you can see how significant this is.
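A toy model of the scheme described above may help. This uses byte-granule tags purely for readability (real MTE uses a 4-bit tag per 16-byte granule), and every name here is invented for illustration, not an MTE API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Toy MTE-style tagging: one tag per byte of a tiny heap.
#define HEAP_SIZE 10
#define UNTAGGED  0

static uint8_t heap[HEAP_SIZE];
static uint8_t tags[HEAP_SIZE];   // shadow tag storage, one per granule

typedef struct { uint8_t tag; size_t addr; } tagged_ptr;

// Load through a tagged pointer: any tag mismatch must be an
// unstoppable crash, so abort() rather than anything catchable.
static uint8_t checked_load(tagged_ptr p, size_t i) {
    if (tags[p.addr + i] != p.tag) {
        fprintf(stderr, "tag check fault at %zu\n", p.addr + i);
        abort();
    }
    return heap[p.addr + i];
}

// Allocate: pick a random non-zero tag, colour the memory, and
// return a pointer coloured to match.
static tagged_ptr toy_alloc(size_t addr, size_t len) {
    uint8_t tag = 1 + (uint8_t)(rand() % 15);
    for (size_t i = 0; i < len; i++)
        tags[addr + i] = tag;
    return (tagged_ptr){ tag, addr };
}

// Free: recolour to untagged, so stale pointers now fault.
static void toy_free(tagged_ptr p, size_t len) {
    for (size_t i = 0; i < len; i++)
        tags[p.addr + i] = UNTAGGED;
}
```

The hardware does the checked_load comparison on every load and store, which is why the cost shows up in RAM (the tags array) rather than in extra instructions.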
Unlike PAC, my understanding is there is no cryptographic logic linking the tag to the pointer, so pointer arithmetic continues to work without overhead, whereas in a PAC model p += 1, say, would be: temp = AUTH(p); temp = temp + 1; p = SIGN(temp).
The purpose of PAC is not to protect the memory, but rather the pointer itself. For example imagine you have a C++ object, the basic layout is essentially:
struct {
void* vtable
data fields
}
For those unfamiliar, a vtable is essentially just a list of function pointers to support polymorphism. In this case the vtable pointer is tagged with the appropriate tag for wherever the vtable is. Because the vtable itself is stored in tagged memory it can't be modified by the attacker (in reality vtables are all in read-only memory, but pretend they're not for this example). But if the attacker can get some random, correctly tagged pointer, what they can do is build their own vtable in that memory, and then simply overwrite the vtable pointer with their correctly tagged pointer to the malicious vtable. Of course you can just have the memory holding the object itself also be tagged, so they need the correct pointer tagging for that :D

In the PAC model the pointer is signed with a secret key (it's literally inaccessible to the process) and a nonce (on Mac + iOS this nonce includes the address of the vtable pointer itself). For an attacker to create a valid pointer they need to be able to generate the correct signature over the bits in the pointer and the nonce. Because different nonces are used for pointers in different uses, they can't just get (for example) one object to overwrite another. If the nonce includes the address of the pointer, they can't even copy a validly signed pointer from another location in memory.
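A toy model of the sign/auth round trip makes the mechanics concrete. The architected version uses a proper keyed cipher (QARMA on typical ARM implementations) with a key software can't read; the mixing function and every name below are invented stand-ins:

```c
#include <assert.h>
#include <stdint.h>

// Toy PAC-style pointer signing: stash a keyed MAC of (address, nonce)
// in the unused top bits of a 64-bit pointer.
#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

static uint64_t secret_key = 0x5DEECE66DULL;   // stand-in for the HW key

// Cheap integer mixing, NOT a real cipher; returns a 16-bit-ish MAC
// that fits in the pointer's spare top bits.
static uint64_t toy_mac(uint64_t addr, uint64_t nonce) {
    uint64_t h = addr ^ nonce ^ secret_key;
    h *= 0x9E3779B97F4A7C15ULL;
    return h >> ADDR_BITS;
}

// SIGN: combine the address with its MAC.
static uint64_t toy_sign(uint64_t ptr, uint64_t nonce) {
    uint64_t addr = ptr & ADDR_MASK;
    return addr | (toy_mac(addr, nonce) << ADDR_BITS);
}

// AUTH: recompute and compare. A real CPU poisons the pointer or
// faults on mismatch; here we just return 0 so a caller can check.
static uint64_t toy_auth(uint64_t ptr, uint64_t nonce) {
    uint64_t addr = ptr & ADDR_MASK;
    if ((ptr >> ADDR_BITS) != toy_mac(addr, nonce))
        return 0;    // authentication failure
    return addr;
}
```

If the nonce includes the address where the pointer is stored, a validly signed pointer copied to a different slot authenticates against a different nonce and fails, which is the property that stops the copy-a-signed-pointer attack.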
I really do like the PAC model a lot, but to me the MTE mechanism seems to be a much stronger protection mechanism, albeit a very expensive one (PAC doesn't require additional ram for the signed pointers).
Also of note is that all the hardware vendors are adopting hardware memory tagging as the only way to fix C.
Intel messed up with MPX, but I definitely see them coming out with an alternative, as I bet they won't want to be seen as the only vendor left without such capabilities.
RISC-V basically eliminates a lot of microarchitectural state (flags), whereas AArch64 updates that state conditionally. We will find out which approach is superior soon.
FJCVTZS is an example of pragmatism: the JavaScript spec says float-to-int conversion should be done the way x86 does it; the original ARM FCVTZS (no J) didn't do it the same way, but JavaScript is so important that you have to add a special case.
I hope I'm not mischaracterising the RISC-V side, but I seem to recall their argument against things like FJCVTZS was that there should be some standard set of instructions that compilers should emit for that special case, and the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?). Which kinda felt like "we must keep the instruction set as simple as possible, even if it makes the implementation of high performance CPUs complex". See also the "compressed instructions" stuff, which feels again like passing the buck for complexity onto the CPU implementation side (unless it's just a Thumb-like 16 bit wide instruction set thing given a misleading name).
The compressed instructions are quite lightweight. It's generally an assembly level thing, and the decoder on the cpu side is apparently ~400 gates.
The compressed instructions are indeed a 16 bit wide thing, but fixing some of the flaws in Thumb. Generally they have more implicit operands or operands range over a subset of registers to fit in 16 bits.
But the hat trick is these two dovetail into each other, such that a sequence of compressed instructions can decompress into a fuse-able pair/tuple, which then decodes into a single internal micro op. This creates a way to handle common idioms and special cases without introducing an ever growing number of instructions. Or at least that's the basic claim by the RISC-V folks. I think they've done enough homework on this to not be trivially wrong, so it'll be interesting to see how things go.
I don't know what you think RISC-V "compressed instruction" means. It's precisely equivalent to ARM Thumb2 -- there are 16 bit opcodes and 32 bit opcodes, and you can tell which you have by looking at 2 bits (RISC-V) or 3 bits (Thumb2) in the first 16 bits of the instruction.
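That 2-bit test is small enough to state directly. A sketch in C of the length rule from the RISC-V base spec (the function name is invented): if the low two bits of the first 16-bit parcel are both 1, the instruction is at least 32 bits; any other value means a 16-bit compressed instruction.

```c
#include <assert.h>
#include <stdint.h>

// RISC-V instruction-length test: low two bits of the first parcel
// equal to 0b11 means a (>=) 32-bit instruction; 0b00, 0b01, or 0b10
// means a 16-bit compressed one.
static int insn_is_compressed(uint16_t first_parcel) {
    return (first_parcel & 0x3) != 0x3;
}
```

This is why the decode cost of the C extension is so small: length determination needs only two bits, before any opcode decoding happens.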
I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.
As for "making the implementation of high performance CPUs complex" … high end CPUs are unavoidably complex. A little bit more is not a big deal. On the other hand, adding complexity to low end CPUs can easily be a complete deal-killer. Splitting an instruction into µops might be a little simpler than combining instructions into macro-ops, but it's not as simple as not having to do it.
Ironically, the people who criticise RISC-V for talking about macro-op fusion seem to be ignorant of the fact that no currently shipping RISC-V SoC does macro-op fusion [1], while every current higher end ARM and X86 does do macro-op fusion of compare (and maybe other ALU) instructions with a following conditional branch instruction.
[1] SiFive U74 can tie together a forward conditional branch over a single integer ALU instruction with that following instruction. They pass down the two execution pipes in parallel (occupying both i.e. they are still two instructions, not a macro-op). The ALU instruction executes regardless, but the conditional branch controls whether the result is written back. i.e. it effectively converts a branch into predication
Detecting a long fixed sequence of instructions and "compressing" them into one internal operation seems like it would require a lot of fetch bandwidth and/or a really wide decoder. x86 has had macro-fusion since Core Solo/Duo.
There is nothing magic about it.
A more correct name for FJCVTZS would be FXCVTZS. What FJCVTZS does is override the default FPU rounding and signalling behaviour for double-to-integer conversion to match x86. There is no special logic needed in the FPU: instead of the instruction passing along the current thread's FPU rounding and clamping flags, it passes flags that exactly match the x86 behaviour.
That's it.
Because the JS label is inaccurate, everyone believes it to be useless outside of JS, when in reality it's useful to anything that needs x86 behavior for double->int conversion, so any x86 emulator on ARM (QEMU, presumably the translation runtimes, etc.).
God I hate that they named it that.
A good comparison is R7RS with Scheme. The vast majority of it is optional SRFIs that exist for the sake of consistency and aren't implemented by most Schemes. The "mandatory" parts are specified via R7RS-small, and work is being done on R7RS-large, though even that won't contain every SRFI.
I could see us ending up with an equivalent for RISC-V where a common group of extensions get grouped together as a standard (likely including stuff like virtualization support but excluding vector operations).