But ultimately, the gist of their argument is this:
>Any task will require more RISC-V instructions than any contemporary instruction set.
Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
I am familiar with many tens of instruction sets, from the first vacuum-tube computers up to all the important instruction sets that are still in use, and there is no doubt that RISC-V requires more instructions and a larger code size than almost all of them, for any task.
Even the hard-to-believe "research" results published by RISC-V developers have always shown worse code density than ARM; the so-called better results were for the compressed extension, not for the normal encoding.
Moreover, the results for RISC-V are hugely influenced by the programming language and the compiler options that are chosen. RISC-V has an acceptable code size only for unsafe code; if the programming language or the compiler options require run-time checks to ensure safe behavior, then the RISC-V code size increases enormously, while for other CPUs it barely changes.
The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.
Except for this good feature, the rest of the ISA is full of bad features which frequently require at least 2 instructions where any other CPU needs 1, e.g. the lack of indexed addressing, which any loop that accesses an aggregate data structure needs in order to be implemented with a minimum number of instructions.
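For a concrete sketch of the indexed-addressing point (hypothetical C, function names mine): with a base+scaled-index addressing mode, the load in the first loop below can be one instruction; without one, a separate address computation is needed each iteration, which is why compilers targeting base RISC-V tend to rewrite such loops into the pointer-bumping form of the second function.

```c
#include <assert.h>
#include <stddef.h>

/* Summing an array two ways. With indexed addressing (base + i*8),
   the load inside sum_indexed can be a single instruction on ARM or
   x86; base RISC-V has no such mode, so each iteration would need an
   extra shift/add to form the address. */
long sum_indexed(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];          /* load at address a + i*sizeof(long) */
    return total;
}

/* The strength-reduced form compilers prefer on RISC-V: a plain load
   plus a pointer increment per iteration, no address arithmetic. */
long sum_strength_reduced(const long *a, size_t n) {
    long total = 0;
    for (const long *p = a, *end = a + n; p != end; p++)
        total += *p;
    return total;
}
```

Both compute the same result; the disagreement in the thread is only about how many instructions the loop body costs on each ISA.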
>Even the hard-to-believe "research" results published by RISC-V developers have always showed worse code density than ARM
The code size advantage of RISC-V is not artificial academic bullshit. It is real, it is huge, and it is trivial to verify. Just build any non-trivial application from source with a common compiler (such as GCC or LLVM's clang) and compare the sizes you get. Or look at the sizes of binaries in Linux distributions.
>the so-called better results were for the compressed extension, not for the normal encoding.
The C extension can be used anywhere, as long as the CPU supports the extension; most RISC-V profiles require it. This is in stark contrast with ARMv7's Thumb, which was a literally separate CPU mode. Effort was put into making this very cheap for the decoder.
The common patterns where the number of instructions is larger are made irrelevant by fusion. RISC-V has been thoroughly designed with fusion in mind, and is unique in this regard. It is within its rights in calling itself the 5th-generation RISC ISA because of this, even if everything else is ignored.
Fusion will turn most of these "2 instructions instead of one" cases into what is effectively one instruction from the execution unit's perspective. There are opportunities for fusion everywhere; the patterns are designed in. The cost of fusion on RISC-V is also very low, often quoted as 400 gates, allowing even simpler microarchitectures to implement it.
https://news.ycombinator.com/item?id=25554865
https://news.ycombinator.com/item?id=25554779
Is the Googrilla search engine really starting to suck more and more, or is there something else going on in this case?
The threads read more like an incomplete explanation with a polarized view than anything useful for understanding what fusion means in this context.
Overall I give the ranking a score of D-.
This is disingenuous. arm32's Thumb-2 (which has been around since 2003) supports both 16-bit and 32-bit instructions in a single mode, making it directly comparable to RV32C.
And then they get combined in the CPU, right?
Won't those instructions need to be fetched / occupy cache?
Ignoring RISC-V’s compressed encoding seems a rather artificial restriction.
The "C" extension is technically optional, but I'm not aware of anyone who has made or sold a production chip without it -- generally only student projects or tiny cores for FPGAs running very simple programs don't have it.
My estimate is if you have even 200 to 300 instructions in your code it's cheaper to implement "C" than to build the extra SRAM/cache to hold the bigger code without it.
The compressed RISC-V encoding must be compared with the ARMv8-M encoding not with the ARMv8-A.
The base 32-bit RISC-V encoding may be compared with the ARMv8-A, because only the 32-bit encoding can have comparable performance.
All the comparisons where RISC-V has better code density compare the compressed encoding with the 32-bit ARMv8-A. This is a classic example of apples-to-oranges, because the compressed encoding will never have performance in the same league as ARMv8-A.
When the comparisons are matched, 16-bit RISC-V encoding with 16-bit ARMv8-M and 32-bit RISC-V with 32-bit ARMv8-A, RISC-V always loses in code density in both comparisons, because only the RISC-V branch instructions are frequently shorter than those of ARM, while all the other instructions are frequently longer.
There are good reasons to use RISC-V for various purposes, where either the lack of royalties or the easy customization of the instruction set are important, but claiming that it should be chosen not because it is cheaper, but because it is supposedly better, looks like a case of sour grapes.
The value of RISC-V is not in its instruction set, because there are thousands of people who could design better ISAs in a week of work.
What is valuable about RISC-V is the set of software tools, compilers, binutils, debuggers etc. While a better ISA can be done in a week, recreating the complete software environment would need years of work.
Which isn't really a big advantage, because ARM and x86 macro-op fuse those instructions together. (That is, those 2-instructions are decoded and executed as 1x macro-op in practice).
cmp/jnz on x86 is like, 4 bytes as well. So 4 bytes on x86 vs 4 bytes on RISC-V. 1 macro-op on x86 vs 1 instruction on RISC-V.
So they're equal in practice.
-----
ARM is 8-bytes, but macro-op decoded. So 1-macro op on ARM but 8-bytes used up.
For x86, cmp/jnz must be 5 bytes for short loops or 9 bytes for long loops, because the REX prefix is normally needed. x86 does not have address modes with auto-update, like ARM or POWER, so for a minimum number of instructions the loop counter must also be used as an index register, to eliminate the instructions for updating the indices.
Because of that, the loop counter must use the full 64-bit register even if it is certain that the loop count would fit in 32-bit. That needs the REX prefix, so the fused instruction pair needs either 5 bytes (for 7-bit branch offsets) or 9 bytes, in comparison with 4 bytes for RISC-V.
So RISC-V gains about 1 byte in every 20 bytes from the branch instructions, i.e. about 5%, but then it loses more than this on other instructions, so it ends up with a code size larger than Intel/AMD's by between 10% and 50%.
You have no idea what you're talking about. I've worked on designs with both ARM and RISC-V cores. The RISC-V core outperforms the ARM core, with a smaller gate count, and has similar or higher code density in real-world code, depending on the extensions supported. The only way you get much lower code density is without the C extension, but I haven't seen it left unimplemented in a real-world commercial core, and where it was, I'm sure there was a reason (FPGAs sometimes use ultra-simple cores for some tasks, and don't always care about instruction throughput or density).
It should be said that my experience is in embedded, so yes, it's unsafe code. But the embedded use-case is also the most mature. I wouldn't be surprised if extensions that help with safer programming languages were added for desktop/server class CPUs, if they haven't been already (I haven't followed the development of the spec that closely recently).
> You have no idea what you're talking about.
> It should be said that my experience is in embedded, so yes, it's unsafe code.
Just going based off your reply it certainly sounds like they had at least some idea what they were talking about? In which case omitting that sentence would probably help.
I have no horse in the technical race here, but I certainly am put off from reading what should be an intellectually stimulating discussion by the nature of replies like this.
The linked message is about the carry propagation pattern used in GMP. As I understand it, optimized bignum algorithms accumulate carry bits and propagate them in bulk, and don't benefit from an optimal one-bit-at-a-time carry propagation pattern.
What are your thoughts on the way RISC-V handled the compressed instructions subset?
Put it side-by-side with Thumb and it also looks pretty similar (thumb has a multiply instruction IIRC).
Put it side-by-side with short x86 instructions accounting for the outdated ones and the list is pretty similar (down to having 8 registers).
All in all, when old and new instruction sets are taking the same approach, you can be reasonably sure it's not the absolute worst choice.
Higher-level languages rely heavily on inlining to reduce their abstraction penalty. Profiles which were taken from the Linux kernel and (checks notes...) Dhrystone are not representative of code from higher-level languages.
3/4 of the available prefix instruction space was consumed by the 16-bit extension. There have been a couple of proposals showing that even better density could be achieved using only 1/2 the space instead of 3/4, but they were struck down in order to maintain backwards compatibility.
With this extension, RISC-V can be competitive with ARM Cortex-M.
On the other hand, the compressed instruction encoding is useless for general-purpose computers intended as personal computers or as servers, because it limits the achievable performance to much lower levels than for ARMv8-A or Intel/AMD.
It's perfectly possible to have read the spec and disagree with the rationale provided. RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.
> Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
This doesn't seem to be true when you actually do an apples-to-apples comparison.
Taking as an example the build of Bash in Debian Sid (https://packages.debian.org/sid/shells/bash). I chose this because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other examples like the Linux kernel are harder to compare because the code in question is different across architectures. I saw the same trend in the GCC package, so it's not an isolated example.
riscv64 installed size: 6,157.0 kB
amd64 installed size: 6,450.0 kB
arm64 installed size: 6,497.0 kB
armhf installed size: 6,041.0 kB
RV64 is outperforming the other 64-bit architectures, but under-performing 32-bit ARM. This is consistent with expectations: amd64 has a size penalty due to REX bytes, arm64 got rid of compressed instructions to enable higher performance, and armhf (32-bit) has smaller constants embedded in the binary.
Compressed instructions definitely do work for making code smaller, and that's part of why arm32 has been very successful in the embedded space, and why that space hasn't been rushing to adopt arm64. For arm32, however, compressed instructions proved to be a limiting factor on high performance implementation, and arm64 moved away from them because of it. Maybe that's due to some particular limitations of arm32's compressed instructions that RISC-V compressed instructions won't suffer from, but that remains to be proven.
To compare the code sizes, you need tools like "size", "readelf" etc. and the data given by the tools should still be studied, to see how much of the code sections really contain code.
I have never seen until now a program where the RISC-V variant is smaller than the ARMv8 or Intel/AMD variant, and I doubt very much that such a program can exist. Except for the branches, where RISC-V frequently needs only 4 bytes instead of 5 bytes for Intel/AMD or 8 bytes for ARMv8, for all the other instructions it is very frequent to need 8 bytes for RISC-V instead of 4 bytes for ARMv8.
Moreover, choosing compiler options like -fsanitize for RISC-V increases the number of instructions dramatically, because there is no hardware support for things like overflow detection.
And yet you're quite confident that RISC-V has poor code density. So you clearly have a source of knowledge that others don't. If it's a blog/article/research, could you share a link? If it's personal experimentation, you should write a blog post, I would totally read that.
Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field? Obviously ISAs are a low level detail and far removed from the end product, but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon, before ubiquitous internet and the open source culture could establish itself. We've now got companies using genetic algorithms to optimize ICs and people making their own semiconductors in the 100s of microns range in their garages - maybe it's time to rethink some things!
(EE in a past life but little experience designing ICs so I feel like I'm talking out of my rear end)
Well, it's exactly what many RISC-V folks are trying to do. There's news about a new high performance RISC-V core on the HN front page right now!
> but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon,
I just want to note that ARM64 was a mostly clean break from prior versions of ARM. Basically a clean slate design started in the late 2000s. It's a modern design built with the same hindsight and approximate market conditions available to the designers of RISC-V.
text data bss dec hex filename
311218 2284 36 313538 4c8c2 arm-linux-gnueabihf/libgmp.so.10
374878 4328 56 379262 5c97e riscv64-linux-gnu/libgmp.so.10
480289 4624 56 484969 76669 aarch64-linux-gnu/libgmp.so.10
511604 4720 72 516396 7e12c x86_64-linux-gnu/libgmp.so.10
Strange, that's not what I see.

Fusing instructions isn't just theoretical either. I'm pretty sure it is or will be a common optimisation for CPUs aiming for high performance. How exactly is two easily-fused 16-bit instructions worse than one 32-bit one? Is there really a practical difference other than the name of the instruction(s)?
At the same time, the reduced transistor count you get from a simpler instruction set is not a benefit to be just dismissed either. I'm starting to see RISC-V cores being put all over the place in complex microcontrollers, because they're so damn cheap, yet have very decent performance. I know a guy developing a RISC-V core. He was involved with the proposal for a couple of instructions that would put the code density above Thumb for most code, and the performance of his core was better than Cortex-M0 at a similar or smaller gate count. I'm not sure if the instructions were added to the standard or not, though.
Even for high performance CPUs, there's a case to be made for requiring fewer transistors for the base implementation. It makes it easier to make low-power low-leakage cores for the heterogeneous architecture (big.little, M1, etc.) which is becoming so popular.
Funny, I thought the whole thing was bitching that RISC V has no carry flag which obviously causes multi word arithmetic to take more instructions. The obvious workaround is to use half-words and use the upper half for carry. There may be better solutions, but at twice the number of instructions this "dumb" method is better than what the author did.
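The half-word workaround can be sketched in C (a minimal sketch, with hypothetical function names; real bignum libraries like GMP are far more tuned than this): keep 32-bit digits in 64-bit limbs so an add can never overflow the limb, and read the carry out of the upper half.

```c
#include <assert.h>
#include <stdint.h>

/* "Dumb" half-word bignum add: each 64-bit limb holds a 32-bit digit,
   so a[i] + b[i] + carry cannot overflow, and the carry bit simply
   lands in the upper half of the sum. No flags register needed. */
void add_halfword(uint64_t *r, const uint64_t *a, const uint64_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i] + carry;   /* fits in 64 bits */
        r[i]  = s & 0xFFFFFFFFu;            /* low 32 bits: the digit */
        carry = s >> 32;                    /* upper half: the carry */
    }
}
```

The cost, as noted, is twice as many limbs (and so roughly twice as many instructions) for the same number size.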
Flags were removed because they cause a lot of unwanted dependencies and contention in hardware designs and they aren't even part of any high level language.
I still think instead of compare-and-branch they should have made an "if" which would execute the following instruction only if true. But that's just an opinion. I also hate the immediate constants (12 bits?) inside the instruction. Nothing wrong with 16-, 32- or 64-bit immediate data after the opcode.
I hope RISC 6 will come along down the road (not soon) and fix a few things. But I like the lack of flags...
RISC-V basically says "let's make the implicit explicit", and you have to essentially use registers to store the carry information when operating on bigints. Which for the current impl means chaining more instructions.
Is that correct?
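That's roughly it, as I understand it. In C it looks like this (a sketch, function names mine): without a flags register, the carry out of a full-word add is recovered with an unsigned comparison, which maps to RISC-V's sltu instruction, and then added into the next limb.

```c
#include <assert.h>
#include <stdint.h>

/* Full-word bignum add without a carry flag: the carry is derived
   explicitly. If a + b wrapped, the sum is smaller than a, which is
   exactly what sltu computes on RISC-V. At most one of c1/c2 can be
   set, so OR-ing them gives the carry out of this limb. */
void add_fullword(uint64_t *r, const uint64_t *a, const uint64_t *b,
                  int n, uint64_t *carry_out) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s  = a[i] + b[i];
        uint64_t c1 = s < a[i];        /* carry from a + b (sltu) */
        r[i] = s + carry;
        uint64_t c2 = r[i] < s;        /* carry from adding carry-in */
        carry = c1 | c2;
    }
    *carry_out = carry;
}
```

Each limb costs an extra compare or two versus an add-with-carry instruction, which is the extra chaining being complained about.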
That sounds like what the FP crowd is always talking about - eschewing shared state so it's easier to reason about, optimize, parallelize, etc.
Nevertheless, the ISA speaks for itself. The goal of a technical project is to produce a technical artifact, not to generate good feelings having followed a "solid" process.
If the process you followed brought you to this, of what use was the process?
Also, the godbolt.org compiler explorer has Risc-V support: useful for someone interested in comparing specific snippets of code.
https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...
Instruction fusion has no effect on code size, but only on execution speed.
For example RISC-V has combined compare-and-branch instructions, while the Intel/AMD ISA does not have such instructions, but all Intel & AMD CPUs fuse the compare and branch instruction pairs.
So there is no speed difference, but the separate compare and branch instructions of Intel/AMD remain longer at 5 bytes, instead of the 4 bytes of RISC-V.
Unfortunately for RISC-V, this is the only example favorable for it, because for a large number of ARM or Intel/AMD instructions RISC-V needs a pair of instructions or even more instructions.
Fusing instructions will not help RISC-V with the code density, but it is the only way available for RISC-V to match the speed of other CPUs.
Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance.
The context here is the implementation of one of the inner loops of a high-performance arbitrary-precision arithmetic library (GMP); in RISC-V the loop has 3x the instruction count it has in competing architectures.
“The compiler” is not relevant, this is by design stuff that the compiler is not supposed to touch because it’s unlikely to have the necessary understanding to get it as tight and efficient as possible.
Wait, if we are talking about actual ISA instructions, why is it hard to believe that RISC-V would have more of them ? The argument in favor of RISC is to simplify the frontend because even for a complex ISA like x86, the instructions will get converted to many micro-ops. In terms of actual ISA instructions, it seems quite reasonable that x86 would have fewer of those (at the cost of frontend complexity).
Doing it using a small pool of instructions, too (as RISC-V does), is the cherry on top.
Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions despite the fact that complexity has its own downsides.
RISC-V has been designed aggressively to have minimal ISA to leave plenty of room to grow, and require minimal number of transistors for a minimal solution.
Should this be a showstopper down the road, then there will be plenty of space to add an extension that fixes this problem. Meanwhile embedded systems paying a premium for transistors are not going to have to pay for these extra instructions, as only 47 instructions have to be implemented in a minimal solution.
I think in 10-20 years everyone will agree that all the "bad" RISC-V decisions don't matter. The same way x86 (CISC) was supposed to be bad because of legacy/backwards compatibility.
It's a trade-off - and the one that's been made makes it possible to make ALL instructions a little faster at the expense of one particular case that isn't used much - that's how you do computer architecture: you look at the whole, not just one particular case.
RISC-V also specifies a 128-bit variant that is of course FASTER than these examples.
I wish there was a way out.
Language features are also often implemented at least partly because they can be done efficiently on the premiere hardware for the language. Then new hardware can make such features hard to implement.
WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.
I'm sure that people brought up the overflow situation to the RISC-V designers, and it was similarly dismissed. It's just unfortunate that legacy software is such a big driver of CPU features as that's a race towards lowest-common-denominator hardware.
Main one is interrupt on overflow.
Can you refresh my memory here? What exactly is different about Wasm return values than any other function-oriented language?
That's probably not true in the usual case. Most architectures are 64-bit nowadays. If you are working on something that isn't 64-bit you are doing embedded stuff, and different rules and coding standards apply (like using embedded assembler rather than pure C or Rust). In 64-bit environments only pointers are 64 bits by default; almost all integers remain 32-bit. Checking for a 32-bit overflow on a 64-bit RISC-V machine takes the same number of instructions as everywhere else. Also, in C integers are very common because they are used as iterators (i.e., stepping along things in for loops). But in Rust, iterators replace integers for this sort of thing. There is still an integer under the hood of course, and perhaps it will be bounds checked. But that is bounds checked - not overflow checked. 2^32 is far larger than most data structures in use. Which means while there may be some code bloat, the lack of full 64-bit integers in your average Rust program means it's going to be pretty rare.
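The 32-bit-check-on-a-64-bit-machine point can be sketched in C (hypothetical helper, name mine): do the add in 64 bits, where it can't wrap, and test whether the result still fits in an int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-checked 32-bit add on a 64-bit machine: widen, add, and
   range-check. This is a couple of instructions on RV64, same as on
   other 64-bit ISAs, since no overflow flag is involved. Returns 1 on
   success, 0 on overflow. */
int checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow */
    if (wide < INT32_MIN || wide > INT32_MAX)
        return 0;
    *out = (int32_t)wide;
    return 1;
}
```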
Since I'm here, I'll comment on the article. It's true the lack of carry will make adds a little more difficult for multi-precision libraries. But - I've written a multi-precision library, and the adds are the least of your problems. Adds just generate 1 bit of carry. Multiplies generate an entire word of carry, and they are almost as common as adds. Divides are not so common, fortunately, but the execution time of just one divide will make all the overhead caused by a lack of carry look like insignificant noise.
I'm no CPU architect, but I gather the lack of carry and overflow bits makes life a little easier for just about every instruction other than adc and jo. If that's true, I'd be very surprised if the cumulative effect of those little gains didn't completely overwhelm the wins adc and jo gets from having them. Have a look at the code generated by a compiler some time. You will have a hard time spotting the adc's and jo's because there are bugger all of them.
That said, I think it's less of an issue these days for JS implementors in particular. It might have mattered more back in the day when pure JS carried a lot of numeric compute load and there weren't other options. These days it's better to stow that compute code in wasm and get predictable reliable performance and move on.
The big pain points in perf optimization for JS is objects and their representation, functions and their various type-specializations.
Another factor is that JS impls use int32s as their internal integer representation, so there should be some relatively straightforward approach involving lifting to int64s and testing the high half for overflow.
Still kind of cumbersome.
There are similar issues in existing ISAs. NaN-boxing for example uses high bits to store type info for boxed values. Unboxing boxed values on amd64 involves loading an 8-byte constant into a free register and then using that to mask out the type. The register usage is mandatory because you can't use 64-bit values as immediates.
I remember trying to reduce code size and improve perf (and save a scratch register) by turning that into a left-shift right-shift sequence involving no constants, but that led to the code executing measurably slower as it introduced data dependencies.
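For readers unfamiliar with NaN-boxing, here's a minimal sketch (the bit layout is illustrative, not any particular engine's): the payload sits in the low 48 bits and a type tag above it, so unboxing is an AND with a 64-bit mask - which on amd64 can't be an immediate and must first be loaded into a register.

```c
#include <assert.h>
#include <stdint.h>

/* Toy NaN-boxing layout: tag in the high 16 bits, payload in the low
   48. Unboxing the payload is the masked-constant operation described
   above; the 64-bit mask is too wide for an amd64 immediate. */
#define PAYLOAD_MASK 0x0000FFFFFFFFFFFFull
#define TAG_SHIFT    48

static uint64_t box(uint64_t tag, uint64_t payload) {
    return (tag << TAG_SHIFT) | (payload & PAYLOAD_MASK);
}

static uint64_t unbox_payload(uint64_t v) {
    return v & PAYLOAD_MASK;   /* needs the mask in a register on amd64 */
}

static uint64_t unbox_tag(uint64_t v) {
    return v >> TAG_SHIFT;     /* shifts avoid the constant entirely */
}
```

The shift-based tag extraction is the constant-free alternative mentioned above; as noted, replacing the mask with shift pairs can introduce data dependencies that cost more than the register they save.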
It just feels backwards to me to increase the cost of these checks in a time where we have realized that unchecked arithmetic is not a good idea in general.
If desktop/server-class RISC-V CPUs become more common, it's not unreasonable to think they'll add an extension that covers the needs of managed/higher-level languages.
Even for server-class CPUs you could argue that you absolutely want this extension to be optional, as you can design more efficient CPUs for datacenters/supercomputers where you know what kind of code you'll be running.
So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.
A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".
I would like to see some benchmarks of this efficient implementation in hardware, even simulated hardware, compared against conventional architectures.
Even for C, it's a recurring source of bugs and vulnerabilities that int overflow goes undetected. What we really need is an overflow trap like the one in IEEE floating point. RISC-V went the opposite direction.
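For what it's worth, this is the check compilers have to materialize in software today. GCC and Clang expose it as `__builtin_add_overflow`; on an ISA with an overflow flag it can compile to an add plus a branch-on-overflow, while without one, extra instructions are needed to derive the condition:

```c
#include <assert.h>

/* Overflow-checked add using the GCC/Clang builtin (not portable to
   all compilers). Returns 1 on success, 0 if the add wrapped; the
   result is stored through out either way per the builtin's contract. */
int safe_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);
}
```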
Where this really bites you is in workloads dominated by tight loops (image processing, cryptography, HPC, etc). While a microarchitecture may be more efficient thanks to simpler instructions (ignoring the added complexity of compressed instructions and macro-fusion, the usual suggested fixes...), it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.
Instruction decoding and memory ordering can be a bit of a nightmare on CISC ISAs, and fewer macro-instructions are not automatically a win. I guess we'll eventually see in benchmarks.
Even though Intel has had decades to refine their CPUs I'm quite excited to see where RISC-V is going.
Macro fusion definitely has a place in microarchitecture performance, especially when you have to deal with a legacy ISA. RISC-V makes the very unusual choice of depending on it for performance, when most ISAs prefer to fix the problem upstream.
This is technically true but not really. Decoding into many instructions is mainly used for compatibility with the crufty parts of the x86 spec. In general, for anything other than rmw or locking a competent compiler or assembly writer will only very rarely emit instructions that compile to more than one uop. The way the frontend works, microcoded instructions are extraordinarily slow on real cpus.
Modern x86 is basically a risc with a very complex decode, few extra useful complex operations tacked on, and piles and piles of old moldy cruft that no-one should ever touch.
As someone else who replied said, I'm not a CPU architect, just software that works close to the metal. That means I pay attention to compiler output.
What you say is true of the very early days: compilers did indeed use the x86's addressing modes in all sorts of odd ways to squeeze as many calculations as possible into as few bytes as possible. Then it went in the reverse direction: you started seeing compilers emitting long series of simple instructions instead, seemingly deliberately avoiding those complex addressing modes. And now it's swung back again - the compiler using addressing modes to do a shift plus a couple of adds in one instruction is common again. I presume all these shifts were driven by the speed of the resulting code.
I have no idea why one method was faster than the other - but clearly there is no hard and fast rule operating here. For some internal x86's implementations using complex addressing modes was a win. On some, for exactly the same instruction set, it wasn't. There is no cut and dried "best" way of doing it, rather it varies as the transistor and power budget changes.
One thing we do know about RISC-V is that it is intended to cover a _lot_ of transistor and power budgets. Where it's used now (low power / low transistor count), their design decisions have turned out _very_ well, far better than x86.
More fascinatingly to me, today the biggest speed ups compilers get for super scalar arch's has nothing to do with the addressing modes so much attention is being focused on here. It comes from avoiding conditional jumps. The compilers will often emit code that evaluates both paths of the computation (thus burning 50% more ALU time on a calculating a result that will never be used), then choose the result they want with a cmov. In extreme cases, I've seen doing that sort of thing gain them a factor of 10, which is far more than playing tiddly winks with addressing modes will get you.
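The cmov pattern can be sketched in C (a toy example, names mine): both arms of the computation are evaluated up front, and a ternary selects the result, which a compiler can lower to a conditional move so the pipeline never sees an unpredictable branch.

```c
#include <assert.h>

/* Branchless selection: the caller has already computed both arms, so
   the ternary here is a candidate for a cmov rather than a jump. */
static int select_branchless(int cond, int if_true, int if_false) {
    return cond ? if_true : if_false;
}

/* Example: both paths' work (the doubling) is done unconditionally,
   then the wanted result is picked. Half the ALU work is wasted, but
   no branch misprediction is possible. */
static int double_if_positive(int x) {
    int doubled = x * 2;
    return select_branchless(x > 0, doubled, 0);
}
```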
I have no idea how that will pan out for RISC-V. I don't think anyone has done a superscalar implementation of it yet(?) But in the non-superscalar implementations the RISC-V instruction set choices have worked out very well so far. And when someone does do a superscalar implementation (and I'm sure there will be a lot of different implementations over time), it seems very possible x86's learnings on addressing mode use will be yesterday's news.
The arithmetic instructions, e.g. addition or multiplication, do not encode a field for where to store the flags, so they use, like you said, an implicit destination, which is still different for integer and floating-point.
In large out-of-order CPUs, with flag register renaming, this is no longer so important, but in 1990, when POWER was introduced, the multiple sets of flags were a great advance, because they enabled the parallel execution of many instructions even in CPUs much simpler than today.
Besides POWER, the 64-bit ARMv8 also provides most of the 14 predicates that exist for a partial order relation. For some weird reason, the IEEE FP standard requires only 12 of the 14 predicates, so ARM implemented just those 12, even if they have 14 encodings, by using duplicate encodings for a pair of predicates.
I consider this stupid, because there would not have been any additional cost to gate correctly the missing predicate pair, even if it is indeed one that is only seldom needed (distinguishing between less-or-greater and equal-or-unordered).
It doesn't have to be _that_ bad. As long as condition flags are all written at once (or are essentially banked, like PowerPC's), the dependency issue can go away, because they're renamed and their results aren't dependent on previous data.
Now, of course, instructions that only update some condition flags and preserve others are the devil.
> all those extra instructions to compute carry will blow the I$ faster
I think the idea is, as others have mentioned, that the add/comp instructions are fused internally to a single instruction, so probably it's not as bad for the I$ as we might think?

Is it actually implemented on any hardware?
It all seems hypothetical to me now; fast cores would fuse the instructions together, so instruction count alone isn't adequate for the original evaluation of the ISA. Now I'm not sure that there are any that really do that...
When you hear the "<person / group> could make a better <implementation> in <short time period>" - call them out. Do it. The world will not shun a better open license ISA. We even have some pretty awesome FPGA boards these days that would allow you to prototype your own ISA at home.
In terms of the market - now is an exceptionally great time to go back to the design room. It's not as if anybody will be manufacturing much during the next year, with all of the fabs unable to make enough existing chips to meet demand. There is a window of opportunity here.
> It is, more-or-less a watered down version of the 30 year old Alpha ISA after all. (Alpha made sense at its time, with the transistor budget available at the time.)
As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.
I also really like the Unix philosophy of doing one simple thing well. Sure, it could have some special instruction that does exactly your use case in one cycle using all the registers, but that's not what has created such advances in general purpose computing.
> Sure, it is "clean" but just to make it clean, there was no reason to be naive.
I would much rather we build upon a conceptually clean instruction set, rather than trying to cobble together hacks on top of fundamentally flawed designs - even at the cost of performance. It's exactly these cobbled-together conceptual hacks that have led to the likes of the Spectre and Meltdown vulnerabilities, when instruction sets become so complicated that they cannot be easily tested.
But the author making an argument like that...
> I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project.
Pretty much blew their credibility. It's obviously wrong, and a sensible, fair person wouldn't write it.
When you use a CPU architecture you don't just get an ISA.
You also get compilers and debuggers. Ready-to-run Linux images. JIT compilers for JavaScript and Java. Debian repos and Python wheels with binaries.
And you get CPUs with all the most complex features. Instruction re-ordering, branch prediction, multiple cores, multi-level caches, dynamic frequency and voltage control. You want an onboard GPU, with hardware 4k h264 encoding and decoding? No problem.
And you get a wealth of community knowledge - there are forum posts and StackOverflow questions where people might have encountered your problems before. If you're hiring, there are loads of engineers who've done a bit of stuff with that architecture before. And of course vendors actually making the silicon!
I've seen ISAs documented with a single sheet of A4 paper. The difficult part in having a successful CPU architecture is all the other stuff :)
How about some 32 way SMT GPUs... No more divergence!
That allows more flexibility for CPU designs to optimize transistor count vs speed vs energy consumption.
This guy clearly did not look at the stated rationale for the design decisions of RISC-V.
Beyond that, compressed instructions are not a 1:1 substitute for more complex instructions, because a pair of compressed instructions cannot have any fields that cross the 16-bit boundary. This means you can't recover things like larger load/store offsets.
Additionally, you can't discard architectural state changes due to the first instruction. If you want to fuse an address computation with a load, you still have to write the new address to the register destination of the address computation. If you want to perform clever fusion for carry propagation, you still have to perform all of the GPR writes. This is work that a more complex instruction simply wouldn't have to perform, and again it complicates a high performance implementation.
They spent a lot of time and effort on making sure the decoding is good and useful for high-performance implementations.
RISC-V is designed for very small and very large systems. At some point some tradeoffs need to be made, but these are very reasonable and most of the time not a huge problem.
For the really specialized cases where you simply can't live with those extra instructions, those will be added to the standard, and then some profiles will include them and others won't. If those instructions are really as vital as those who want them claim, they will find their way into many profiles.
Saying RISC-V is 'terrible' because of those choices is not a fair way of evaluating it.
That's exactly the problem --- there is no one-size-fits-all when it comes to instruction set design.
Besides that, you raise good points on sources of complexity. I’m waiting for the benchmarks once such developments have been incorporated. Everything else is guesswork.
More difficult than x86? We're talking about a damn simple variable width decoding here.
I could imagine RISC-V with C extension being more tricky than 64-bit ARM. Maybe.
> and again it complicates a high performance implementation.
But so much of the rationale behind the design of RISC-V is to simplify high performance implementation in other ways. So the big question is what the net effect is.
The other big question is if extensions will be added to optimise for desktop/server workloads by the time RISC-V CPUs penetrate that market significantly.
Of course you discard architectural state changes in fusion. If I have a bunch of instructions which end up reading from memory into register x10, then I can fuse with all previous instructions which wrote into x10, as their results get clobbered anyway.
Disclaimer: I may have misunderstood the point you made. However you don’t seem to make it clear how fusion is bad for performance.
What performance tricks are you giving up by doing fusion?
> I have heard that Risc V proponents say that these problems are known and could be fixed by having the hardware fuse dependent instructions. Perhaps that could lessen the instruction set shortcomings, but will it fix the 3x worse performance for cases like the one outlined here?
Macro-fusion can to some extent offset the weak instruction set, but you're never going to get a multiple-factor speedup out of it, given the complexity of the inter-op architectural state changes that have to be preserved and the instruction-boundary limitations involved; it's never going to offset a 3x blowup in instruction count in a tight loop.
Also, it's said that x86 is bad because the instructions are then reorganized and translated inside the CPU. But it seems that you are proposing the same: the CPU preprocesses the instructions and fuses some into a single one (the opposite of what x86 does). At that point, it seems to me that what x86 does makes more sense: have a ton of instructions (and thus smaller programs, and thus more code that can fit in cache) and split them, rather than having a ton of instructions (and wasting cache space) only for the CPU to combine them into a single one (a thing that a compiler could also do).
Anyway, what you gain from this is a very simple ISA, which helps tool writers and those who implement hardware, as well as academia, for teaching and research.
How does the insanely complex x86 instructions help anyone?
Also, don't reason with the desktop or server use case in mind, where you have TB of disk and code size doesn't matter. RISC-V is meant to be used also for embedded systems (in fact, its use nowadays is mostly in these systems), where code size usually matters more than performance (i.e. you typically compile with -Os). In these situations more instructions means more flash space wasted, meaning you can fit less code.
RISC-V has a number of places it's employed where it makes an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor, or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for the pedagogical purpose. For a researcher trying to experiment with better branch prediction techniques, having a standard high-ish performance open source design they can take and modify with their ideas is immensely helpful. And many companies in the real world with their eyes on the bottom line like having an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need.
I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that and RISC-V is already a success in some of them.
Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy, maybe he's some sort of ISA expert. But it strains credulity that after all this time and work put into it, RISC-V is a "terrible architecture".
My expectation here is that RISC-V requires some inefficient instruction sequences in some corners somewhere (and one of these corners happens to be OP's pet use case), but by and large things are fine.
And even then, I don't think that's clear. You're not going to determine performance just by looking at a stream of instructions on modern CPUs. Hell, it's really hard to compare streams of instructions from different ISAs.
Seems quite balanced with all the other replies here which claim it's the best architecture ever whenever anyone says anything about it.
I don't think its vector extensions would be good for video codecs because they seem designed around large vectors. (and the article the designers wrote about it was quite insulting to regular SIMD)
RISC-V is pretty good. Probably slightly better for some things than ARM, and slightly worse for others. It's open, which is awesome, and the instruction set lends itself to extensions which is nice (but possibly risks the ecosystem fragmenting). Building really high performance RISC-V designs looks like it's going to rely on slightly smarter instruction decoders than we've seen in the past for RISCs, but it doesn't look insurmountable.
Bad? Quite possible, it was meant as a teaching ISA initially IIRC, but terrible? Who knows.
If you look at the early history of RISC-V, it does indeed look like as something built for teaching. But I don't think that use case warrants all the hype around it.
So how did all the hype form, and why is it that there are people seemingly hyping it as the next-gen dream-come-true super elegant open developed-with-hindsight ISA that will eventually displace crufty old x86 and proprietary ARM while offering better performance and better everything? Of course that just baits you into arguing about its potential performance. And don't worry if it doesn't have all the instructions you need for performance yet, we'll just slap it with another extension and it totally won't turn into a clusterfuck with a stench of legacy and numerous attempts at fixing it (coz' remember, hindsight)!
And then if you question its potential, you'll get someone else arguing that no no, it's not a high performance ISA for general use in desktops / servers, it's just an extensible ISA that companies can customize for their special sauce microcontrollers or whatever.
Of course it's all armchair speculation because there are no high performance real world implementations and there aren't enough experts you can trust.
    typedef __int128_t int128_t;

    int128_t add(int128_t left, int128_t right)
    {
        return left + right;
    }

GCC 10, -O2, RISC-V:

    add(__int128, __int128):
        mv   a5,a0
        add  a0,a0,a2
        sltu a5,a0,a5
        add  a1,a1,a3
        add  a1,a5,a1
        ret

ARM64:

    add(__int128, __int128):
        adds x0, x0, x2
        adc  x1, x1, x3
        ret
This issue hurts the wider types that are compiler built-ins. Even though C has a programming model that is devoid of any carry-flag concept, canned types like a 128-bit integer can take advantage of it.
Portable C code to simulate a 128-bit integer will probably emit bad code across the board. The code will explicitly calculate the carry as an additional operand and pull it into the result. The RISC-V won't look any worse, then, in all likelihood.
(The above RISC-V instruction set sequence is shorter than the mailing list post author's 7 line sequence because it doesn't calculate a carry out: the result is truncated. You'd need a carry out to continue a wider addition.)
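A minimal sketch of that portable simulation in C (illustrative names, low and high words only), showing the carry computed as an explicit extra operand:

```c
#include <stdint.h>

/* Portable 128-bit addition without compiler __int128 support: the
   carry out of the low word is recovered by comparison, which is the
   same extra work the flag-less instruction sequences perform. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry-out of the low add */
    return r;
}
```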
2 instructions to work with 64 bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to a loop, and 1 more instruction for a loop counter of some kind?
So we're looking at ~4 instructions for 64-bits on ARM/x86, but ~9-instructions on RISC-V.
The loop will be performed in parallel in practice however due to Out-of-order / superscalar execution, so the discussion inside the post (2 instruction on x86 vs 7-instructions on RISC-V) probably is the closest to the truth.
----------
Question: is ~2 clock ticks per 64 bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMD. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V would probably be on roughly equal footing there.
I don't know exactly how to write a bignum addition loop in AVX off the top of my head. But I'd assume it'd be similar to the 7-instructions listed here, except... using 256-bit AVX-registers or 512-bit AVX512 registers.
So 7-instructions to perform 512-bits of bignum addition is 73-bits-per-clock cycle, far superior in speed to the 32-bits-per-clock cycle from add + adc (the 64-bit code with implicit condition codes).
AVX512 is uncommon, but AVX (256-bit) is common on x86 at least: leading to ~36-bits-per-clock tick.
----------
ARM has SVE, which is ambiguous (sometimes 128-bits, sometimes 512-bits). RISC-V has a bunch of competing vector instructions.
..........
Ultimately, I'm not convinced that the add + adc methodology here is best anymore for bignums. With a wide-enough vector, it seems more important to bring forth big 256-bit or 512-bit vector instructions for this use case?
EDIT: How many bits is the typical bignum? I think add+adc probably is best for 128, 256, or maybe even 512-bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without me writing code, but just a hunch).
2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA, but instead multiplication (and division which is based on multiplication).
There is only one standard V extension. Alibaba made a chip with a prerelease version of that V extension which is thus incompatible with the final version, but in practice that just means that the vector unit on that chip is not used because it is incompatible, not that there are now competing standards
add+adc should still be 64 bits per cycle. adc doesn't just add the carry bit, it's an add instruction which includes the usual operands, plus the carry bit from the previous add or adc.
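As a C model of those semantics (this sketches the architectural behavior of an add-with-carry, not any particular encoding):

```c
#include <stdint.h>

/* Model of an add-with-carry instruction: a full-width add of both
   operands plus the carry-in. Two comparisons recover the carry-out,
   because the "+ cin" step can itself wrap. */
static uint64_t adc64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout) {
    uint64_t s = a + b;
    unsigned c1 = s < a;   /* carry out of a + b */
    uint64_t r = s + cin;
    unsigned c2 = r < s;   /* carry out of adding the carry-in */
    *cout = c1 | c2;
    return r;
}
```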
Which is why I'm sure add / adc will still win at 128-bits, or 256-bits.
The main issue is that the vector-add instructions are missing carry-out entirely, so recreating the carry will be expensive. But with a big enough number, that carry propagation is parallelizable in log2(n), so a big enough bignum (like maybe 1024-bits) will probably be more efficient for SIMD.
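One way to see why the ripple is parallelizable: pack per-limb "generate" and "propagate" bits into a mask and resolve every carry with one word-sized add, the same structure a vector implementation would use. A scalar sketch of my own (limited to 64 limbs, final carry-out dropped):

```c
#include <stdint.h>
#include <stddef.h>

/* Branchless bignum add: compute all limb sums first, then resolve
   all carries at once via mask arithmetic instead of a serial ripple. */
static void bignum_add(uint64_t *r, const uint64_t *a,
                       const uint64_t *b, size_t n) {
    uint64_t g = 0, p = 0;                 /* requires n <= 64 */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        g |= (uint64_t)(s < a[i]) << i;    /* limb generates a carry */
        p |= (uint64_t)(s == ~0ULL) << i;  /* limb propagates a carry */
        r[i] = s;
    }
    uint64_t gs = g << 1;                  /* carries enter the next limb */
    uint64_t c = ((gs + p) ^ p) | gs;      /* resolve the ripple in one add */
    for (size_t i = 0; i < n; i++)
        r[i] += (c >> i) & 1;
}
```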
MIPS didn't have a flag register either and depended on a dedicated zero register and slt instructions (set if less than)
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
MIPS is classical RISC design that was not designed to be OoO-friendly at all and is simply designed for ease of straightforward pipelined implementation. The reason why it does not have flags probably simply comes down to the observation that you don't need flags for C.
Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything...
What sticks in my mind from my limited exposure to SuperH is that there's no load immediate instruction, so you have to do a PC-relative load instead. It was clearly optimized for compiled rather than handwritten code!
SuperH has a mov #imm, Rx that can take an 8-bit #imm. But you're right, literal pools were used just like on ARM.
Things I liked about SuperH: 16 bit fixed-width insn format (except for some SH2A and DSP ops), T flag for bit manipulation ops, GBR to enable scaled loads with offset, xtrct instruction, single-cycle division insns (div0, div1), MAC insns.
In terms of code density SH was quite effective, see here http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_dens... or here http://www.deater.net/weave/vmwprod/asm/ll/ll.html
Not having anything that stands out is perhaps a good thing. Being "clever" with the ISA tends to bite you when implementing OoO superscalar cores.
You can detect carry of (a+b) in C branch-free with:

    ((a & b) | ((a | b) & ~(a + b))) >> 31

So 64-bit add in C is:

    f_low  = a_low + b_low
    c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31
    f_high = a_high + b_high + c_high
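As a compilable form of the same formula (32-bit words, matching the >> 31 above):

```c
#include <stdint.h>

/* Branch-free carry-out of a 32-bit add: the top bit carries out iff
   both operands' top bits are set, or one is set and the sum's is clear. */
static uint32_t carry_out32(uint32_t a, uint32_t b) {
    uint32_t s = a + b;
    return ((a & b) | ((a | b) & ~s)) >> 31;
}
```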
So for RISC-V (gcc 8.2.0 with -O2 -S -c) I get:

    add  a1,a3,a2
    or   a5,a3,a2
    not  a7,a1
    and  a5,a5,a7
    and  a3,a3,a2
    or   a5,a5,a3
    srli a5,a5,31
    add  a4,a4,a6
    add  a4,a4,a5
But for ARM I get (with gcc 9.3.1):

    add ip, r2, r1
    orr r3, r2, r1
    and r1, r1, r2
    bic r3, r3, ip
    orr r3, r3, r1
    lsr r3, r3, #31
    add r2, r2, lr
    add r2, r2, r3

It's shorter because ARM has bic. Neither one figures out how to use carry-related instructions.

Ah! But! There is a gcc builtin, __builtin_uadd_overflow(), that replaces the first two C lines above:

    c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
So with this, RISC-V:

    add  a3,a4,a3
    sltu a4,a3,a4
    add  a5,a5,a2
    add  a5,a5,a4

ARM:

    adds  r2, r3, r2
    movcs r1, #1
    movcc r1, #0
    add   r3, r3, ip
    add   r3, r3, r1

RISC-V is faster.

EDIT: clang has one better: __builtin_addc().
    f_low  = __builtin_addcl(a_low, b_low, 0, &c);
    f_high = __builtin_addcl(a_high, b_high, c, &junk);

x86:

    addl 8(%rdi), %eax
    adcl 4(%rdi), %ecx

ARM:

    adds w8, w8, w10
    add  w9, w11, w9
    cinc w9, w9, hs

RISC-V:

    add  a1, a4, a5
    add  a6, a2, a3
    sltu a2, a2, a3
    add  a6, a6, a2

I find it funny that you fall into the same pitfall as the author did.
Faster on which CPU?
The author doesn't measure on any CPU, so here there are dozens of people hypothesizing whether fusion happens or not, and what the impact is.
Counting number of instructions isn't really a good metric for that either.
Perhaps faster means fewer instructions in this instance? Considering number of instructions is what has been discussed.
Same for code size. If the instructions are half the size, having 1.5x more instructions still means smaller binaries.
In addition to the actual ALU instructions doing the add with carry, for bignums it's important to include the load and store instructions. Even in L1 cache it's typically 2 or 3 or 4 cycles to do the load, which makes one or two extra instructions for the arithmetic less important. Once you get to bignums large enough to stream from RAM (e.g. calculating pi to a few billion digits) it's completely irrelevant.
This especially applies to potentially controversial things.
Overall, I feel HN is most fun when a lot of people are in disagreement but also operating in good faith.
But I agree that this bit of writing comes across as a bit overly assertive and arrogant; and probably trivially proved wrong by actually running some benchmarks.
By the same reasoning, the Apple M1 would obviously be slower than anything Intel and AMD produce given similar energy and transistor density constraints (i.e. same class of hardware). Except that obviously isn't the case and we have the Macbook air with the M1 more than holding up against much more expensive Intel/AMD chips. Reason: chips don't actually work like this person seems to assume. The whole article is a sandcastle of bad assumptions leading up to an arrogantly worded & wrong conclusion.
You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.
Many people still think that RISC-V implies an open source implementation, for example.
The minimum duration of the clock cycle of a modern CPU is essentially determined by the duration of a 64-bit integer addition/subtraction, because such operations need a latency of only 1 clock cycle to be useful.
Operations that are more complex than 64-bit integer addition/subtraction, e.g. integer multiplications or floating-point operations, need multiple cycles, but they are pipelined so that their throughput remains at 1 per cycle.
So 64-bit addition/subtraction is certainly expected to be included in any RISC ISA.
The hardware adders used for addition/subtraction provide, at a negligible additional cost, 2 extra bits, carry and overflow, which are needed for operations with large integers and for safe operations with 64-bit integers.
The problem is that the RISC-V ISA does not offer access to those 2 bits and generating them in software requires a very large cost in execution time and in lost energy in comparison with generating them in hardware.
I do not see any relationship between these bits and the RISC concepts, omitting them does not simplify the hardware, but it makes the software more complex and inefficient.
Edit: Another place you see this kind of arithmetic is crypto, but those specific use cases (Diffie-Hellman, RSA, a few others) don't tend to be vectorized. You have one op you're trying to work through with large integers, and there's the carry dependency on each partial op. The carry-dependent crypto algorithms aren't typically vectorizable.
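To make the cost of the missing overflow bit concrete, here is roughly what a checked signed 64-bit add looks like in software (a sketch of the generic sign-comparison technique, not any particular compiler's output):

```c
#include <stdint.h>
#include <stdbool.h>

/* Checked signed addition without a hardware overflow flag: the
   condition must be reconstructed from operand and result signs.
   With an overflow bit this would be the add plus one flag test. */
static bool add64_overflows(int64_t a, int64_t b, int64_t *r) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b, us = ua + ub;
    *r = (int64_t)us;
    /* overflow iff the operands agree in sign and the sum does not */
    return (~(ua ^ ub) & (ua ^ us)) >> 63;
}
```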
My code snippet results in bloated code for RISC-V RV64I.
I'm not sure how bloated it is. All of those instructions will compress [1].

[1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse...
It's slower on RISC-V, but not by a lot on a superscalar. The x86 and ARMv8 snippets have 2 cycles of latency. The RISC-V one has 4 cycles of latency.

    1. add t0, a4, a6     add t1, a5, a7
    2. sltu t6, t0, a4    sltu t2, t1, a5
    3. add t4, t1, t6     sltu t3, t4, t1
    4. add t6, t2, t3

I'm not getting "terrible" from this.

On the other hand, I take this article with a grain of salt anyhow, since it only discusses a single example. I think we would need a lot more optimized assembly snippet comparisons to make meaningful conclusions (and even then there could be author selection bias).
>"here's this snippet, it takes more instructions on RISC-V, thus RISC-V bad"
Is pretty much what it's saying. An actual argument about ISA design would weight the cost this has with the advantages of not having flags, provide a body of evidence and draw conclusions from it. But, of course, that would be much harder to do.
What's comparatively easy and they should have done, however, is to read the ISA specification. Alongside the decisions that were made, there's a rationale to support it. Most of these choices, particularly so the ones often quoted in FUD as controversial or bad, have a wealth of papers, backed by plentiful evidence, behind them.
For those who are more versed: is this really a general problem?
I was under the impression that the real bottleneck is memory, that things like this would be fixed in real applications through out-of-order execution, and that it paid off having simpler instructions because compilers had more freedom to rearrange things.
Is that even a fair comparison given the arm and x86 versions used as examples of "better" were 64 bit?
If we're really comparing 32 and 64 and complaining that 32 bit uses more instructions than 64, perhaps we should dig out the 4 bit processors and really sharpen the pitchforks. Alternatively, we could simply not. Comparing apples to oranges doesn't really help.
From the article:
Let's look at some examples of how Risc V underperforms.
First, addition of a double-word integer with carry-out:

    add  t0, a4, a6   // add low words
    sltu t6, t0, a4   // compute carry-out from low add
    add  t1, a5, a7   // add hi words
    sltu t2, t1, a5   // compute carry-out from high add
    add  t4, t1, t6   // add carry to low result
    sltu t3, t4, t1   // compute carry out from the carry add
    add  t6, t2, t3   // combine carries

Same for 64-bit arm:

    adds x12, x6, x10
    adcs x13, x7, x11

Same for 64-bit x86:

    add %r8, %rax
    adc %r9, %rdx
You should take into account that the libgmp authors have a huge amount of experience in implementing operations with large integers on a very large number of CPU architectures, i.e. on all architectures supported by gcc, and for most of those architectures libgmp has been the fastest during many years, or it still is the fastest.
"I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project"
Utter horse manure.
Perhaps something similar is needed within ISAs / CPUs ? Say an OS kernel, a ZIP-algorithm, Mandelbrot, Fizz-buzz ... could measure code compactness but also performance and energy usage.
Everything should be written in C, or some scripting language implemented in C. Writing safe code is easy, just wrap everything in layers of macros that the compiler will magically optimize away, and if it doesn't, computers are fast enough anyway, right? The mark of a real programmer is that every one of their source files includes megabytes of headers defining things like __GNU__EXTENSION_FOO_BAR_F__UNDERSCORE_.
You say your processor has a single instruction to do some extremely common operation, and want to use it? You shouldn't even be reading a processor manual unless you are working on one of the two approved compilers, preferably GCC! If you are very lucky, those compiler people that are so much smarter than you could hope to be, have already implemented some clever transformation that recognizes the specific kind of expression produced by a set of deeply nested macros, and turns them into that single instruction. In the process, it will helpfully remove null pointer checks because you are relying on undefined behaviour somewhere else.
You say you'll do it in assembly? For Kernighan's sake, think about portability!!! I mean, portable to any other system that more or less looks the same as UNIX, with a generous sprinkling of #ifdefs and a configure script that takes minutes to run.
Implement a better language? Sure, as long as the compiler is written in C, preferably outputs C source code (that is then run through GCC), and the output binary must of course link against the system's C library. You can't do it any other way, and every proper UNIX - BSD or Mac OS X - will make it literally impossible by preventing syscalls from any other piece of code.
IMO this is like a cultural virus that seems to have infected everything IT-related, and I don't exactly understand why. Sure, having all these layers of cruft down below lets us build the next web app faster, but isn't it normal to want to fix things? Do some people actually get a sense of satisfaction out of saying "It is a solved problem, don't reinvent the wheel"? Or do they want to think that their knowledge of UNIX and C intricacies is somehow the most important, fundamental thing in computer science?
Isn't this the classic RISC vs CISC problem?
Comparing x86/ARM to RISC-V feels like Apples to Grains of Rice.
If RISC-V was born out of a need for an open source embedded ISA, would the ISA not need to remain very RISC-like to accommodate implementations with fewer available transistors... Or is this an outdated assumption?
Maybe SISC - "Simplified" instruction set computing, perhaps. ARM isn't exactly super complicated in this particular aspect (it is elsewhere), but in this case the designers basically chose to make branches simpler at the expense of code that needs to check overflows (or flags more generally)
RISC-V was born partly out of a desire for a teaching ISA, also, so simplicity is a boon in that context too.
Whether the similar awkwardness applies to a lot of other code or not is not being told by this isolated case.
Moderators where are you?
I'm not a fan of the RISC-V design but the presence or absence of this instruction doesn't make it a terrible architecture.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Not wasting much sleep on this one. Not sure there's anything in the spec that stops implementations from recognizing the two instructions and fusing them into a single atomic operation for the backends to deal with. It'll occupy more space in the L1 cache, but that's it.
It does not matter much, because there is a sequence of dependent instructions, which cannot be executed in parallel, regardless which is the maximum IPC of a RISC-V CPU.
The opinions from those messages matter, because they belong to experts in implementing operations with large integers on a lot of different CPU architectures, with high performance proven during decades of ubiquitous use of their code. They certainly have a better track record than any RISC-V designer.
It doesn't matter how great something else could be in theory if it doesn't exist or doesn't meet the same scale and mindshare (or adoption).