But ultimately, the gist of their argument is this:
>Any task will require more RISC-V instructions than any contemporary instruction set.
Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
I am familiar with many tens of instruction sets, from the first vacuum-tube computers through all the important instruction sets still in use, and there is no doubt that RISC-V requires more instructions and a larger code size than almost all of them, for any task.
Even the hard-to-believe "research" results published by RISC-V developers have always shown worse code density than ARM; the so-called better results were for the compressed extension, not for the normal encoding.
Moreover, the results for RISC-V are hugely influenced by the chosen programming language and compiler options. RISC-V has acceptable code size only for unsafe code: if the programming language or the compiler options require run-time checks to ensure safe behavior, the RISC-V code size increases enormously, while for other CPUs it barely changes.
The RISC-V ISA has only one good feature for code size: the combined compare-and-branch instructions. Because there is typically 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.
Except for this good feature, the rest of the ISA is full of bad features that frequently require at least 2 instructions where any other CPU needs 1. For example, the lack of indexed addressing, which any loop that accesses an aggregate data structure needs in order to be implemented with a minimum number of instructions.
>Even the hard-to-believe "research" results published by RISC-V developers have always showed worse code density than ARM
the code size advantage of RISC-V is not artificial academic bullshit. It is real, it is huge, and it is trivial to verify. Just build any non-trivial application from source with a common compiler (such as GCC or LLVM's clang) and compare the sizes you get. Or look at the sizes of binaries in Linux distributions.
>the so-called better results were for the compressed extension, not for the normal encoding.
The C extension can be used anywhere, as long as the CPU supports the extension; most RISC-V profiles require it. This is in stark contrast with ARMv7's Thumb, which was a literal separate CPU mode. Effort was put into making this very cheap for the decoder.
The common patterns where the number of instructions is larger are made irrelevant by fusion. RISC-V was thoroughly designed with fusion in mind, and is unique in this regard. It is within its rights in calling itself the 5th-generation RISC ISA because of this, even if everything else is ignored.
Fusion will turn most of these "2 instructions instead of 1" cases into a single instruction from the execution unit's perspective. There are opportunities for fusion everywhere; the patterns are designed in. The cost of fusion on RISC-V is also very low, often quoted at around 400 gates, allowing even simpler microarchitectures to implement it.
Ignoring RISC-V’s compressed encoding seems a rather artificial restriction.
Which isn't really a big advantage, because ARM and x86 macro-op fuse those instructions together. (That is, those 2 instructions are decoded and executed as 1 macro-op in practice.)
cmp/jnz on x86 is about 4 bytes as well. So 4 bytes on x86 vs 4 bytes on RISC-V; 1 macro-op on x86 vs 1 instruction on RISC-V.
So they're equal in practice.
-----
ARM is 8 bytes, but macro-op decoded. So 1 macro-op on ARM, but 8 bytes used up.
You have no idea what you're talking about. I've worked on designs with both ARM and RISC-V cores. The RISC-V core outperforms the ARM core with a smaller gate count, and has similar or higher code density in real-world code, depending on the extensions supported. The only way you get much lower code density is without the C extension, but I haven't seen it left out of a real-world commercial core, and where it was, I'm sure there was a reason (FPGAs sometimes use ultra-simple cores for some tasks, and don't always care about instruction throughput or density).
It should be said that my experience is in embedded, so yes, it's unsafe code. But the embedded use case is also the most mature. I wouldn't be surprised if extensions that help with safer programming languages were added for desktop/server-class CPUs, if they haven't been already (I haven't followed the development of the spec that closely recently).
What are your thoughts on the way RISC-V handled the compressed instruction subset?
It's perfectly possible to have read the spec and disagree with the rationale provided. RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.
> Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
This doesn't seem to be true when you actually do an apples-to-apples comparison.
Taking as an example the build of Bash in Debian Sid (https://packages.debian.org/sid/shells/bash). I chose this because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other examples like the Linux kernel are harder to compare because the code in question is different across architectures. I saw the same trend in the GCC package, so it's not an isolated example.
riscv64 installed size: 6,157.0 kB
amd64 installed size: 6,450.0 kB
arm64 installed size: 6,497.0 kB
armhf installed size: 6,041.0 kB
RV64 is outperforming the other 64-bit architectures, but under-performing 32-bit ARM. This is consistent with expectations: amd64 has a size penalty due to REX bytes, arm64 got rid of compressed instructions to enable higher performance, and armhf (32-bit) has smaller constants embedded in the binary.
Compressed instructions definitely do work for making code smaller, and that's part of why arm32 has been very successful in the embedded space, and why that space hasn't been rushing to adopt arm64. For arm32, however, compressed instructions proved to be a limiting factor on high performance implementation, and arm64 moved away from them because of it. Maybe that's due to some particular limitations of arm32's compressed instructions that RISC-V compressed instructions won't suffer from, but that remains to be proven.
To compare the code sizes, you need tools like "size", "readelf", etc., and the data given by the tools should still be studied, to see how much of the code sections really contains code.
I have never until now seen a program where the RISC-V variant is smaller than the ARMv8 or Intel/AMD variant, and I doubt very much that such a program can exist. Except for the branches, where RISC-V frequently needs only 4 bytes instead of 5 bytes for Intel/AMD or 8 bytes for ARMv8, it is very frequent for all the other instructions to need 8 bytes on RISC-V instead of 4 bytes on ARMv8.
Moreover, choosing compiler options like -fsanitize for RISC-V increases the number of instructions dramatically, because there is no hardware support for things like overflow detection.
Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field? Obviously ISAs are a low level detail and far removed from the end product, but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon, before ubiquitous internet and the open source culture could establish itself. We've now got companies using genetic algorithms to optimize ICs and people making their own semiconductors in the 100s of microns range in their garages - maybe it's time to rethink some things!
(EE in a past life but little experience designing ICs so I feel like I'm talking out of my rear end)
Fusing instructions isn't just theoretical either. I'm pretty sure it is or will be a common optimisation for CPUs aiming for high performance. How exactly are two easily-fused 16-bit instructions worse than one 32-bit one? Is there really a practical difference other than the name of the instruction(s)?
At the same time, the reduced transistor count you get from a simpler instruction set is not a benefit to be just dismissed either. I'm starting to see RISC-V cores being put all over the place in complex microcontrollers, because they're so damn cheap, yet have very decent performance. I know a guy developing a RISC-V core. He was involved with a proposal for a couple of instructions that would put code density above Thumb for most code, and the performance of his core was better than a Cortex-M0 at a similar or smaller gate count. I'm not sure whether the instructions were added to the standard or not, though.
Even for high performance CPUs, there's a case to be made for requiring fewer transistors for the base implementation. It makes it easier to make low-power low-leakage cores for the heterogeneous architecture (big.little, M1, etc.) which is becoming so popular.
Funny, I thought the whole thing was bitching that RISC-V has no carry flag, which obviously causes multi-word arithmetic to take more instructions. The obvious workaround is to use half-words and use the upper half for the carry. There may be better solutions, but even at twice the number of instructions this "dumb" method is better than what the author did.
Flags were removed because they cause a lot of unwanted dependencies and contention in hardware designs and they aren't even part of any high level language.
I still think that instead of compare-and-branch they should have made an "if" that would execute the following instruction only if true. But that's just an opinion. I also hate the immediate constants (12 bits?) inside the instruction. Nothing wrong with 16-, 32- or 64-bit immediate data after the opcode.
I hope RISC 6 will come along down the road (not soon) and fix a few things. But I like the lack of flags...
RISC-V basically says "let's make the implicit explicit," and you have to essentially use registers to store the carry information when operating on bigints. Which for the current implementations means chaining more instructions.
Is that correct?
That sounds like what the FP crowd is always talking about - eschewing shared state so it's easier to reason about, optimize, parallelize, etc.
Nevertheless, the ISA speaks for itself. The goal of a technical project is to produce a technical artifact, not to generate good feelings having followed a "solid" process.
If the process you followed brought you to this, of what use was the process?
Also, the godbolt.org compiler explorer has RISC-V support: useful for anyone interested in comparing specific snippets of code.
https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...
Wait, if we are talking about actual ISA instructions, why is it hard to believe that RISC-V would have more of them? The argument in favor of RISC is to simplify the frontend, because even for a complex ISA like x86 the instructions get converted to many micro-ops. In terms of actual ISA instructions, it seems quite reasonable that x86 would have fewer of those (at the cost of frontend complexity).
Doing it using a small pool of instructions, too (as RISC-V does), is the cherry on top.
Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions despite the fact that complexity has its own downsides.
RISC-V has been aggressively designed to have a minimal ISA, both to leave plenty of room to grow and to require a minimal number of transistors for a minimal solution.
Should this be a showstopper down the road, there will be plenty of space to add an extension that fixes the problem. Meanwhile, embedded systems paying a premium for transistors won't have to pay for these extra instructions, as only 47 instructions have to be implemented in a minimal solution.
I think in 10-20 years everyone will agree that all the "bad" RISC-V decisions don't matter. The same way x86 (CISC) was supposed to be bad because of legacy/backwards compatibility.
It's a trade-off, and the one that's been made makes it possible to make ALL instructions a little faster at the expense of one particular case that isn't used much. That's how you do computer architecture: you look at the whole, not just one particular case.
RISC-V also specifies a 128-bit variant that is of course FASTER than these examples.
I wish there was a way out.
Language features are also often implemented at least partly because they can be done efficiently on the premiere hardware for the language. Then new hardware can make such features hard to implement.
WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.
I'm sure that people brought up the overflow situation to the RISC-V designers, and it was similarly dismissed. It's just unfortunate that legacy software is such a big driver of CPU features as that's a race towards lowest-common-denominator hardware.
That said, I think it's less of an issue these days for JS implementors in particular. It might have mattered more back in the day when pure JS carried a lot of numeric compute load and there weren't other options. These days it's better to stow that compute code in wasm and get predictable reliable performance and move on.
The big pain points in perf optimization for JS is objects and their representation, functions and their various type-specializations.
Another factor is that JS impls use int32s as their internal integer representation, so there should be some relatively straightforward approach involving lifting to int64s and testing the high half for overflow.
Still kind of cumbersome, though.
There are similar issues in existing ISAs. NaN-boxing for example uses high bits to store type info for boxed values. Unboxing boxed values on amd64 involves loading an 8-byte constant into a free register and then using that to mask out the type. The register usage is mandatory because you can't use 64-bit values as immediates.
I remember trying to reduce code size and improve perf (and save a scratch register) by turning that into a left-shift right-shift sequence involving no constants, but that led to the code executing measurably slower as it introduced data dependencies.
If desktop/server-class RISC-V CPUs become more common, it's not unreasonable to think they'll add an extension that covers the needs of managed/higher-level languages.
Even for server-class CPUs you could argue that you absolutely want this extension to be optional, as you can design more efficient CPUs for datacenters/supercomputers where you know what kind of code you'll be running.
Where this really bites you is in workloads dominated by tight loops (image processing, cryptography, HPC, etc). While a microarchitecture may be more efficient thanks to simpler instructions (ignoring the added complexity of compressed instructions and macro-fusion, the usual suggested fixes...), it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.
Instruction decoding and memory ordering can be a bit of nightmare on CISC ISAs and fewer macro-instructions are not automatically a win. I guess we'll eventually see in benchmarks.
Even though Intel has had decades to refine their CPUs I'm quite excited to see where RISC-V is going.
As someone else who replied said, I'm not a CPU architect, just someone who writes software close to the metal. That means I pay attention to compiler output.
What you say was true in the very early days: compilers did indeed use the x86's addressing modes in all sorts of odd ways to squeeze as many calculations as possible into as few bytes as possible. Then it went in the reverse direction: you started seeing compilers emit long series of simple instructions instead, seemingly deliberately avoiding the complex addressing modes. And now it's swung back again: the compiler using an addressing mode to do a shift plus a couple of adds in one instruction is common again. I presume all these shifts were driven by the speed of the resulting code.
I have no idea why one method was faster than the other, but clearly there is no hard and fast rule operating here. For some x86 implementations, using complex addressing modes was a win. On others, for exactly the same instruction set, it wasn't. There is no cut-and-dried "best" way of doing it; rather, it varies as the transistor and power budget changes.
One thing we do know about RISC-V is that it is intended to cover a _lot_ of transistor and power budgets. Where it's used now (low power / low transistor count), its design decisions have turned out _very_ well, far better than x86.
More fascinatingly to me, the biggest speed-ups compilers get on superscalar architectures today have nothing to do with the addressing modes so much attention is being focused on here. They come from avoiding conditional jumps. Compilers will often emit code that evaluates both paths of a computation (thus burning 50% more ALU time calculating a result that will never be used), then choose the result they want with a cmov. In extreme cases I've seen that sort of thing gain them a factor of 10, which is far more than playing tiddlywinks with addressing modes will get you.
I have no idea how that will pan out for RISC-V. I don't think anyone has done a superscalar implementation of it yet(?). But in the non-superscalar implementations, the RISC-V instruction set choices have worked out very well so far. And when someone does do a superscalar implementation (and I'm sure there will be a lot of different implementations over time), it seems very possible x86's learnings on addressing-mode use will be yesterday's news.
It doesn't have to be _that_ bad. As long as condition flags are all written at once (or are essentially banked, like PowerPC's), the dependency issue can go away, because they're renamed and their results aren't dependent on previous data.
Now, of course, instructions that only update some condition flags and preserve others are the devil.
> all those extra instructions to compute carry will blow the I$ faster
I think the idea is, as others have mentioned, that the add/comp instructions are fused internally into a single instruction, so it's probably not as bad for the I$ as we might think? Is it actually implemented in any hardware?
When you hear the "<person / group> could make a better <implementation> in <short time period>" - call them out. Do it. The world will not shun a better open license ISA. We even have some pretty awesome FPGA boards these days that would allow you to prototype your own ISA at home.
In terms of the market, now is an exceptionally good time to go back to the design room. It's not as if anybody will be manufacturing much during the next year, with all the fabs unable to make even existing chips fast enough to meet demand. There is a window of opportunity here.
> It is, more-or-less a watered down version of the 30 year old Alpha ISA after all. (Alpha made sense at its time, with the transistor budget available at the time.)
As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.
I also really like the Unix philosophy of doing one simple thing well. Sure, it could have some special instruction that does exactly your use case in one cycle using all the registers, but that's not what has created such advances in general purpose computing.
> Sure, it is "clean" but just to make it clean, there was no reason to be naive.
I would much rather we build upon a conceptually clean instruction set than try to cobble together hacks on top of fundamentally flawed designs, even at the cost of performance. It's exactly these cobbled-together conceptual hacks that have led to the likes of the Spectre and Meltdown vulnerabilities, when instruction sets become so complicated that they cannot be easily tested.
But the author making an argument like that...
> I believe that an average computer science student could come up with a better instruction set than RISC-V in a single term project.
Pretty much blew their credibility. It's obviously wrong, and a sensible, fair person wouldn't write it.
When you use a CPU architecture you don't just get an ISA.
You also get compilers and debuggers. Ready-to-run Linux images. JIT compilers for JavaScript and Java. Debian repos and Python wheels with binaries.
And you get CPUs with all the most complex features. Instruction re-ordering, branch prediction, multiple cores, multi-level caches, dynamic frequency and voltage control. You want an onboard GPU, with hardware 4k h264 encoding and decoding? No problem.
And you get a wealth of community knowledge - there are forum posts and StackOverflow questions where people might have encountered your problems before. If you're hiring, there are loads of engineers who've done a bit of stuff with that architecture before. And of course vendors actually making the silicon!
I've seen ISAs documented with a single sheet of A4 paper. The difficult part in having a successful CPU architecture is all the other stuff :)
How about some 32 way SMT GPUs... No more divergence!
That allows more flexibility for CPU designs to optimize transistor count vs speed vs energy consumption.
This guy clearly did not look at the stated rationale for the design decisions of RISC-V.
Beyond that, compressed instructions are not a 1:1 substitute for more complex instructions, because a pair of compressed instructions cannot have any fields that cross the 16-bit boundary. This means you can't recover things like larger load/store offsets.
Additionally, you can't discard architectural state changes due to the first instruction. If you want to fuse an address computation with a load, you still have to write the new address to the register destination of the address computation. If you want to perform clever fusion for carry propagation, you still have to perform all of the GPR writes. This is work that a more complex instruction simply wouldn't have to perform, and again it complicates a high performance implementation.
They spent a lot of time and effort on making sure the decoding is pretty good and useful for high-performance implementations.
RISC-V is designed for very small and very large systems. At some point some trade-offs need to be made, but these are very reasonable and most of the time not a huge problem.
For the really specialized cases where you simply can't live with those extra instructions, they will be added to the standard, and then some profiles will include them and others won't. If those instructions are really as vital as those who want them claim, they will find their way into many profiles.
Saying RISC-V is 'terrible' because of those choices is not a fair way of evaluating it.
Besides that, you raise good points on sources of complexity. I’m waiting for the benchmarks once such developments have been incorporated. Everything else is guesswork.
More difficult than x86? We're talking about a damn simple variable width decoding here.
I could imagine RISC-V with C extension being more tricky than 64-bit ARM. Maybe.
> and again it complicates a high performance implementation.
But so much of the rationale behind the design of RISC-V is to simplify high performance implementation in other ways. So the big question is what the net effect is.
The other big question is if extensions will be added to optimise for desktop/server workloads by the time RISC-V CPUs penetrate that market significantly.
Of course you discard architectural state changes in fusion. If I have a bunch of instructions which end up reading from memory into register x10, then I can fuse with all previous instructions which wrote into x10, as their results get clobbered anyway.
Disclaimer: I may have misunderstood the point you made. However you don’t seem to make it clear how fusion is bad for performance.
What performance tricks are you giving up by doing fusion?
> I have heard that Risc V proponents say that these problems are known and could be fixed by having the hardware fuse dependent instructions. Perhaps that could lessen the instruction set shortcomings, but will it fix the 3x worse performance for cases like the one outlined here?
Macro-fusion can to some extent offset the weak instruction set, but you're never going to get a multiple integer multiplier speedup out of it given the complexity of inter-op architectural state changes that have to be preserved, and instruction boundary limitations involved; it's never going to offset a 3x blowup in instruction count in a tight loop.
Also, it's said that x86 is bad because the instructions are reorganized and translated inside the CPU. But it seems that you are proposing the same thing: a CPU that preprocesses the instructions and fuses some into a single one (the opposite of what x86 does). At that point, it seems to me that what x86 does makes more sense: have a ton of instructions (and thus smaller programs, and thus more code that can fit in cache) and split them, rather than having a ton of instructions (and wasting cache space) only for the CPU to combine them into a single one (a thing that a compiler could also do).
Anyway, what you gain from this is a very simple ISA, which helps tool writers, those who implement hardware, and academia, for teaching and research.
How does the insanely complex x86 instructions help anyone?
RISC-V has a number of places where it makes an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor, or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for pedagogical purposes. For a researcher trying to experiment with better branch prediction techniques, having a standard high-ish performance open source design they can take and modify with their ideas is immensely helpful. And many companies in the real world, with their eyes on the bottom line, like having an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need.
I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that and RISC-V is already a success in some of them.
Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy, maybe he's some sort of ISA expert. But it strains credulity that after all this time and work put into it, RISC-V is a "terrible architecture".
My expectation here is that RISC-V requires some inefficient instruction sequences in some corners somewhere (and one of these corners happens to be OP's pet use case), but by and large things are fine.
And even then, I don't think that's clear. You're not going to determine performance just by looking at a stream of instructions on modern CPUs. Hell, it's really hard to compare streams of instructions from different ISAs.
Seems quite balanced with all the other replies here which claim it's the best architecture ever whenever anyone says anything about it.
I don't think its vector extensions would be good for video codecs because they seem designed around large vectors. (and the article the designers wrote about it was quite insulting to regular SIMD)
RISC-V is pretty good. Probably slightly better for some things than ARM, and slightly worse for others. It's open, which is awesome, and the instruction set lends itself to extensions which is nice (but possibly risks the ecosystem fragmenting). Building really high performance RISC-V designs looks like it's going to rely on slightly smarter instruction decoders than we've seen in the past for RISCs, but it doesn't look insurmountable.
Bad? Quite possible, it was meant as a teaching ISA initially IIRC, but terrible? Who knows.
If you look at the early history of RISC-V, it does indeed look like as something built for teaching. But I don't think that use case warrants all the hype around it.
So how did all the hype form, and why is it that there are people seemingly hyping it as the next-gen dream-come-true super elegant open developed-with-hindsight ISA that will eventually displace crufty old x86 and proprietary ARM while offering better performance and better everything? Of course that just baits you into arguing about its potential performance. And don't worry if it doesn't have all the instructions you need for performance yet, we'll just slap it with another extension and it totally won't turn into a clusterfuck with a stench of legacy and numerous attempts at fixing it (coz' remember, hindsight)!
And then if you question its potential, you'll get someone else arguing that no no, it's not a high performance ISA for general use in desktops / servers, it's just an extensible ISA that companies can customize for their special sauce microcontrollers or whatever.
Of course it's all armchair speculation because there are no high performance real world implementations and there aren't enough experts you can trust.
typedef __int128_t int128_t;
int128_t add(int128_t left, int128_t right)
{
return left + right;
}
GCC 10, -O2, RISC-V:

add(__int128, __int128):
        mv      a5,a0
        add     a0,a0,a2
        sltu    a5,a0,a5
        add     a1,a1,a3
        add     a1,a5,a1
        ret

ARM64:

add(__int128, __int128):
        adds    x0, x0, x2
        adc     x1, x1, x3
        ret
This issue hurts the wider types that are compiler built-ins. Even though C has a programming model devoid of any carry-flag concept, canned types like a 128-bit integer can take advantage of it.
Portable C code to simulate a 128 bit integer will probably emit bad code across the board. The code will explicitly calculate the carry as an additional operand and pull it into the result. The RISC-V won't look any worse, then, in all likelihood.
(The above RISC-V instruction set sequence is shorter than the mailing list post author's 7 line sequence because it doesn't calculate a carry out: the result is truncated. You'd need a carry out to continue a wider addition.)
2 instructions to work with 64 bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to a loop, and 1 more instruction for a loop counter of some kind?
So we're looking at ~4 instructions for 64-bits on ARM/x86, but ~9-instructions on RISC-V.
The loop will be performed in parallel in practice however due to Out-of-order / superscalar execution, so the discussion inside the post (2 instruction on x86 vs 7-instructions on RISC-V) probably is the closest to the truth.
----------
Question: is ~2 clock ticks per 64 bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMDed. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V will probably be on equal footing there.
I don't know exactly how to write a bignum addition loop in AVX off the top of my head. But I'd assume it'd be similar to the 7-instructions listed here, except... using 256-bit AVX-registers or 512-bit AVX512 registers.
So 7 instructions to perform 512 bits of bignum addition is ~73 bits per clock cycle, far superior in speed to the 32 bits per clock cycle from add + adc (the 64-bit code with implicit condition codes).
AVX512 is uncommon, but AVX2 (256-bit integer ops) is common on x86 at least: leading to ~36 bits per clock tick.
----------
ARM has SVE, which is ambiguous (sometimes 128-bits, sometimes 512-bits). RISC-V has a bunch of competing vector instructions.
..........
Ultimately, I'm not convinced that the add + adc methodology here is best anymore for bignums. With a wide-enough vector, it seems more important to bring forth big 256-bit or 512-bit vector instructions for this use case?
EDIT: How many bits is the typical bignum? I think add+adc probably is best for 128, 256, or maybe even 512-bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without me writing code, but just a hunch).
2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA, but instead multiplication (and division which is based on multiplication).
There is only one standard V extension. Alibaba made a chip with a prerelease version of that extension, which is thus incompatible with the final version. But in practice that just means the vector unit on that chip goes unused because of the incompatibility, not that there are now competing standards.
add+adc should still be 64 bits per cycle. adc doesn't just add the carry bit, it's an add instruction which includes the usual operands, plus the carry bit from the previous add or adc.
Which is why I'm sure add / adc will still win at 128-bits, or 256-bits.
The main issue is that the vector-add instructions are missing carry-out entirely, so recreating the carry will be expensive. But with a big enough number, that carry propagation is parallelizable in log2(n), so a big enough bignum (like maybe 1024-bits) will probably be more efficient for SIMD.
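A minimal sketch of that deferred-carry idea (my own code, under the assumption that the compiler can vectorize the first loop): add all limbs without carries, record each limb's carry-out, then resolve carries in a second pass.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: n-limb bignum addition with deferred carries.
   First loop has no cross-iteration dependency, so it vectorizes;
   only the fix-up pass is sequential. The final carry-out of the
   whole number is left in carry[n-1]. Names are illustrative. */
void bignum_add(uint64_t *r, const uint64_t *a, const uint64_t *b,
                uint64_t *carry, size_t n)
{
    for (size_t i = 0; i < n; i++) {   /* vectorizable */
        r[i] = a[i] + b[i];
        carry[i] = r[i] < a[i];        /* carry-out of limb i */
    }
    for (size_t i = 1; i < n; i++) {   /* sequential fix-up */
        uint64_t t = r[i] + carry[i - 1];
        carry[i] |= t < r[i];          /* carry may ripple onward */
        r[i] = t;
    }
}
```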
MIPS didn't have a flag register either; it depended on a dedicated zero register and slt instructions (set if less than).
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
MIPS is classical RISC design that was not designed to be OoO-friendly at all and is simply designed for ease of straightforward pipelined implementation. The reason why it does not have flags probably simply comes down to the observation that you don't need flags for C.
Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything...
What sticks in my mind from my limited exposure to SuperH is that there's no load immediate instruction, so you have to do a PC-relative load instead. It was clearly optimized for compiled rather than handwritten code!
SuperH has a mov #imm, Rx that can take an 8-bit #imm. But you're right, literal pools were used just like on ARM.
Things I liked about SuperH: 16 bit fixed-width insn format (except for some SH2A and DSP ops), T flag for bit manipulation ops, GBR to enable scaled loads with offset, xtrct instruction, single-cycle division insns (div0, div1), MAC insns.
In terms of code density SH was quite effective, see here http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_dens... or here http://www.deater.net/weave/vmwprod/asm/ll/ll.html
Not having anything that stands out is perhaps a good thing. Being "clever" with the ISA tends to bite you when implementing OoO superscalar cores.
You can detect carry of (a+b) in C branch-free with: ((a&b) | ((a|b) & ~(a+b))) >> 31
So 64-bit add in C is:
f_low = a_low + b_low
c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31
f_high = a_high + b_high + c_high
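Packaged as a compilable function (a sketch with my own names), the three lines above become:

```c
#include <stdint.h>

/* Sketch: branch-free 64-bit add built from 32-bit halves, using the
   carry formula quoted above. Returns the carry-out of the low add. */
static uint32_t add64_via_32(uint32_t a_low, uint32_t a_high,
                             uint32_t b_low, uint32_t b_high,
                             uint32_t *f_low, uint32_t *f_high)
{
    uint32_t low = a_low + b_low;
    /* carry-out of the low add, computed without branches or flags */
    uint32_t c = ((a_low & b_low) | ((a_low | b_low) & ~low)) >> 31;
    *f_low = low;
    *f_high = a_high + b_high + c;
    return c;
}
```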
So for RISC-V in gcc 8.2.0 with -O2 -S -c I get:
add a1,a3,a2
or a5,a3,a2
not a7,a1
and a5,a5,a7
and a3,a3,a2
or a5,a5,a3
srli a5,a5,31
add a4,a4,a6
add a4,a4,a5
But for ARM I get (with gcc 9.3.1):
add ip, r2, r1
orr r3, r2, r1
and r1, r1, r2
bic r3, r3, ip
orr r3, r3, r1
lsr r3, r3, #31
add r2, r2, lr
add r2, r2, r3
It's shorter because ARM has bic. Neither one figures out to use carry-related instructions. Ah! But! There is a gcc builtin, __builtin_uadd_overflow(), that replaces the first two C lines above: c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
So with this:
RISC-V:
add a3,a4,a3
sltu a4,a3,a4
add a5,a5,a2
add a5,a5,a4
ARM: adds r2, r3, r2
movcs r1, #1
movcc r1, #0
add r3, r3, ip
add r3, r3, r1
RISC-V is faster. EDIT: Clang has an even better one: __builtin_addc().
f_low = __builtin_addcl(a_low, b_low, 0, &c);
f_high = __builtin_addcl(a_high, b_high, c, &junk);
x86: addl 8(%rdi), %eax
adcl 4(%rdi), %ecx
ARM: adds w8, w8, w10
add w9, w11, w9
cinc w9, w9, hs
RISC-V: add a1, a4, a5
add a6, a2, a3
sltu a2, a2, a3
add a6, a6, a2
I find it funny that you fall into the same pitfall as the author did.
Faster on which CPU?
The author doesn't measure on any CPU, so here there are dozens of people hypothesizing whether fusion happens or not, and what the impact is.
Perhaps faster means fewer instructions in this instance? Considering number of instructions is what has been discussed.
In addition to the actual ALU instructions doing the add with carry, for bignums it's important to include the load and store instructions. Even in L1 cache it's typically 2 or 3 or 4 cycles to do the load, which makes one or two extra instructions for the arithmetic less important. Once you get to bignums large enough to stream from RAM (e.g. calculating pi to a few billion digits) it's completely irrelevant.
This especially applies to potentially controversial things.
Overall, I feel HN is most fun when a lot of people are in disagreement but also operating in good faith.
But I agree that this bit of writing comes across as a bit overly assertive and arrogant; and probably trivially proved wrong by actually running some benchmarks.
By the same reasoning, the Apple M1 would obviously be slower than anything Intel and AMD produce given similar energy and transistor density constraints (i.e. same class of hardware). Except that obviously isn't the case and we have the Macbook air with the M1 more than holding up against much more expensive Intel/AMD chips. Reason: chips don't actually work like this person seems to assume. The whole article is a sandcastle of bad assumptions leading up to an arrogantly worded & wrong conclusion.
You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.
Many people still think that RISC-V implies an open source implementation, for example.
The minimum duration of the clock cycle of a modern CPU is essentially determined by the duration of a 64-bit integer addition/subtraction, because such operations need a latency of only 1 clock cycle to be useful.
Operations that are more complex than 64-bit integer addition/subtraction, e.g. integer multiplications or floating-point operations, need multiple cycles, but they are pipelined so that their throughput remains at 1 per cycle.
So 64-bit addition/subtraction is certainly expected to be included in any RISC ISA.
The hardware adders used for addition/subtraction provide, at a negligible additional cost, 2 extra bits, carry and overflow, which are needed for operations with large integers and for safe operations with 64-bit integers.
The problem is that the RISC-V ISA does not offer access to those 2 bits and generating them in software requires a very large cost in execution time and in lost energy in comparison with generating them in hardware.
I do not see any relationship between these bits and the RISC concepts, omitting them does not simplify the hardware, but it makes the software more complex and inefficient.
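For reference, this is the kind of safe operation in question. A minimal sketch using the GCC/Clang generic builtin (the wrapper name is mine): on x86 and ARM the overflow test comes straight from the adder's flag bits, while on RV64I the compiler must synthesize it with extra instructions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: checked signed 64-bit addition. __builtin_add_overflow is a
   GCC/Clang extension; it reports whether the mathematically exact sum
   fits in int64_t. Returns true on success, false on overflow. */
static bool checked_add(int64_t a, int64_t b, int64_t *out)
{
    return !__builtin_add_overflow(a, b, out);
}
```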
My code snippet results in bloated code for RISC-V RV64I.
I'm not sure how bloated it is. All of those instructions will compress [1].
[1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse...
It's slower on RISC-V but not a lot on a superscalar. The x86 and ARMv8 snippets have 2 cycles of latency. The RISC-V has 4 cycles of latency.
1. add t0, a4, a6 add t1, a5, a7
2. sltu t6, t0, a4 sltu t2, t1, a5
3. add t4, t1, t6 sltu t3, t4, t1
4. add t6, t2, t3
I'm not getting terrible from this. On the other hand, I take this article with a grain of salt anyhow, since it only discusses a single example. I think we would need a lot more optimized assembly snippet comparisons to draw meaningful conclusions (and even then the author's selection of examples could be biased).
>"here's this snippet, it takes more instructions on RISC-V, thus RISC-V bad"
Is pretty much what it's saying. An actual argument about ISA design would weight the cost this has with the advantages of not having flags, provide a body of evidence and draw conclusions from it. But, of course, that would be much harder to do.
What's comparatively easy and they should have done, however, is to read the ISA specification. Alongside the decisions that were made, there's a rationale to support it. Most of these choices, particularly so the ones often quoted in FUD as controversial or bad, have a wealth of papers, backed by plentiful evidence, behind them.
For those who are more versed: is this really a general problem?
I was under the impression that the real bottleneck is memory, that things like this would be fixed in real applications through out-of-order execution, and that it paid off to have simpler instructions because compilers had more freedom to rearrange things.
Is that even a fair comparison given the arm and x86 versions used as examples of "better" were 64 bit?
If we're really comparing 32 and 64 and complaining that 32 bit uses more instructions than 64, perhaps we should dig out the 4 bit processors and really sharpen the pitchforks. Alternatively, we could simply not. Comparing apples to oranges doesn't really help.
From the article:
Let's look at some examples of how Risc V underperforms.
First, addition of a double-word integer with carry-out:
add t0, a4, a6 // add low words
sltu t6, t0, a4 // compute carry-out from low add
add t1, a5, a7 // add hi words
sltu t2, t1, a5 // compute carry-out from high add
add t4, t1, t6 // add carry to low result
sltu t3, t4, t1 // compute carry out from the carry add
add t6, t2, t3 // combine carries
Same for 64-bit arm:
adds x12, x6, x10
adcs x13, x7, x11
Same for 64-bit x86:
add %r8, %rax
adc %r9, %rdx
You should take into account that the libgmp authors have a huge amount of experience in implementing operations with large integers on a very large number of CPU architectures, i.e. on all architectures supported by gcc, and for most of those architectures libgmp has been the fastest during many years, or it still is the fastest.
"I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project"
Utter horse manure.
Perhaps something similar is needed within ISAs / CPUs ? Say an OS kernel, a ZIP-algorithm, Mandelbrot, Fizz-buzz ... could measure code compactness but also performance and energy usage.
Everything should be written in C, or some scripting language implemented in C. Writing safe code is easy, just wrap everything in layers of macros that the compiler will magically optimize away, and if it doesn't, computers are fast enough anyway, right? The mark of a real programmer is that every one of their source files includes megabytes of headers defining things like __GNU__EXTENSION_FOO_BAR_F__UNDERSCORE_.
You say your processor has a single instruction to do some extremely common operation, and want to use it? You shouldn't even be reading a processor manual unless you are working on one of the two approved compilers, preferably GCC! If you are very lucky, those compiler people that are so much smarter than you could hope to be, have already implemented some clever transformation that recognizes the specific kind of expression produced by a set of deeply nested macros, and turns them into that single instruction. In the process, it will helpfully remove null pointer checks because you are relying on undefined behaviour somewhere else.
You say you'll do it in assembly? For Kernighan's sake, think about portability!!! I mean, portable to any other system that more or less looks the same as UNIX, with a generous sprinkling of #ifdefs and a configure script that takes minutes to run.
Implement a better language? Sure, as long as the compiler is written in C, preferably outputs C source code (that is then run through GCC), and the output binary must of course link against the system's C library. You can't do it any other way, and every proper UNIX - BSD or Mac OS X - will make it literally impossible by preventing syscalls from any other piece of code.
IMO this is like a cultural virus that seems to have infected everything IT-related, and I don't exactly understand why. Sure, having all these layers of cruft down below lets us build the next web app faster, but isn't it normal to want to fix things? Do some people actually get a sense of satisfaction out of saying "It is a solved problem, don't reinvent the wheel"? Or do they want to think that their knowledge of UNIX and C intricacies is somehow the most important, fundamental thing in computer science?
Isn't this the classic RISC vs CISC problem?
Comparing x86/ARM to RISC-V feels like Apples to Grains of Rice.
If RISC-V was born out of a need for an open source embedded ISA, would the ISA not need to remain very RISC-like to accommodate implementations with fewer available transistors... Or is this an outdated assumption?
Maybe SISC - "Simplified" instruction set computing, perhaps. ARM isn't exactly super complicated in this particular aspect (it is elsewhere), but in this case the designers basically chose to make branches simpler at the expense of code that needs to check overflows (or flags more generally)
RISC-V was born partly out of a desire for a teaching ISA, also, so simplicity is a boon in that context too.
Whether the similar awkwardness applies to a lot of other code or not is not being told by this isolated case.
Moderators, where are you?
I'm not a fan of the RISC-V design but the presence or absence of this instruction doesn't make it a terrible architecture.
It does not matter much, because there is a sequence of dependent instructions, which cannot be executed in parallel, regardless of the maximum IPC of a RISC-V CPU.
The opinions from those messages matter, because they belong to experts in implementing operations with large integers on a lot of different CPU architectures, with high performance proven during decades of ubiquitous use of their code. They certainly have a better track record than any RISC-V designer.
It doesn't matter how great something else could be in theory if it doesn't exist or doesn't meet the same scale and mindshare (or adoption).