But ultimately, the gist of their argument is this:
>Any task will require more RISC-V instructions than any contemporary instruction set.
Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
I am familiar with many tens of instruction sets, from the first vacuum-tube computers up to all the important instruction sets that are still in use, and there is no doubt that RISC-V requires more instructions and a larger code size than almost all of them, for any task.
Even the hard-to-believe "research" results published by RISC-V developers have always shown worse code density than ARM; the so-called better results were for the compressed extension, not for the normal encoding.
Moreover, the results for RISC-V are hugely influenced by the programming language and the compiler options that are chosen. RISC-V has an acceptable code size only for unsafe code; if the programming language or the compiler options require run-time checks to ensure safe behavior, then the RISC-V code size increases enormously, while for other CPUs it barely changes.
The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.
Except for this good feature, the rest of the ISA is full of bad features which frequently require at least 2 instructions where any other CPU needs 1, e.g. the lack of indexed addressing, which any loop that accesses an aggregate data structure needs in order to be implemented with a minimum number of instructions.
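For a concrete sketch of the indexed-addressing point (hypothetical C, function names mine): with a base+scaled-index addressing mode, the load in the first loop below can be one instruction; without one, a separate address computation is needed each iteration, which is why compilers targeting base RISC-V tend to rewrite such loops into the pointer-bumping form of the second function.

```c
#include <assert.h>
#include <stddef.h>

/* Summing an array two ways. With indexed addressing (base + i*8),
   the load inside sum_indexed can be a single instruction on ARM or
   x86; base RISC-V has no such mode, so each iteration would need an
   extra shift/add to form the address. */
long sum_indexed(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];          /* load at address a + i*sizeof(long) */
    return total;
}

/* The strength-reduced form compilers prefer on RISC-V: a plain load
   plus a pointer increment per iteration, no address arithmetic. */
long sum_strength_reduced(const long *a, size_t n) {
    long total = 0;
    for (const long *p = a, *end = a + n; p != end; p++)
        total += *p;
    return total;
}
```

Both compute the same result; the disagreement in the thread is only about how many instructions the loop body costs on each ISA.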
>Even the hard-to-believe "research" results published by RISC-V developers have always showed worse code density than ARM
The code size advantage of RISC-V is not artificial academic bullshit. It is real, it is huge, and it is trivial to verify. Just build any non-trivial application from source with a common compiler (such as GCC or LLVM's clang) and compare the sizes you get. Or look at the sizes of binaries in Linux distributions.
>the so-called better results were for the compressed extension, not for the normal encoding.
The C extension can be used anywhere, as long as the CPU supports the extension; most RISC-V profiles require it. This is in stark contrast with ARMv7's Thumb, which was a literally separate CPU mode. Effort was put into making this very cheap for the decoder.
The common patterns where the number of instructions is larger are made irrelevant by fusion. RISC-V has been thoroughly designed with fusion in mind, and is unique in this regard. It is within its rights in calling itself the 5th-generation RISC ISA because of this, even if everything else is ignored.
Fusion will turn most of these "2 instructions instead of one" cases into what is effectively one instruction from the execution unit's perspective. There are opportunities for fusion everywhere; the patterns are designed in. The cost of fusion on RISC-V is also very low, often quoted as 400 gates, allowing even simpler microarchitectures to implement it.
https://news.ycombinator.com/item?id=25554865
https://news.ycombinator.com/item?id=25554779
Is the Googrilla search engine really starting to suck more and more, or is there something else going on in this case?
The threads read more like an incomplete explanation with a polarized view than anything useful for understanding what fusion means in this context.
Overall I give the ranking a score of D-.
This is disingenuous. arm32's Thumb-2 (which has been around since 2003) supports both 16-bit and 32-bit instructions in a single mode, making it directly comparable to RV32C.
And then they get combined in the CPU, right?
Won't those instructions need to be fetched / occupy cache?
Ignoring RISC-V’s compressed encoding seems a rather artificial restriction.
The "C" extension is technically optional, but I'm not aware of anyone who has made or sold a production chip without it -- generally only student projects or tiny cores for FPGAs running very simple programs don't have it.
My estimate is if you have even 200 to 300 instructions in your code it's cheaper to implement "C" than to build the extra SRAM/cache to hold the bigger code without it.
The compressed RISC-V encoding must be compared with the ARMv8-M encoding not with the ARMv8-A.
The base 32-bit RISC-V encoding may be compared with the ARMv8-A, because only the 32-bit encoding can have comparable performance.
All the comparisons where RISC-V has better code density compare the compressed encoding with the 32-bit ARMv8-A. This is a classic example of apples-to-oranges, because the compressed encoding will never have performance in the same league as ARMv8-A.
When the comparisons are matched, 16-bit RISC-V encoding with 16-bit ARMv8-M and 32-bit RISC-V with 32-bit ARMv8-A, RISC-V always loses in code density in both comparisons, because only the RISC-V branch instructions are frequently shorter than those of ARM, while all the other instructions are frequently longer.
There are good reasons to use RISC-V for various purposes, where either the lack of royalties or the easy customization of the instruction set are important, but claiming that it should be chosen not because it is cheaper, but because it is supposedly better, looks like a case of sour grapes.
The value of RISC-V is not in its instruction set, because there are thousands of people who could design better ISAs in a week of work.
What is valuable about RISC-V is the set of software tools, compilers, binutils, debuggers etc. While a better ISA can be done in a week, recreating the complete software environment would need years of work.
Which isn't really a big advantage, because ARM and x86 macro-op fuse those instructions together. (That is, those 2-instructions are decoded and executed as 1x macro-op in practice).
cmp/jnz on x86 is like, 4 bytes as well. So 4 bytes on x86 vs 4 bytes on RISC-V. 1 macro-op on x86 vs 1 instruction on RISC-V.
So they're equal in practice.
-----
ARM is 8-bytes, but macro-op decoded. So 1-macro op on ARM but 8-bytes used up.
For x86, cmp/jnz must be 5 bytes for short loops or 9 bytes for long loops, because the REX prefix is normally needed. x86 does not have address modes with auto-update, like ARM or POWER, so for a minimum number of instructions the loop counter must also be used as an index register, to eliminate the instructions for updating the indices.
Because of that, the loop counter must use the full 64-bit register even if it is certain that the loop count would fit in 32-bit. That needs the REX prefix, so the fused instruction pair needs either 5 bytes (for 7-bit branch offsets) or 9 bytes, in comparison with 4 bytes for RISC-V.
So RISC-V gains about 1 byte in every 20 bytes from the branch instructions, i.e. about 5%, but then it loses more than this on other instructions, so it ends up with a code size larger than Intel/AMD's by between 10% and 50%.
You have no idea what you're talking about. I've worked on designs with both ARM and RISC-V cores. The RISC-V core outperforms the ARM core, with a smaller gate count, and has similar or higher code density in real-world code, depending on the extensions supported. The only way you get much lower code density is without the C extension, but I haven't seen it left unimplemented in a real-world commercial core, and where it was, I'm sure there was a reason (FPGAs sometimes use ultra-simple cores for some tasks, and don't always care about instruction throughput or density).
It should be said that my experience is in embedded, so yes, it's unsafe code. But the embedded use-case is also the most mature. I wouldn't be surprised if extensions that help with safer programming languages were added for desktop/server class CPUs, if they haven't been already (I haven't followed the development of the spec that closely recently).
> You have no idea what you're talking about.
> It should be said that my experience is in embedded, so yes, it's unsafe code.
Just going based off your reply it certainly sounds like they had at least some idea what they were talking about? In which case omitting that sentence would probably help.
I have no horse in the technical race here, but I certainly am put off from reading what should be an intellectually stimulating discussion by the nature of replies like this.
The linked message is about the carry propagation pattern used in GMP. As I understand it, optimized bignum algorithms accumulate carry bits and propagate them in bulk, and don't benefit from an optimal one-bit-at-a-time carry propagation pattern.
What are your thoughts on the way RISC-V handled the compressed instructions subset?
Put it side-by-side with Thumb and it also looks pretty similar (thumb has a multiply instruction IIRC).
Put it side-by-side with short x86 instructions accounting for the outdated ones and the list is pretty similar (down to having 8 registers).
All in all, when old and new instruction sets are taking the same approach, you can be reasonably sure it's not the absolute worst choice.
Higher-level languages rely heavily on inlining to reduce their abstraction penalty. Profiles which were taken from the Linux kernel and (checks notes...) Dhrystone are not representative of code from higher-level languages.
3/4 of the available prefix instruction space was consumed by the 16-bit extension. There have been a couple of proposals showing that even better density could be achieved using only 1/2 the space instead of 3/4, but they were struck down in order to maintain backwards compatibility.
With this extension, RISC-V can be competitive with ARM Cortex-M.
On the other hand, the compressed instruction encoding is useless for general-purpose computers intended as personal computers or as servers, because it limits the achievable performance to much lower levels than for ARMv8-A or Intel/AMD.
It's perfectly possible to have read the spec and disagree with the rationale provided. RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.
> Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
This doesn't seem to be true when you actually do an apples-to-apples comparison.
Taking as an example the build of Bash in Debian Sid (https://packages.debian.org/sid/shells/bash). I chose this because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other examples like the Linux kernel are harder to compare because the code in question is different across architectures. I saw the same trend in the GCC package, so it's not an isolated example.
riscv64 installed size: 6,157.0 kB
amd64 installed size: 6,450.0 kB
arm64 installed size: 6,497.0 kB
armhf installed size: 6,041.0 kB
RV64 is outperforming the other 64-bit architectures, but under-performing 32-bit ARM. This is consistent with expectations: amd64 has a size penalty due to REX bytes, arm64 got rid of compressed instructions to enable higher performance, and armhf (32-bit) has smaller constants embedded in the binary.
Compressed instructions definitely do work for making code smaller, and that's part of why arm32 has been very successful in the embedded space, and why that space hasn't been rushing to adopt arm64. For arm32, however, compressed instructions proved to be a limiting factor on high performance implementation, and arm64 moved away from them because of it. Maybe that's due to some particular limitations of arm32's compressed instructions that RISC-V compressed instructions won't suffer from, but that remains to be proven.
To compare the code sizes, you need tools like "size", "readelf" etc. and the data given by the tools should still be studied, to see how much of the code sections really contain code.
I have never seen until now a program where the RISC-V variant is smaller than the ARMv8 or Intel/AMD variant, and I doubt very much that such a program can exist. Except for the branches, where RISC-V frequently needs only 4 bytes instead of 5 bytes for Intel/AMD or 8 bytes for ARMv8, for all the other instructions it is very frequent to need 8 bytes for RISC-V instead of 4 bytes for ARMv8.
Moreover, choosing compiler options like -fsanitize for RISC-V increases the number of instructions dramatically, because there is no hardware support for things like overflow detection.
And yet you're quite confident that RISC-V has poor code density. So you clearly have a source of knowledge that others don't. If it's a blog/article/research, could you share a link? If it's personal experimentation, you should write a blog post, I would totally read that.
Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field? Obviously ISAs are a low level detail and far removed from the end product, but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon, before ubiquitous internet and the open source culture could establish itself. We've now got companies using genetic algorithms to optimize ICs and people making their own semiconductors in the 100s of microns range in their garages - maybe it's time to rethink some things!
(EE in a past life but little experience designing ICs so I feel like I'm talking out of my rear end)
Well, it's exactly what many RISC-V folks are trying to do. There's news about a new high performance RISC-V core on the HN front page right now!
> but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon,
I just want to note that ARM64 was a mostly clean break from prior versions of ARM. Basically a clean slate design started in the late 2000s. It's a modern design built with the same hindsight and approximate market conditions available to the designers of RISC-V.
text data bss dec hex filename
311218 2284 36 313538 4c8c2 arm-linux-gnueabihf/libgmp.so.10
374878 4328 56 379262 5c97e riscv64-linux-gnu/libgmp.so.10
480289 4624 56 484969 76669 aarch64-linux-gnu/libgmp.so.10
511604 4720 72 516396 7e12c x86_64-linux-gnu/libgmp.so.10
Strange, that's not what I see.

Fusing instructions isn't just theoretical either. I'm pretty sure it is or will be a common optimisation for CPUs aiming for high performance. How exactly is two easily-fused 16-bit instructions worse than one 32-bit one? Is there really a practical difference other than the name of the instruction(s)?
At the same time, the reduced transistor count you get from a simpler instruction set is not a benefit to be just dismissed either. I'm starting to see RISC-V cores being put all over the place in complex microcontrollers, because they're so damn cheap, yet have very decent performance. I know a guy developing a RISC-V core. He was involved with the proposal for a couple of instructions that would put the code density above Thumb for most code, and the performance of his core was better than Cortex-M0 at a similar or smaller gate count. I'm not sure if the instructions were added to the standard or not, though.
Even for high performance CPUs, there's a case to be made for requiring fewer transistors for the base implementation. It makes it easier to make low-power low-leakage cores for the heterogeneous architecture (big.little, M1, etc.) which is becoming so popular.
Funny, I thought the whole thing was bitching that RISC V has no carry flag which obviously causes multi word arithmetic to take more instructions. The obvious workaround is to use half-words and use the upper half for carry. There may be better solutions, but at twice the number of instructions this "dumb" method is better than what the author did.
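The half-word workaround can be sketched in C (a minimal sketch, with hypothetical function names; real bignum libraries like GMP are far more tuned than this): keep 32-bit digits in 64-bit limbs so an add can never overflow the limb, and read the carry out of the upper half.

```c
#include <assert.h>
#include <stdint.h>

/* "Dumb" half-word bignum add: each 64-bit limb holds a 32-bit digit,
   so a[i] + b[i] + carry cannot overflow, and the carry bit simply
   lands in the upper half of the sum. No flags register needed. */
void add_halfword(uint64_t *r, const uint64_t *a, const uint64_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i] + carry;   /* fits in 64 bits */
        r[i]  = s & 0xFFFFFFFFu;            /* low 32 bits: the digit */
        carry = s >> 32;                    /* upper half: the carry */
    }
}
```

The cost, as noted, is twice as many limbs (and so roughly twice as many instructions) for the same number size.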
Flags were removed because they cause a lot of unwanted dependencies and contention in hardware designs and they aren't even part of any high level language.
I still think instead of compare-and-branch they should have made an "if" which would execute the following instruction only if true. But that's just an opinion. I also hate the immediate constants (12 bits?) inside the instruction. Nothing wrong with 16-, 32- or 64-bit immediate data after the opcode.
I hope RISC 6 will come along down the road (not soon) and fix a few things. But I like the lack of flags...
RISC-V basically says "let's make the implicit explicit", and you have to essentially use registers to store the carry information when operating on bigints. Which for the current impl means chaining more instructions.
Is that correct?
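That's roughly it, as I understand it. In C it looks like this (a sketch, function names mine): without a flags register, the carry out of a full-word add is recovered with an unsigned comparison, which maps to RISC-V's sltu instruction, and then added into the next limb.

```c
#include <assert.h>
#include <stdint.h>

/* Full-word bignum add without a carry flag: the carry is derived
   explicitly. If a + b wrapped, the sum is smaller than a, which is
   exactly what sltu computes on RISC-V. At most one of c1/c2 can be
   set, so OR-ing them gives the carry out of this limb. */
void add_fullword(uint64_t *r, const uint64_t *a, const uint64_t *b,
                  int n, uint64_t *carry_out) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s  = a[i] + b[i];
        uint64_t c1 = s < a[i];        /* carry from a + b (sltu) */
        r[i] = s + carry;
        uint64_t c2 = r[i] < s;        /* carry from adding carry-in */
        carry = c1 | c2;
    }
    *carry_out = carry;
}
```

Each limb costs an extra compare or two versus an add-with-carry instruction, which is the extra chaining being complained about.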
That sounds like what the FP crowd is always talking about - eschewing shared state so it's easier to reason about, optimize, parallelize, etc.
Nevertheless, the ISA speaks for itself. The goal of a technical project is to produce a technical artifact, not to generate good feelings having followed a "solid" process.
If the process you followed brought you to this, of what use was the process?
Also, the godbolt.org compiler explorer has Risc-V support: useful for someone interested in comparing specific snippets of code.
https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...
Instruction fusion has no effect on code size, but only on execution speed.
For example RISC-V has combined compare-and-branch instructions, while the Intel/AMD ISA does not have such instructions, but all Intel & AMD CPUs fuse the compare and branch instruction pairs.
So there is no speed difference, but the separate compare and branch instructions of Intel/AMD remain longer at 5 bytes, instead of the 4 bytes of RISC-V.
Unfortunately for RISC-V, this is the only example favorable for it, because for a large number of ARM or Intel/AMD instructions RISC-V needs a pair of instructions or even more instructions.
Fusing instructions will not help RISC-V with the code density, but it is the only way available for RISC-V to match the speed of other CPUs.
Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance.
The context here is the implementation of one of the inner loops of a high-performance arbitrary-precision arithmetic library (GMP); in RISC-V the loop has 3x the instruction count it has in competing architectures.
“The compiler” is not relevant, this is by design stuff that the compiler is not supposed to touch because it’s unlikely to have the necessary understanding to get it as tight and efficient as possible.
Wait, if we are talking about actual ISA instructions, why is it hard to believe that RISC-V would have more of them ? The argument in favor of RISC is to simplify the frontend because even for a complex ISA like x86, the instructions will get converted to many micro-ops. In terms of actual ISA instructions, it seems quite reasonable that x86 would have fewer of those (at the cost of frontend complexity).
Doing it using a small pool of instructions, too (as RISC-V does), is the cherry on top.
Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions despite the fact that complexity has its own downsides.
RISC-V has been designed aggressively to have minimal ISA to leave plenty of room to grow, and require minimal number of transistors for a minimal solution.
Should this be a showstopper down the road, then there will be plenty of space to add an extension that fixes this problem. Meanwhile embedded systems paying a premium for transistors are not going to have to pay for these extra instructions, as only 47 instructions have to be implemented in a minimal solution.
I think in 10-20 years everyone will agree that all the "bad" RISC-V decisions don't matter. The same way x86 (CISC) was supposed to be bad because of legacy/backwards compatibility.
It's a trade-off - and the one that's been made makes it possible to make ALL instructions a little faster at the expense of one particular case that isn't used much - that's how you do computer architecture: you look at the whole, not just one particular case.
RISC-V also specifies a 128-bit variant that is of course FASTER than these examples.
I wish there was a way out.
Language features are also often implemented at least partly because they can be done efficiently on the premiere hardware for the language. Then new hardware can make such features hard to implement.
WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.
I'm sure that people brought up the overflow situation to the RISC-V designers, and it was similarly dismissed. It's just unfortunate that legacy software is such a big driver of CPU features as that's a race towards lowest-common-denominator hardware.
Main one is interrupt on overflow.
Can you refresh my memory here? What exactly is different about Wasm return values than any other function-oriented language?
That's probably not true in the usual case. Most architectures are 64-bit nowadays. If you are working on something that isn't 64-bit you are doing embedded stuff, and different rules and coding standards apply (like using embedded assembler rather than pure C or Rust). In 64-bit environments only pointers are 64 bits by default; almost all integers remain 32-bit. Checking for a 32-bit overflow on a 64-bit RISC-V machine takes the same number of instructions as everywhere else. Also, in C integers are very common because they are used as iterators (i.e., stepping along things in for loops). But in Rust, iterators replace integers for this sort of thing. There is still an integer under the hood of course, and perhaps it will be bounds checked. But that is bounds checked - not overflow checked. 2^32 is far larger than most data structures in use. Which means while there may be some code bloat, the lack of full 64-bit integers in your average Rust program means it's going to be pretty rare.
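The 32-bit-check-on-a-64-bit-machine point can be sketched in C (hypothetical helper, name mine): do the add in 64 bits, where it can't wrap, and test whether the result still fits in an int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-checked 32-bit add on a 64-bit machine: widen, add, and
   range-check. This is a couple of instructions on RV64, same as on
   other 64-bit ISAs, since no overflow flag is involved. Returns 1 on
   success, 0 on overflow. */
int checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow */
    if (wide < INT32_MIN || wide > INT32_MAX)
        return 0;
    *out = (int32_t)wide;
    return 1;
}
```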
Since I'm here, I'll comment on the article. It's true the lack of carry will make adds a little more difficult for multi-precision libraries. But - I've written a multi-precision library, and the adds are the least of your problems. Adds just generate 1 bit of carry. Multiplies generate an entire word of carry, and they are almost as common as adds. Divides are not so common, fortunately, but the execution time of just one divide will make all the overhead caused by a lack of carry look like insignificant noise.
I'm no CPU architect, but I gather the lack of carry and overflow bits makes life a little easier for just about every instruction other than adc and jo. If that's true, I'd be very surprised if the cumulative effect of those little gains didn't completely overwhelm the wins adc and jo gets from having them. Have a look at the code generated by a compiler some time. You will have a hard time spotting the adc's and jo's because there are bugger all of them.
That said, I think it's less of an issue these days for JS implementors in particular. It might have mattered more back in the day when pure JS carried a lot of numeric compute load and there weren't other options. These days it's better to stow that compute code in wasm and get predictable reliable performance and move on.
The big pain points in perf optimization for JS is objects and their representation, functions and their various type-specializations.
Another factor is that JS impls use int32s as their internal integer representation, so there should be some relatively straightforward approach involving lifting to int64s and testing the high half for overflow.
Still kind of cumbersome.
There are similar issues in existing ISAs. NaN-boxing for example uses high bits to store type info for boxed values. Unboxing boxed values on amd64 involves loading an 8-byte constant into a free register and then using that to mask out the type. The register usage is mandatory because you can't use 64-bit values as immediates.
I remember trying to reduce code size and improve perf (and save a scratch register) by turning that into a left-shift right-shift sequence involving no constants, but that led to the code executing measurably slower as it introduced data dependencies.
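For readers unfamiliar with NaN-boxing, here's a minimal sketch (the bit layout is illustrative, not any particular engine's): the payload sits in the low 48 bits and a type tag above it, so unboxing is an AND with a 64-bit mask - which on amd64 can't be an immediate and must first be loaded into a register.

```c
#include <assert.h>
#include <stdint.h>

/* Toy NaN-boxing layout: tag in the high 16 bits, payload in the low
   48. Unboxing the payload is the masked-constant operation described
   above; the 64-bit mask is too wide for an amd64 immediate. */
#define PAYLOAD_MASK 0x0000FFFFFFFFFFFFull
#define TAG_SHIFT    48

static uint64_t box(uint64_t tag, uint64_t payload) {
    return (tag << TAG_SHIFT) | (payload & PAYLOAD_MASK);
}

static uint64_t unbox_payload(uint64_t v) {
    return v & PAYLOAD_MASK;   /* needs the mask in a register on amd64 */
}

static uint64_t unbox_tag(uint64_t v) {
    return v >> TAG_SHIFT;     /* shifts avoid the constant entirely */
}
```

The shift-based tag extraction is the constant-free alternative mentioned above; as noted, replacing the mask with shift pairs can introduce data dependencies that cost more than the register they save.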
It just feels backwards to me to increase the cost of these checks in a time where we have realized that unchecked arithmetic is not a good idea in general.
If desktop/server-class RISC-V CPUs become more common, it's not unreasonable to think they'll add an extension that covers the needs of managed/higher-level languages.
Even for server-class CPUs you could argue that you absolutely want this extension to be optional, as you can design more efficient CPUs for datacenters/supercomputers where you know what kind of code you'll be running.
So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.
A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".
I would like to see some benchmarks of this efficient implementation in hardware, even simulated hardware, compared against conventional architectures.
Even for C, it's a recurring source of bugs and vulnerabilities that int overflow goes undetected. What we really need is an overflow trap like the one in IEEE floating point. RISC-V went the opposite direction.
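For what it's worth, this is the check compilers have to materialize in software today. GCC and Clang expose it as `__builtin_add_overflow`; on an ISA with an overflow flag it can compile to an add plus a branch-on-overflow, while without one, extra instructions are needed to derive the condition:

```c
#include <assert.h>

/* Overflow-checked add using the GCC/Clang builtin (not portable to
   all compilers). Returns 1 on success, 0 if the add wrapped; the
   result is stored through out either way per the builtin's contract. */
int safe_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);
}
```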
Where this really bites you is in workloads dominated by tight loops (image processing, cryptography, HPC, etc). While a microarchitecture may be more efficient thanks to simpler instructions (ignoring the added complexity of compressed instructions and macro-fusion, the usual suggested fixes...), it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.
Instruction decoding and memory ordering can be a bit of a nightmare on CISC ISAs, and fewer macro-instructions are not automatically a win. I guess we'll eventually see in benchmarks.
Even though Intel has had decades to refine their CPUs I'm quite excited to see where RISC-V is going.
Macro fusion definitely has a place in microarchitecture performance, especially when you have to deal with a legacy ISA. RISC-V makes the very unusual choice of depending on it for performance, when most ISAs prefer to fix the problem upstream.
This is technically true but not really. Decoding into many instructions is mainly used for compatibility with the crufty parts of the x86 spec. In general, for anything other than rmw or locking a competent compiler or assembly writer will only very rarely emit instructions that compile to more than one uop. The way the frontend works, microcoded instructions are extraordinarily slow on real cpus.
Modern x86 is basically a risc with a very complex decode, few extra useful complex operations tacked on, and piles and piles of old moldy cruft that no-one should ever touch.
As someone else who replied said, I'm not a CPU architect, just software that works close to the metal. That means I pay attention to compiler output.
What you say is true of the very early days: compilers did indeed use the x86's addressing modes in all sorts of odd ways to squeeze as many calculations as possible into as few bytes as possible. Then it went in the reverse direction: you started seeing compilers emitting long series of simple instructions instead, seemingly deliberately avoiding those complex addressing modes. And now it's swung back again - the compiler using addressing modes to do a shift plus a couple of adds in one instruction is common again. I presume all these shifts were driven by the speed of the resulting code.
I have no idea why one method was faster than the other - but clearly there is no hard and fast rule operating here. For some internal x86's implementations using complex addressing modes was a win. On some, for exactly the same instruction set, it wasn't. There is no cut and dried "best" way of doing it, rather it varies as the transistor and power budget changes.
One thing we do know about RISC-V is that it is intended to cover a _lot_ of transistor and power budgets. Where it's used now (low power / low transistor count), their design decisions have turned out _very_ well, far better than x86.
More fascinatingly to me, today the biggest speed ups compilers get for super scalar arch's has nothing to do with the addressing modes so much attention is being focused on here. It comes from avoiding conditional jumps. The compilers will often emit code that evaluates both paths of the computation (thus burning 50% more ALU time on a calculating a result that will never be used), then choose the result they want with a cmov. In extreme cases, I've seen doing that sort of thing gain them a factor of 10, which is far more than playing tiddly winks with addressing modes will get you.
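The cmov pattern can be sketched in C (a toy example, names mine): both arms of the computation are evaluated up front, and a ternary selects the result, which a compiler can lower to a conditional move so the pipeline never sees an unpredictable branch.

```c
#include <assert.h>

/* Branchless selection: the caller has already computed both arms, so
   the ternary here is a candidate for a cmov rather than a jump. */
static int select_branchless(int cond, int if_true, int if_false) {
    return cond ? if_true : if_false;
}

/* Example: both paths' work (the doubling) is done unconditionally,
   then the wanted result is picked. Half the ALU work is wasted, but
   no branch misprediction is possible. */
static int double_if_positive(int x) {
    int doubled = x * 2;
    return select_branchless(x > 0, doubled, 0);
}
```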
I have no idea how that will pan out for RISC-V. I don't think anyone has done a superscalar implementation of it yet(?) But in the non-superscalar implementations the RISC-V instruction set choices have worked out very well so far. And when someone does do a superscalar implementation (and I'm sure there will be a lot of different implementations over time), it seems very possible x86's learnings on addressing mode use will be yesterday's news.
The arithmetic instructions, e.g. addition or multiplication, do not encode a field for where to store the flags, so they use, like you said, an implicit destination, which is still different for integer and floating-point.
In large out-of-order CPUs, with flag register renaming, this is no longer so important, but in 1990, when POWER was introduced, the multiple sets of flags were a great advance, because they enabled the parallel execution of many instructions even in CPUs much simpler than today.
Besides POWER, the 64-bit ARMv8 also provides most of the 14 predicates that exist for a partial order relation. For some weird reason, the IEEE FP standard requires only 12 of the 14 predicates, so ARM implemented just those 12, even if they have 14 encodings, by using duplicate encodings for a pair of predicates.
I consider this stupid, because there would not have been any additional cost to gate correctly the missing predicate pair, even if it is indeed one that is only seldom needed (distinguishing between less-or-greater and equal-or-unordered).
It doesn't have to be _that_ bad. As long as condition flags are all written at once (or are essentially banked, like PowerPC's), the dependency issue can go away, because they're renamed and their results aren't dependent on previous data.
Now, of course, instructions that only update some condition flags and preserve others are the devil.
> all those extra instructions to compute carry will blow the I$ faster
I think the idea is, as others have mentioned, that the add/comp instructions are fused internally to a single instruction, so probably it's not as bad for the I$ as we might think?

Is it actually implemented on any hardware?
It all seems hypothetical to me now; fast cores would fuse the instructions together, so instruction count alone isn't adequate for the original evaluation of the ISA. Now I'm not sure that there are any that really do that...
When you hear the "<person / group> could make a better <implementation> in <short time period>" - call them out. Do it. The world will not shun a better open license ISA. We even have some pretty awesome FPGA boards these days that would allow you to prototype your own ISA at home.
In terms of the market - now is an exceptionally great time to go back to the design room. It's not as if anybody will be manufacturing much during the next year, with all of the fabs unable to make enough existing chips to meet demand. There is a window of opportunity here.
> It is, more-or-less a watered down version of the 30 year old Alpha ISA after all. (Alpha made sense at its time, with the transistor budget available at the time.)
As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.
I also really like the Unix philosophy of doing one simple thing well. Sure, it could have some special instruction that does exactly your use case in one cycle using all the registers, but that's not what has created such advances in general purpose computing.
> Sure, it is "clean" but just to make it clean, there was no reason to be naive.
I would much rather we build upon a conceptually clean instruction set, rather than trying to cobble together hacks on top of fundamentally flawed designs - even at the cost of performance. It's exactly these cobbled-together conceptual hacks that have led to the likes of the Spectre and Meltdown vulnerabilities, when instruction sets become so complicated that they cannot be easily tested.
But the author making an argument like that...
> I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project.
Pretty much blew their credibility. It's obviously wrong, and a sensible, fair person wouldn't write it.
When you use a CPU architecture you don't just get an ISA.
You also get compilers and debuggers. Ready-to-run Linux images. JIT compilers for JavaScript and Java. Debian repos and Python wheels with binaries.
And you get CPUs with all the most complex features. Instruction re-ordering, branch prediction, multiple cores, multi-level caches, dynamic frequency and voltage control. You want an onboard GPU, with hardware 4k h264 encoding and decoding? No problem.
And you get a wealth of community knowledge - there are forum posts and StackOverflow questions where people might have encountered your problems before. If you're hiring, there are loads of engineers who've done a bit of stuff with that architecture before. And of course vendors actually making the silicon!
I've seen ISAs documented with a single sheet of A4 paper. The difficult part in having a successful CPU architecture is all the other stuff :)
How about some 32 way SMT GPUs... No more divergence!
That allows more flexibility for CPU designs to optimize transistor count vs speed vs energy consumption.
This guy clearly did not look at the stated rationale for the design decisions of RISC-V.
Beyond that, compressed instructions are not a 1:1 substitute for more complex instructions, because a pair of compressed instructions cannot have any fields that cross the 16-bit boundary. This means you can't recover things like larger load/store offsets.
Additionally, you can't discard architectural state changes due to the first instruction. If you want to fuse an address computation with a load, you still have to write the new address to the register destination of the address computation. If you want to perform clever fusion for carry propagation, you still have to perform all of the GPR writes. This is work that a more complex instruction simply wouldn't have to perform, and again it complicates a high performance implementation.
They spent a lot of time and effort on making sure the decoding is good and useful for high-performance implementations.
RISC-V is designed for very small and very large systems. At some point some tradeoffs need to be made, but these are very reasonable and most of the time not a huge problem.
For the really specialized cases where you simply can't live with those extra instructions, those will be added to the standard, and then some profiles will include them and others won't. If those instructions are really as vital as those who want them claim, they will find their way into many profiles.
Saying RISC-V is 'terrible' because of those choices is not a fair way of evaluating it.
That's exactly the problem --- there is no one-size-fits-all when it comes to instruction set design.
Besides that, you raise good points on sources of complexity. I’m waiting for the benchmarks once such developments have been incorporated. Everything else is guesswork.
More difficult than x86? We're talking about a damn simple variable width decoding here.
I could imagine RISC-V with C extension being more tricky than 64-bit ARM. Maybe.
> and again it complicates a high performance implementation.
But so much of the rationale behind the design of RISC-V is to simplify high performance implementation in other ways. So the big question is what the net effect is.
The other big question is if extensions will be added to optimise for desktop/server workloads by the time RISC-V CPUs penetrate that market significantly.
Of course you discard architectural state changes in fusion. If I have a bunch of instructions which end up reading from memory into register x10, then I can fuse with all previous instructions which wrote into x10, as their results get clobbered anyway.
Disclaimer: I may have misunderstood the point you made. However you don’t seem to make it clear how fusion is bad for performance.
What performance tricks are you giving up by doing fusion?
> I have heard that Risc V proponents say that these problems are known and could be fixed by having the hardware fuse dependent instructions. Perhaps that could lessen the instruction set shortcomings, but will it fix the 3x worse performance for cases like the one outlined here?
Macro-fusion can to some extent offset the weak instruction set, but you're never going to get a multiple-factor speedup out of it, given the complexity of the inter-op architectural state changes that have to be preserved and the instruction-boundary limitations involved; it's never going to offset a 3x blowup in instruction count in a tight loop.
Also, it's said that x86 is bad because the instructions are then reorganized and translated inside the CPU. But it seems that you are proposing the same: the CPU preprocesses the instructions and fuses some into a single one (the opposite of what x86 does). At that point, it seems to me that what x86 does makes more sense: have a ton of instructions (and thus smaller programs, and thus more code that can fit in cache) and split them, rather than having a ton of instructions (and wasting cache space) only for the CPU to combine them into a single one (a thing that a compiler could also do).
Anyway, what you gain from this is a very simple ISA, which helps tool writers and those who implement hardware, as well as academia, for teaching and research.
How does the insanely complex x86 instructions help anyone?
Also, don't reason with the desktop or server use case in mind, where you have TB of disk and code size doesn't matter. RISC-V is meant to be used also for embedded systems (in fact, its use nowadays is mostly in these systems), where code size usually matters more than performance (i.e. you typically compile with -Os). In these situations more instructions means more flash space wasted, meaning you can fit less code.
RISC-V has a number of places it's employed where it makes an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor, or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for the pedagogical purpose. For a researcher trying to experiment with better branch prediction techniques, having a standard high-ish performance open source design they can take and modify with their ideas is immensely helpful. And many companies in the real world with their eyes on the bottom line like having an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need.
I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that and RISC-V is already a success in some of them.
Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy, maybe he's some sort of ISA expert. But it strains credulity that after all this time and work put into it, RISC-V is a "terrible architecture".
My expectation here is that RISC-V requires some inefficient instruction sequences in some corners somewhere (and one of these corners happens to be OP's pet use case), but by and large things are fine.
And even then, I don't think that's clear. You're not going to determine performance just by looking at a stream of instructions on modern CPUs. Hell, it's really hard to compare streams of instructions from different ISAs.
Seems quite balanced with all the other replies here which claim it's the best architecture ever whenever anyone says anything about it.
I don't think its vector extensions would be good for video codecs because they seem designed around large vectors. (and the article the designers wrote about it was quite insulting to regular SIMD)
RISC-V is pretty good. Probably slightly better for some things than ARM, and slightly worse for others. It's open, which is awesome, and the instruction set lends itself to extensions which is nice (but possibly risks the ecosystem fragmenting). Building really high performance RISC-V designs looks like it's going to rely on slightly smarter instruction decoders than we've seen in the past for RISCs, but it doesn't look insurmountable.
Bad? Quite possible, it was meant as a teaching ISA initially IIRC, but terrible? Who knows.
If you look at the early history of RISC-V, it does indeed look like as something built for teaching. But I don't think that use case warrants all the hype around it.
So how did all the hype form, and why is it that there are people seemingly hyping it as the next-gen dream-come-true super elegant open developed-with-hindsight ISA that will eventually displace crufty old x86 and proprietary ARM while offering better performance and better everything? Of course that just baits you into arguing about its potential performance. And don't worry if it doesn't have all the instructions you need for performance yet, we'll just slap it with another extension and it totally won't turn into a clusterfuck with a stench of legacy and numerous attempts at fixing it (coz' remember, hindsight)!
And then if you question its potential, you'll get someone else arguing that no no, it's not a high performance ISA for general use in desktops / servers, it's just an extensible ISA that companies can customize for their special sauce microcontrollers or whatever.
Of course it's all armchair speculation because there are no high performance real world implementations and there aren't enough experts you can trust.
    typedef __int128_t int128_t;

    int128_t add(int128_t left, int128_t right)
    {
        return left + right;
    }

GCC 10, -O2, RISC-V:

    add(__int128, __int128):
        mv   a5,a0
        add  a0,a0,a2
        sltu a5,a0,a5
        add  a1,a1,a3
        add  a1,a5,a1
        ret

ARM64:

    add(__int128, __int128):
        adds x0, x0, x2
        adc  x1, x1, x3
        ret
This issue hurts the wider types that are compiler built-ins. Even though C has a programming model that is devoid of any carry-flag concept, canned types like a 128-bit integer can take advantage of it.
Portable C code to simulate a 128-bit integer will probably emit bad code across the board. The code will explicitly calculate the carry as an additional operand and pull it into the result. The RISC-V won't look any worse, then, in all likelihood.
(The above RISC-V instruction set sequence is shorter than the mailing list post author's 7 line sequence because it doesn't calculate a carry out: the result is truncated. You'd need a carry out to continue a wider addition.)
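A minimal sketch of that portable simulation in C (illustrative names, low and high words only), showing the carry computed as an explicit extra operand:

```c
#include <stdint.h>

/* Portable 128-bit addition without compiler __int128 support: the
   carry out of the low word is recovered by comparison, which is the
   same extra work the flag-less instruction sequences perform. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry-out of the low add */
    return r;
}
```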
2 instructions to work with 64 bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to a loop, and 1 more instruction for a loop counter of some kind?
So we're looking at ~4 instructions for 64-bits on ARM/x86, but ~9-instructions on RISC-V.
The loop will be performed in parallel in practice however due to Out-of-order / superscalar execution, so the discussion inside the post (2 instruction on x86 vs 7-instructions on RISC-V) probably is the closest to the truth.
----------
Question: is ~2 clock ticks per 64 bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMD. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V would probably be on roughly equal footing there.
I don't know exactly how to write a bignum addition loop in AVX off the top of my head. But I'd assume it'd be similar to the 7-instructions listed here, except... using 256-bit AVX-registers or 512-bit AVX512 registers.
So 7-instructions to perform 512-bits of bignum addition is 73-bits-per-clock cycle, far superior in speed to the 32-bits-per-clock cycle from add + adc (the 64-bit code with implicit condition codes).
AVX512 is uncommon, but AVX (256-bit) is common on x86 at least: leading to ~36-bits-per-clock tick.
----------
ARM has SVE, which is ambiguous (sometimes 128-bits, sometimes 512-bits). RISC-V has a bunch of competing vector instructions.
..........
Ultimately, I'm not convinced that the add + adc methodology here is best anymore for bignums. With a wide-enough vector, it seems more important to bring forth big 256-bit or 512-bit vector instructions for this use case?
EDIT: How many bits is the typical bignum? I think add+adc probably is best for 128, 256, or maybe even 512-bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without me writing code, but just a hunch).
2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA, but instead multiplication (and division which is based on multiplication).
There is only one standard V extension. Alibaba made a chip with a prerelease version of that V extension which is thus incompatible with the final version, but in practice that just means that the vector unit on that chip is not used because it is incompatible, not that there are now competing standards
add+adc should still be 64 bits per cycle. adc doesn't just add the carry bit, it's an add instruction which includes the usual operands, plus the carry bit from the previous add or adc.
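As a C model of those semantics (this sketches the architectural behavior of an add-with-carry, not any particular encoding):

```c
#include <stdint.h>

/* Model of an add-with-carry instruction: a full-width add of both
   operands plus the carry-in. Two comparisons recover the carry-out,
   because the "+ cin" step can itself wrap. */
static uint64_t adc64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout) {
    uint64_t s = a + b;
    unsigned c1 = s < a;   /* carry out of a + b */
    uint64_t r = s + cin;
    unsigned c2 = r < s;   /* carry out of adding the carry-in */
    *cout = c1 | c2;
    return r;
}
```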
Which is why I'm sure add / adc will still win at 128-bits, or 256-bits.
The main issue is that the vector-add instructions are missing carry-out entirely, so recreating the carry will be expensive. But with a big enough number, that carry propagation is parallelizable in log2(n), so a big enough bignum (like maybe 1024-bits) will probably be more efficient for SIMD.
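One way to see why the ripple is parallelizable: pack per-limb "generate" and "propagate" bits into a mask and resolve every carry with one word-sized add, the same structure a vector implementation would use. A scalar sketch of my own (limited to 64 limbs, final carry-out dropped):

```c
#include <stdint.h>
#include <stddef.h>

/* Branchless bignum add: compute all limb sums first, then resolve
   all carries at once via mask arithmetic instead of a serial ripple. */
static void bignum_add(uint64_t *r, const uint64_t *a,
                       const uint64_t *b, size_t n) {
    uint64_t g = 0, p = 0;                 /* requires n <= 64 */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        g |= (uint64_t)(s < a[i]) << i;    /* limb generates a carry */
        p |= (uint64_t)(s == ~0ULL) << i;  /* limb propagates a carry */
        r[i] = s;
    }
    uint64_t gs = g << 1;                  /* carries enter the next limb */
    uint64_t c = ((gs + p) ^ p) | gs;      /* resolve the ripple in one add */
    for (size_t i = 0; i < n; i++)
        r[i] += (c >> i) & 1;
}
```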
MIPS didn't have a flag register either and depended on a dedicated zero register and slt instructions (set if less than)
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
MIPS is classical RISC design that was not designed to be OoO-friendly at all and is simply designed for ease of straightforward pipelined implementation. The reason why it does not have flags probably simply comes down to the observation that you don't need flags for C.
Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything...
What sticks in my mind from my limited exposure to SuperH is that there's no load immediate instruction, so you have to do a PC-relative load instead. It was clearly optimized for compiled rather than handwritten code!
SuperH has a mov #imm, Rx that can take an 8-bit #imm. But you're right, literal pools were used just like on ARM.
Things I liked about SuperH: 16 bit fixed-width insn format (except for some SH2A and DSP ops), T flag for bit manipulation ops, GBR to enable scaled loads with offset, xtrct instruction, single-cycle division insns (div0, div1), MAC insns.
In terms of code density SH was quite effective, see here http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_dens... or here http://www.deater.net/weave/vmwprod/asm/ll/ll.html
Not having anything that stands out is perhaps a good thing. Being "clever" with the ISA tends to bite you when implementing OoO superscalar cores.
You can detect carry of (a+b) in C branch-free with:

    ((a & b) | ((a | b) & ~(a + b))) >> 31

So 64-bit add in C is:

    f_low  = a_low + b_low
    c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31
    f_high = a_high + b_high + c_high
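As a compilable form of the same formula (32-bit words, matching the >> 31 above):

```c
#include <stdint.h>

/* Branch-free carry-out of a 32-bit add: the top bit carries out iff
   both operands' top bits are set, or one is set and the sum's is clear. */
static uint32_t carry_out32(uint32_t a, uint32_t b) {
    uint32_t s = a + b;
    return ((a & b) | ((a | b) & ~s)) >> 31;
}
```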
So for RISC-V (gcc 8.2.0 with -O2 -S -c) I get:

    add  a1,a3,a2
    or   a5,a3,a2
    not  a7,a1
    and  a5,a5,a7
    and  a3,a3,a2
    or   a5,a5,a3
    srli a5,a5,31
    add  a4,a4,a6
    add  a4,a4,a5
But for ARM I get (with gcc 9.3.1):

    add ip, r2, r1
    orr r3, r2, r1
    and r1, r1, r2
    bic r3, r3, ip
    orr r3, r3, r1
    lsr r3, r3, #31
    add r2, r2, lr
    add r2, r2, r3

It's shorter because ARM has bic. Neither one figures out how to use carry-related instructions.

Ah! But! There is a gcc builtin, __builtin_uadd_overflow(), that replaces the first two C lines above:

    c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
So with this, RISC-V:

    add  a3,a4,a3
    sltu a4,a3,a4
    add  a5,a5,a2
    add  a5,a5,a4

ARM:

    adds  r2, r3, r2
    movcs r1, #1
    movcc r1, #0
    add   r3, r3, ip
    add   r3, r3, r1

RISC-V is faster.

EDIT: clang has one better: __builtin_addc().
    f_low  = __builtin_addcl(a_low, b_low, 0, &c);
    f_high = __builtin_addcl(a_high, b_high, c, &junk);

x86:

    addl 8(%rdi), %eax
    adcl 4(%rdi), %ecx

ARM:

    adds w8, w8, w10
    add  w9, w11, w9
    cinc w9, w9, hs

RISC-V:

    add  a1, a4, a5
    add  a6, a2, a3
    sltu a2, a2, a3
    add  a6, a6, a2

I find it funny that you fall into the same pitfall as the author did.
Faster on which CPU?
The author doesn't measure on any CPU, so here there are dozens of people hypothesizing whether fusion happens or not, and what the impact is.
Counting number of instructions isn't really a good metric for that either.
Perhaps faster means fewer instructions in this instance? Considering number of instructions is what has been discussed.
Same for code size. If the instructions are half the size, having 1.5x more instructions still means smaller binaries.
In addition to the actual ALU instructions doing the add with carry, for bignums it's important to include the load and store instructions. Even in L1 cache it's typically 2 or 3 or 4 cycles to do the load, which makes one or two extra instructions for the arithmetic less important. Once you get to bignums large enough to stream from RAM (e.g. calculating pi to a few billion digits) it's completely irrelevant.
This especially applies to potentially controversial things.
Overall, I feel HN is most fun when a lot of people are in disagreement but also operating in good faith.
But I agree that this bit of writing comes across as a bit overly assertive and arrogant; and probably trivially proved wrong by actually running some benchmarks.
By the same reasoning, the Apple M1 would obviously be slower than anything Intel and AMD produce given similar energy and transistor density constraints (i.e. same class of hardware). Except that obviously isn't the case and we have the Macbook air with the M1 more than holding up against much more expensive Intel/AMD chips. Reason: chips don't actually work like this person seems to assume. The whole article is a sandcastle of bad assumptions leading up to an arrogantly worded & wrong conclusion.
You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.
Many people still think that RISC-V implies an open source implementation, for example.
The minimum duration of the clock cycle of a modern CPU is essentially determined by the duration of a 64-bit integer addition/subtraction, because such operations need a latency of only 1 clock cycle to be useful.
Operations that are more complex than 64-bit integer addition/subtraction, e.g. integer multiplications or floating-point operations, need multiple cycles, but they are pipelined so that their throughput remains at 1 per cycle.
So 64-bit addition/subtraction is certainly expected to be included in any RISC ISA.
The hardware adders used for addition/subtraction provide, at a negligible additional cost, 2 extra bits, carry and overflow, which are needed for operations with large integers and for safe operations with 64-bit integers.
The problem is that the RISC-V ISA does not offer access to those 2 bits and generating them in software requires a very large cost in execution time and in lost energy in comparison with generating them in hardware.
I do not see any relationship between these bits and the RISC concepts, omitting them does not simplify the hardware, but it makes the software more complex and inefficient.
Edit: Another place you see this kind of arithmetic is crypto, but those specific use cases (Diffie-Hellman, RSA, a few others) don't tend to be vectorized. You have one op you're trying to work through with large integers, and there's the carry dependency on each partial op. The carry-dependent crypto algorithms aren't typically vectorizable.
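To make the cost of the missing overflow bit concrete, here is roughly what a checked signed 64-bit add looks like in software (a sketch of the generic sign-comparison technique, not any particular compiler's output):

```c
#include <stdint.h>
#include <stdbool.h>

/* Checked signed addition without a hardware overflow flag: the
   condition must be reconstructed from operand and result signs.
   With an overflow bit this would be the add plus one flag test. */
static bool add64_overflows(int64_t a, int64_t b, int64_t *r) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b, us = ua + ub;
    *r = (int64_t)us;
    /* overflow iff the operands agree in sign and the sum does not */
    return (~(ua ^ ub) & (ua ^ us)) >> 63;
}
```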
My code snippet results in bloated code for RISC-V RV64I.
I'm not sure how bloated it is. All of those instructions will compress [1].

[1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse...
It's slower on RISC-V, but not by a lot on a superscalar. The x86 and ARMv8 snippets have 2 cycles of latency. The RISC-V one has 4 cycles of latency.

    1. add t0, a4, a6     add t1, a5, a7
    2. sltu t6, t0, a4    sltu t2, t1, a5
    3. add t4, t1, t6     sltu t3, t4, t1
    4. add t6, t2, t3

I'm not getting "terrible" from this.

On the other hand, I take this article with a grain of salt anyhow, since it only discusses a single example. I think we would need a lot more optimized assembly snippet comparisons to make meaningful conclusions (and even then there could be author selection bias).
>"here's this snippet, it takes more instructions on RISC-V, thus RISC-V bad"
Is pretty much what it's saying. An actual argument about ISA design would weight the cost this has with the advantages of not having flags, provide a body of evidence and draw conclusions from it. But, of course, that would be much harder to do.
What's comparatively easy and they should have done, however, is to read the ISA specification. Alongside the decisions that were made, there's a rationale to support it. Most of these choices, particularly so the ones often quoted in FUD as controversial or bad, have a wealth of papers, backed by plentiful evidence, behind them.
For those who are more versed: is this really a general problem?
I was under the impression that the real bottleneck is memory, that things like this would be fixed in real applications through out-of-order execution, and that it paid off having simpler instructions because compilers had more freedom to rearrange things.
Is that even a fair comparison given the arm and x86 versions used as examples of "better" were 64 bit?
If we're really comparing 32 and 64 and complaining that 32 bit uses more instructions than 64, perhaps we should dig out the 4 bit processors and really sharpen the pitchforks. Alternatively, we could simply not. Comparing apples to oranges doesn't really help.
From the article:
Let's look at some examples of how Risc V underperforms.
First, addition of a double-word integer with carry-out:

    add  t0, a4, a6   // add low words
    sltu t6, t0, a4   // compute carry-out from low add
    add  t1, a5, a7   // add hi words
    sltu t2, t1, a5   // compute carry-out from high add
    add  t4, t1, t6   // add carry to low result
    sltu t3, t4, t1   // compute carry out from the carry add
    add  t6, t2, t3   // combine carries

Same for 64-bit arm:

    adds x12, x6, x10
    adcs x13, x7, x11

Same for 64-bit x86:

    add %r8, %rax
    adc %r9, %rdx
You should take into account that the libgmp authors have a huge amount of experience in implementing operations with large integers on a very large number of CPU architectures, i.e. on all architectures supported by gcc, and for most of those architectures libgmp has been the fastest during many years, or it still is the fastest.
"I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project"
Utter horse manure.
Perhaps something similar is needed within ISAs / CPUs ? Say an OS kernel, a ZIP-algorithm, Mandelbrot, Fizz-buzz ... could measure code compactness but also performance and energy usage.
Everything should be written in C, or some scripting language implemented in C. Writing safe code is easy, just wrap everything in layers of macros that the compiler will magically optimize away, and if it doesn't, computers are fast enough anyway, right? The mark of a real programmer is that every one of their source files includes megabytes of headers defining things like __GNU__EXTENSION_FOO_BAR_F__UNDERSCORE_.
You say your processor has a single instruction to do some extremely common operation, and want to use it? You shouldn't even be reading a processor manual unless you are working on one of the two approved compilers, preferably GCC! If you are very lucky, those compiler people that are so much smarter than you could hope to be, have already implemented some clever transformation that recognizes the specific kind of expression produced by a set of deeply nested macros, and turns them into that single instruction. In the process, it will helpfully remove null pointer checks because you are relying on undefined behaviour somewhere else.
You say you'll do it in assembly? For Kernighan's sake, think about portability!!! I mean, portable to any other system that more or less looks the same as UNIX, with a generous sprinkling of #ifdefs and a configure script that takes minutes to run.
Implement a better language? Sure, as long as the compiler is written in C, preferably outputs C source code (that is then run through GCC), and the output binary must of course link against the system's C library. You can't do it any other way, and every proper UNIX - BSD or Mac OS X - will make it literally impossible by preventing syscalls from any other piece of code.
IMO this is like a cultural virus that seems to have infected everything IT-related, and I don't exactly understand why. Sure, having all these layers of cruft down below lets us build the next web app faster, but isn't it normal to want to fix things? Do some people actually get a sense of satisfaction out of saying "It is a solved problem, don't reinvent the wheel"? Or do they want to think that their knowledge of UNIX and C intricacies is somehow the most important, fundamental thing in computer science?
Isn't this the classic RISC vs CISC problem?
Comparing x86/ARM to RISC-V feels like Apples to Grains of Rice.
If RISC-V was born out of a need for an open source embedded ISA, would the ISA not need to remain very RISC-like to accommodate implementations with fewer available transistors... Or is this an outdated assumption?
Maybe SISC - "Simplified" instruction set computing, perhaps. ARM isn't exactly super complicated in this particular aspect (it is elsewhere), but in this case the designers basically chose to make branches simpler at the expense of code that needs to check overflows (or flags more generally)
RISC-V was born partly out of a desire for a teaching ISA, also, so simplicity is a boon in that context too.
Whether the similar awkwardness applies to a lot of other code or not is not being told by this isolated case.
Moderators where are you?
I'm not a fan of the RISC-V design but the presence or absence of this instruction doesn't make it a terrible architecture.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Not wasting much sleep on this one. Not sure there's anything in the spec that stops implementations from recognizing the two instructions and fusing them into a single atomic operation for the backends to deal with. It'll occupy more space in the L1 cache, but that's it.
It does not matter much, because there is a sequence of dependent instructions, which cannot be executed in parallel, regardless which is the maximum IPC of a RISC-V CPU.
The opinions from those messages matter, because they belong to experts in implementing operations with large integers on a lot of different CPU architectures, with high performance proven during decades of ubiquitous use of their code. They certainly have a better track record than any RISC-V designer.
It doesn't matter how great something else could be in theory if it doesn't exist or doesn't meet the same scale and mindshare (or adoption).