What happens if we explicitly architect a RISC-V CPU to execute RVC instructions, and "mop up" any RV32I instructions that aren't convenient via a microcode layer? What architectural optimizations are unlocked as a result?
"Minimax" is an experimental RISC-V implementation intended to establish whether an RVC-optimized CPU is, in practice, any simpler than an ordinary RV32I core with a pre-decoder. While it passes a modest test suite, you should not use it without caution. (There are a large number of excellent, open source, "little" RISC-V implementations you should probably reach for first.)
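For contrast, the more common arrangement is the reverse: a pre-decoder expands each 16-bit RVC instruction into its RV32I equivalent before the core sees it. Below is a minimal Python sketch of that expansion for a single instruction, C.ADDI. The function names are made up for illustration and this is not taken from the Minimax source; the bit positions follow the standard RVC encoding.

```python
def is_compressed(parcel: int) -> bool:
    """A 16-bit parcel is an RVC instruction unless its low two bits are 0b11."""
    return (parcel & 0b11) != 0b11


def expand_c_addi(parcel: int) -> int:
    """Expand C.ADDI rd, imm into the 32-bit encoding of ADDI rd, rd, imm.

    Hypothetical helper for illustration; a real pre-decoder covers every
    RVC opcode, not just this one.
    """
    quadrant = parcel & 0b11
    funct3 = (parcel >> 13) & 0b111
    assert quadrant == 0b01 and funct3 == 0b000, "not a C.ADDI encoding"

    rd = (parcel >> 7) & 0x1F
    # imm[4:0] lives in bits 6:2, imm[5] (the sign bit) in bit 12.
    imm = ((parcel >> 2) & 0x1F) | (((parcel >> 12) & 1) << 5)
    if imm & 0x20:
        imm -= 0x40  # sign-extend the 6-bit immediate

    # I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode (OP-IMM = 0x13)
    return ((imm & 0xFFF) << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0x13


if __name__ == "__main__":
    # c.addi a0, 1 expands to addi a0, a0, 1
    print(hex(expand_c_addi(0x0505)))
```

Every RVC instruction maps onto a base-ISA instruction this way, which is why a pre-decoder in front of an RV32I core is the usual approach.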
Will the execute stage pipeline effectively to reach a higher f_max? (Of course there will be a small logic penalty, and a larger FF penalty, but the core is small enough that it would probably be tolerable.) Or is the core's whole architecture predicated on a two-stage design?
With that in mind: the single instruction-per-clock design is for simplicity's sake, not performance's sake. If the execution stage were pipelined, it'd be a different core. If performance is the goal, I'd start by ripping out some of the details that distinguish this core from other (excellent) RISC-V cores.
> (That specific clock frequency on that specific part carries heavy hints about what this core is intended for.)
Fun clue! Looks like the Xilinx KU060 is a rad-hard FPGA for space applications. Does anyone know what 200 MHz might imply? Comms maybe?
I picked probably the worst time imaginable to get into FPGAs. All of my "higher" end stuff is repurposed mining hardware...
And the thought of handwriting one in RISC-V assembly convinced me that maybe RISC-V wasn't as "retro friendly" as I would have liked.
The retrocomputing scene looks like a ton of fun and I'd be delighted if any of my work is used there.
These days I'd probably consider forking my friend Randy's C64 VICII implementation (VIC-II Kawari) and just expand framebuffer size, sprites, colours, etc, since he put so much work into it.
It was a lot of fun, but I got stalled on the SD card interface. That was more complexity than I felt like dealing with at that point. And I was working at Google at the time, so they owned all my thoughts and deeds, and going through the open sourcing process for it would have been a hassle. If I wasn't hunting for work and needing to make $$ right now, I'd pick it up again maybe? It was more of a Verilog learning process.
At a certain scale, it's conventional for hardware designs to become complex enough that it's necessary to structure them in hierarchies, just to maintain control. This design is small enough that none of the extra structure is essential.
It's possible to be incredibly expressive in Verilog and VHDL. This implementation is written in VHDL, which has an outdated reputation for being long-winded.
Also worth a look: FemtoRV32 Quark [0], which is written in Verilog.
[0]: https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/...
Building a custom CPU commits you to writing an assembler and listing generator - which is a good hobby-project job for one person who's handy with Python. After stumbling through those foothills, though, I found myself at the base of some very steep, scary GCC/binutils cliffs wondering how I could have gotten so lost, so far from home.
Even if all RISC-V does is offer a bunch of arbitrary answers to arbitrary design questions, I consider it a massive win.
This is just that. But nothing more. For example, it does not handle any RISC-V CSRs, even the most basic ones. But that's OK: for "computational" machine code kernels that aren't fancy (i.e. basic ALUs get lit up but nothing fancier), you can use software toolchains like GCC to emit compatible code.
A "real" toy CPU, i.e. one that won't win awards but can boot something like Zephyr OS or maybe a miniature OS with some form of memory protection, will require many more lines: for proper exception handling, for that memory protection, for timers and peripherals, and for extra CPU features (atomics, debug interface, whatever). A comparable CPU for this might be something like PicoRV32, which fits in at about 2,000 lines of Verilog.
But that's a lot of stuff. Sometimes all you need is a programmable state machine. And with this you can run (limited) normal C programs on it with a supported compiler on a 32-bit machine.
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
It looks like the VHDL source is about instruction decoding, registers, etc., but does not include things like ALU logic. (I don't know VHDL actually.)
But it's true that you don't have to describe the ALU down to the bit level: thanks to those two lines you can say "q <= d1 + d2" instead of having to build an adder at the gate level. (Though you can, of course, do that if you really want to!)
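To make the "gate level" alternative concrete, here is a rough Python model of what a single "+" stands in for: a ripple-carry adder built from one-bit full adders. It illustrates the idea only. It is not how the synthesis tool implements the operator (FPGAs map it onto dedicated carry chains), and the function names are invented for this sketch.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder expressed with XOR/AND/OR gates only."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(d1: int, d2: int, width: int = 32) -> int:
    """Add two unsigned values bit by bit, discarding the final carry."""
    q = 0
    carry = 0
    for i in range(width):
        s, carry = full_adder((d1 >> i) & 1, (d2 >> i) & 1, carry)
        q |= s << i
    return q


if __name__ == "__main__":
    print(ripple_add(1234, 5678))
```

The result wraps modulo 2^width, which matches what you get from adding two equal-width `unsigned` values in numeric_std.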
OP's CPU takes up around 400 LUTs. Since a 2:1 mux takes up 1 LUT (although it seems the numbers are for a LUT6-based device, which can take a 4:1 mux, so maybe that can make the amount a bit lower?), you would add 160 LUTs. That's quite a lot.
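The 160 figure presumably comes from a logarithmic barrel shifter: a 32-bit shift needs 5 stages (shift by 1, 2, 4, 8, 16), each a row of 32 two-input muxes. The Python sketch below models that structure and counts the muxes. It only handles left shifts for brevity, and the names are hypothetical.

```python
def mux2(sel: int, a: int, b: int) -> int:
    """2:1 mux: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def barrel_shift_left(value: int, shamt: int, width: int = 32) -> tuple[int, int]:
    """Shift left through log2(width) mux stages; returns (result, mux_count)."""
    bits = [(value >> i) & 1 for i in range(width)]
    muxes = 0
    stage = 0
    while (1 << stage) < width:
        dist = 1 << stage
        sel = (shamt >> stage) & 1
        # Each stage either passes the word through or shifts it by `dist`.
        shifted = [bits[i - dist] if i >= dist else 0 for i in range(width)]
        bits = [mux2(sel, bits[i], shifted[i]) for i in range(width)]
        muxes += width
        stage += 1
    result = sum(bit << i for i, bit in enumerate(bits))
    return result, muxes


if __name__ == "__main__":
    print(barrel_shift_left(1, 5))
```

If a LUT6 absorbs a 4:1 mux, two stages can be merged into one level, so the same shifter would need roughly 3 levels of 32 LUTs rather than 5. A full RV32I shifter also has to handle logical and arithmetic right shifts, which this sketch leaves out.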
Unlike adders, a barrel shifter does not generalize well enough to be implemented as a hard block in its own right.
The bus is up to you... should you want an 8-bit data bus and a 16-bit address bus, I don't think the spec cares.
This is akin to 68020 (32bit ISA) vs 68000 (still 32bit ISA) or 68008 (still 32bit ISA).
A narrower data bus would allow a 2-cycle execution path, and would likely split the longest combinatorial path in the current design (which certainly goes through the adder tree). This could be either a 0.5 instruction-per-clock (IPC) design, or a pipelined design that maintains 1 IPC at the expense of extra pipeline hazards and corresponding bubbles.
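As a rough illustration of the 2-cycle idea, here is a Python model of a 32-bit add performed on a 16-bit datapath: the low halves are added in the first phase, the carry is held over, and the high halves are added in the second. This is a sketch under that assumption only, not a description of how Minimax or any other core is built, and the function name is invented.

```python
MASK16 = 0xFFFF


def add32_two_phase(a: int, b: int) -> int:
    """32-bit add using a 16-bit adder twice, with the carry saved in between."""
    # Phase 1: low halves. The carry-out would be stored in a flip-flop.
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16

    # Phase 2: high halves plus the saved carry. The final carry is discarded.
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry

    return ((hi & MASK16) << 16) | (lo & MASK16)


if __name__ == "__main__":
    print(hex(add32_two_phase(0x0001FFFF, 0x00000001)))
```

The adder's carry chain is half as long, but the saved carry and the phase tracking are extra state that a single-cycle 32-bit datapath does not need.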
A narrower address bus seems like it's only helpful as a knock-on to a split data bus.
Gut feeling: I doubt that splitting the data or address buses into additional phases would actually save resources. You would certainly need more flip-flops to maintain state, and more LUTs to manage combinational paths across the two execution stages. While you can sometimes add complexity and "win back" gates, it's an approach with limits. If you compare SERV's resource usage to FemtoRV32-Quark's, it's notable how much additional state (flip-flops) SERV "spends" to reduce its combinatorial logic (LUT) footprint.
Doesn’t it make this… an IISC? Increased instruction set? Asking for a friend
These four points all have direct benefits for hardware design. And compressed ISAs like RVC and Thumb check them all.
On the contrary, "fewer instruction types" and "orthogonal instructions" never had any real benefit beyond perceptual aesthetics, so as a result they are long abandoned.