Show HN: Minimax – A Compressed-First, Microcoded RISC-V CPU | Better HN


thrtythreeforty · 3y ago · 6 in thread

This is very impressive, especially the performance per LUT! Did I overlook frequency spec on a given target or did you not specify?

Will the execute stage pipeline effectively to reach higher f_max? (Of course there will be a small logic penalty, and a larger FF penalty, but the core is small enough that it would probably be tolerable.) Or is the core's whole architecture predicated on a two stage design?

gsmecher (OP) · 3y ago

This core is targeted at "smaller-is-better" applications with few actual instruction-throughput requirements. If it reaches 200 MHz on a Xilinx KU060, I will be delighted. (That specific clock frequency on that specific part carries heavy hints about what this core is intended for.)

With that in mind: the single instruction-per-clock design is for simplicity's sake, not performance's sake. If the execution stage were pipelined, it'd be a different core. If performance is the goal, I'd start by ripping out some of the details that distinguish this core from other (excellent) RISC-V cores.

thatcherc · 3y ago

> 200 MHz on a Xilinx KU060

> (That specific clock frequency on that specific part carries heavy hints about what this core is intended for.)

Fun clue! Looks like the Xilinx KU060 is a rad-hard FPGA for space applications. Does anyone know what 200 MHz might imply? Comms maybe?

varispeed · 3y ago

KU060 costs a nice sum of £4,529.10 on Mouser (out of stock of course)

A fully space-qualified version is something like $150k.

Teknoman117 · 3y ago

> out of stock of course

I picked probably the worst time imaginable to get into FPGAs. All of my "higher" end stuff is repurposed mining hardware...

gaudat · 3y ago

Poor man's Tile64?

cmrdporcupine · 3y ago · 6 in thread

This is very nice. A couple years ago I was playing around with a hobby project I was dubbing "Retro-V" which was to be a RISC-V core tied to a 1980s-style display processor and keyboard/mouse input on a small FPGA and 512k or 1MB or so of SRAM. I was using PicoRV32 for that, but this would have been far better.

gsmecher (OP) · 3y ago

PicoRV32 and FemtoRV32 are both excellent, conventional RISC-V implementations, and are more complete and proven than Minimax. Relative to the size of any 7-series or newer Xilinx FPGA, the difference in LUT cost between any of the three is pretty minor. I think you made a perfectly defensible decision. (I love me some SERV, too, and if you are willing to spend orthodoxy to save gates, it's an excellent choice too.)

cmrdporcupine · 3y ago

Yes, PicoRV32 is very nice. However for what I was building, with limited RAM, compressed instructions would have made a lot of sense. I started porting a BASIC to my system (in C), and it quite easily would have filled almost the whole 512kB SRAM.

And the thought of handwriting one in RISC-V assembly convinced me that maybe RISC-V wasn't as "retro friendly" as I would have liked.

gsmecher (OP) · 3y ago

Understood. Maybe this landed after your project - but both PicoRV32 and SERV now support compressed extensions, at some additional resource cost. FemtoRV32 Quark doesn't - which is not a knock, since it's a beautifully simple implementation and that's the point.

The retrocomputing scene looks like a ton of fun and I'd be delighted if any of my work is used there.

kragen · 3y ago

What is it about RISC-V assembly you didn't like? The little I've done seems like slightly more hassle than amd64 assembly but nothing like the level of bending over backwards of 6502 assembly.

drh · 3y ago

Sounds interesting! What were you using for the display processor?

cmrdporcupine · 3y ago

I was hand-rolling my own. I had it doing a basic 640x480 buffer with some basic character generation and sprite support & HDMI/DVI output

These days I'd probably consider forking my friend Randy's C64 VICII implementation (VIC-II Kawari) and just expand framebuffer size, sprites, colours, etc, since he put so much work into it.

It was a lot of fun, but I got stalled on the SD card interface. That was more complexity than I felt like dealing with at that point. And I was working at Google at the time and so they owned all my thoughts and deeds and going through the open sourcing process for it would have been a hassle. If I wasn't hunting for work and needing to make $$ right now, I'd pick it up again maybe? Was more of a Verilog learning process.

sterlind · 3y ago · 6 in thread

the actual Verilog source is incredibly small. I would have thought that implementing a CPU, even a toy one, would take more than 500 lines. is this normal for hardware?

gsmecher (OP) · 3y ago

What you see is all there is.

At a certain scale, it's conventional for hardware designs to become complex enough that it's necessary to structure them in hierarchies, just to maintain control. This design is small enough that none of the extra structure is essential.

It's possible to be incredibly expressive in Verilog and VHDL. This implementation is written in VHDL, which has an outdated reputation for being long-winded.

Also worth a look: FemtoRV32 Quark [0], which is written in Verilog.

[0]: https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/...

robinsonb5 · 3y ago

Have you seen the OPC series of CPUs? (One Page Computing - the challenge being to keep the code small enough to be printed onto a single sheet of line printer paper!)

gsmecher (OP) · 3y ago

Yup! Thanks for pointing OPC [0] out. These CPUs were a huge eye-opener - and a huge lesson about the value of using a standardized instruction set.

Building a custom CPU commits you to writing an assembler and listing generator - which is a good hobby-project job for one person who's handy with Python. After stumbling through those foothills, though, I found myself at the base of some very steep, scary GCC/binutils cliffs wondering how I could have gotten so lost, so far from home.

Even if all RISC-V does is offer a bunch of arbitrary answers to arbitrary design questions, I consider it a massive win.

[0]: https://revaldinho.github.io/opc/

aseipp · 3y ago

A traditional CPU in its most basic form is nothing more than a programmable state machine where the transition function is the series of instructions you (the programmer) write down, with local state in the form of the register file, and some ports attached to a memory controller (so it can fetch and write instructions and data). A 3-stage fetch/decode/execute pipeline can be done in a very small space if you don't get clever.

This is just that. But nothing more. For example it does not handle any RISC-V CSRs, even the most basic ones. But that's OK: for "computational" machine code kernels that aren't fancy (i.e. basic ALUs get lit up but nothing fancier), you can use software toolchains like GCC to emit compatible code.

A "real" toy CPU, i.e. one that won't win awards but can boot something like Zephyr OS or maybe a miniature OS with some form of memory protection, will require many more lines: for proper exception handling, for that memory protection, timers and peripherals, for extra CPU features (atomics, debug interface, whatever). A comparable CPU for this might be something like PicoRV32, which fits in at about 2,000 lines of Verilog.

But that's a lot of stuff. Sometimes all you need is a programmable state machine. And with this you can run (limited) normal C programs on it with a supported compiler on a 32-bit machine.
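
The "programmable state machine" view above can be sketched in a few lines of Python. This is only an illustration of the idea, not Minimax's logic: the names `step`, `run`, and `PROGRAM` are made up for the example, memory is a plain list of 32-bit words, and only the RV32I `ADDI` and `ADD` instructions are decoded.

```python
# Toy model of a CPU as a state machine: state = (pc, registers),
# transition = execute one instruction. Only ADDI and ADD are handled.
# All names here are hypothetical and chosen for illustration.

MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def step(pc, regs, memory):
    """Fetch, decode, and execute one instruction; return the next state."""
    insn = memory[pc // 4]                 # fetch (word-addressed toy memory)
    opcode = insn & 0x7F                   # decode the fixed RV32I fields
    rd = (insn >> 7) & 0x1F
    funct3 = (insn >> 12) & 0x7
    rs1 = (insn >> 15) & 0x1F
    rs2 = (insn >> 20) & 0x1F
    funct7 = insn >> 25
    regs = list(regs)
    if opcode == 0x13 and funct3 == 0:     # ADDI rd, rs1, imm
        result = regs[rs1] + sign_extend(insn >> 20, 12)
    elif opcode == 0x33 and funct3 == 0 and funct7 == 0:  # ADD rd, rs1, rs2
        result = regs[rs1] + regs[rs2]
    else:
        raise NotImplementedError(hex(insn))
    if rd != 0:                            # x0 is hard-wired to zero
        regs[rd] = result & MASK32
    return pc + 4, regs

def run(program):
    """Run from pc = 0 until execution falls off the end of the program."""
    pc, regs = 0, [0] * 32
    while pc // 4 < len(program):
        pc, regs = step(pc, regs, program)
    return regs

PROGRAM = [
    0x00500093,  # addi x1, x0, 5
    0x00700113,  # addi x2, x0, 7
    0x002081B3,  # add  x3, x1, x2
]
```

Calling `run(PROGRAM)` leaves the sum of the two immediates in `x3`. Everything a "real" core adds (CSRs, exceptions, memory protection) would be extra branches and extra state on top of this loop.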

nine_k · 3y ago

I suspect some heavier lifting is done here:

    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

It looks like the VHDL source is about instruction decoding, registers, etc, but does not include things like ALU logic. (I don't know VHDL actually.)

robinsonb5 · 3y ago

Those two lines are just the VHDL equivalent of #include <stdio.h> - i.e. boilerplate that you'll see in almost every source file.

But it's true that you don't have to describe the ALU down to the bit level - thanks to those two lines you can say "q <= d1 + d2" instead of having to build an adder at the gate level. (Though you can, of course, do that if you really want to!)
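
To make the contrast concrete, here is a rough Python model (not VHDL) of what "building an adder at the gate level" means: a ripple-carry adder assembled from one-bit full adders. The names `full_adder` and `ripple_add` are made up for this sketch; in VHDL, the `+` operator from `numeric_std` lets the synthesis tool pick the adder structure instead.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built only from XOR, AND, and OR."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out

def ripple_add(d1, d2, width=32):
    """Add two unsigned integers bit by bit, discarding the final carry."""
    q, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((d1 >> i) & 1, (d2 >> i) & 1, carry)
        q |= bit << i
    return q
```

For any two 32-bit inputs, `ripple_add(d1, d2)` should agree with `(d1 + d2) & 0xFFFFFFFF`, which is the one-liner that the library-provided operator gives you.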

remexre · 3y ago · 3 in thread

Interesting that shifts are in the <1IPC set; I thought those were fairly cheap with a barrel shifter; does this simply omit one for space purposes, or are they more expensive than I expect?

ColonelPhantom · 3y ago

Barrel shifters are huge in the context of small CPUs (especially on FPGAs). To do a barrel shift, you need (input size) * (shift-amount bits) LUTs, as you need that many mux "stages". That means 32*5=160 on RV32, as the shift amount is 5 bits wide (0 to 31).

OP's CPU takes up around 400 LUTs. Since a 2:1 mux takes up 1 LUT (although it seems the numbers are for a LUT6-based device, which can take a 4:1 mux, so maybe that can make the amount a bit lower?), you would add 160 LUTs. That's quite a lot.
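
The 32*5=160 figure can be checked with a small Python model of a logarithmic barrel shifter. This is a sketch under simplifying assumptions: it only shifts left, it counts every 2:1 mux naively (including those with a constant-zero input, which real synthesis would simplify), and the names `mux2` and `barrel_shift_left` are made up for the example.

```python
def mux2(select, when_0, when_1):
    """A 2:1 multiplexer on single bits."""
    return when_1 if select else when_0

def barrel_shift_left(value, amount, width=32):
    """Logical left shift built from log2(width) stages of 2:1 muxes.

    Returns (shifted value, number of 2:1 muxes used)."""
    stages = (width - 1).bit_length()      # 5 stages for width 32
    bits = [(value >> i) & 1 for i in range(width)]
    mux_count = 0
    for stage in range(stages):
        distance = 1 << stage              # this stage shifts by 1, 2, 4, ...
        select = (amount >> stage) & 1     # one shift-amount bit per stage
        shifted = []
        for i in range(width):
            from_below = bits[i - distance] if i >= distance else 0
            shifted.append(mux2(select, bits[i], from_below))
            mux_count += 1
        bits = shifted
    return sum(bit << i for i, bit in enumerate(bits)), mux_count
```

Each of the five stages needs one mux per output bit, so the count comes out at 160 regardless of the shift amount. Supporting right and arithmetic shifts as well would add logic on top of this.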

irdc · 3y ago

This depends on the fpga's resources: some have barrel shifters as hard IP.

gsmecher (OP) · 3y ago

I don't think this is true - on Xilinx, you can coax a DSP48 macro into implementing a barrel shifter, but the underlying primitive is a multiplier and not a barrel shifter.

Unlike adders, a barrel shifter does not generalize well enough to be implemented as a hard block in its own right.
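
The multiplier trick works because a left shift by n is the same as multiplying by 2^n. A minimal Python sketch of that identity follows; the function names are made up, and it ignores the operand-width limits of a real DSP48 slice.

```python
MASK32 = 0xFFFFFFFF

def one_hot(amount):
    """Decode a 5-bit shift amount into the power of two 2**amount.

    In hardware this would be a small decoder, not another shifter."""
    return 1 << (amount & 0x1F)

def shift_left_via_multiply(value, amount):
    """Left shift computed as value * 2**amount, truncated to 32 bits."""
    return (value * one_hot(amount)) & MASK32
```

The multiplier does the heavy lifting, and the only extra logic is the decoder that turns the shift amount into a one-hot operand.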

downvotetruth · 3y ago · 2 in thread

Can the address and/or data also be 16 bit or would that violate RISC-V spec?

snvzz · 3y ago

AIUI the registers and operations with them should be 32bit for RV32I.

The bus is up to you... should you want an 8bit data bus and a 16bit address bus, I don't think the spec cares.

This is akin to 68020 (32bit ISA) vs 68000 (still 32bit ISA) or 68008 (still 32bit ISA).
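
As a rough illustration of a 32-bit ISA on a narrow bus, the Python sketch below assembles one 32-bit word from four 8-bit bus reads. The function name and the `read_byte` callback are hypothetical; the byte order is little-endian as in RISC-V (the 68000 family mentioned above is big-endian).

```python
def load_word_over_byte_bus(read_byte, address):
    """Assemble a little-endian 32-bit word from four 8-bit bus reads."""
    word = 0
    for cycle in range(4):                 # one bus transaction per byte
        word |= read_byte(address + cycle) << (8 * cycle)
    return word

# Example backing store: the word 0x00500093 stored little-endian.
memory = bytes([0x93, 0x00, 0x50, 0x00])
```

With `read_byte` set to `lambda a: memory[a]`, a load from address 0 takes four bus cycles but still hands the core a full 32-bit value, so software sees the same ISA.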

gsmecher (OP) · 3y ago

I don't think the RISC-V spec cares, either, since it specifies an execution environment but not interfaces.

A narrower data bus would allow a 2-cycle execution path, and would likely split the longest combinatorial path in the current design (which certainly goes through the adder tree.) This could be either a 0.5 instruction-per-clock (IPC) design, or a pipelined design that maintains 1 IPC at the expense of extra pipeline hazards and corresponding bubbles.

A narrower address seems like it's only helpful as a knock-on to a split data bus.

Gut feeling: I doubt that splitting the data or address buses into additional phases would actually save resources. You would certainly need more flip-flops to maintain state, and more LUTs to manage combinational paths across the two execution stages. While you can sometimes add complexity and "win back" gates, it's an approach with limits. If you compare SERV's resource usage to FemtoRV32-Quark's, it's notable how much additional state (flip-flops) SERV "spends" to reduce its combinatorial logic (LUT) footprint.
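
To picture the 2-cycle execution path, here is a Python sketch of a 32-bit add performed on a 16-bit adder in two steps, with the carry held between them. It is an assumption-laden illustration, not a proposal for Minimax: `add32_in_two_cycles` and `carry_ff` are made-up names, and inputs are assumed to be unsigned 32-bit values.

```python
MASK16 = 0xFFFF

def add32_in_two_cycles(a, b):
    """32-bit add on a 16-bit adder: low halves first, then high halves."""
    # Cycle 1: add the low halves; the carry is held in a flip-flop.
    low = (a & MASK16) + (b & MASK16)
    carry_ff = low >> 16
    # Cycle 2: add the high halves plus the saved carry.
    high = (a >> 16) + (b >> 16) + carry_ff
    return ((high & MASK16) << 16) | (low & MASK16)
```

The adder's carry chain is half as long, but the saved carry and the partial result are exactly the kind of extra state (flip-flops) and sequencing logic the gut feeling above is worried about.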

robinsonb5 · 3y ago · 1 in thread

That is very cool. I'm particularly interested in the compressed-first approach because I have some projects where minimising BRAM usage is paramount so code density really matters. The use of microcode to emulate 32-bit instructions reminds me a lot of ZPU (I still have a soft spot for that architecture) - was that an influence?

gsmecher (OP) · 3y ago

I've heard of the ZPU in passing but never looked in much detail - I didn't realize there was a GCC back-end for these machines. James Bowman's J1 CPU [0] is also stack-based and has definitely helped me shape my preferences.

[0]: https://excamera.com/files/j1.pdf

tomcam · 3y ago · 1 in thread

> RISC-V's compressed instruction (RVC) extension is intended as an add-on

Doesn’t it make this… an IISC? Increased instruction set? Asking for a friend

znwu · 3y ago

RISC no longer has the clear border as it had 30 years ago. Nowadays RISC just means an ISA has most of the following points: 1. Load/Store architecture 2. Fixed-length instructions or few length variations. 3. Highly uniform instruction encoding. 4. Mostly single-operation instructions.

These four points all have direct benefits on hardware design. And compressed ISAs like RVC and Thumb check them all.

On the contrary, "fewer instruction types" and "orthogonal instructions" never had any real benefit beyond perceptual aesthetics, so as a result they are long abandoned.
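
The "few length variations" point is easy to see for RISC-V with the C extension: the two lowest bits of an instruction tell the fetch unit whether it is 16 or 32 bits long. A minimal Python sketch, with a made-up function name, and ignoring the longer-than-32-bit encodings that the base spec leaves room for:

```python
def instruction_length(first_halfword):
    """Return the instruction length in bytes for RV32 with the C extension.

    A 32-bit instruction has 0b11 in its two lowest bits; any other value
    marks a 16-bit compressed instruction."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2
```

For example, the low halfword of `addi x1, x0, 5` (0x00500093) is 0x0093 and decodes as a 4-byte instruction, while `c.addi x1, 5` (0x0095) decodes as 2 bytes.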
