Except for Intel, which publishes lots of technical documentation on their GPUs: https://kiwitree.net/~lina/intel-gfx-docs/prm/
You can also find the i810/815 manuals elsewhere online, and apart from an odd gap between those and the 965 (i.e. the 855/910/915/945 are missing for some reason), they've been pretty consistent with the documentation.
It includes full ISA documentation for their current and past offerings, though the docs look like they tend to be aimed at implementors rather than at a "high level" description for interested enthusiasts.
There is almost no explanation about how they are intended to be used and about the detailed microarchitecture of their GPUs. For that, the best remains to read the source code of their Linux drivers, though even that is not always as informative as it could be, as some of the C code may have been automatically generated from some other form used internally by AMD.
The Intel documentation is much more complete.
Nevertheless, AMD has recently promised on several occasions to publish additional GPU documentation and to open-source more parts of their GPU drivers in the near future, so hopefully the quality of their documentation will one day again match Intel's, as it did until around 2000.
[The Thirty Million Line Problem - Casey Muratori](https://www.youtube.com/watch?v=kZRE7HIO3vk)
I know the terminology has gotten quite loose in recent years with Nvidia & Co. selling server-only variants of their graphics architectures as GPUs, but the "graphics" part of GPU designs makes up a significant part of the complexity, to this day.
That's not a good definition, since by it a CPU or a DSP would count as a GPU; both have been used for that purpose in the past.
> There's still use for GPUs even if they're not outputting anything.
The issue is not their existence, it's about calling them GPUs when they have no graphics functionality.
Here's another: https://github.com/jbush001/NyuziProcessor
What size run would be needed for TSMC or some other fab to produce such a processor economically?
If you abstract away the complexities and focus on building a system using some pre-built IP, FPGA design is pretty easy. I always point people to something like MATLAB, so they can create some initial applications using HDL Coder on a DevKit with a Reference design. Otherwise, there's the massive overhead of learning digital computing architecture, Verilog, timing, transceivers/IO, pin planning, Quartus/Vivado, simulation/verification, embedded systems, etc.
In short, start with some system-level design. Take some plug-and-play IP, learn how to hook it together at the top level, and insert that module into a prebuilt reference design. Eventually, peel back the layers to reveal the complexity underneath.
1. Read Harris, Harris → Digital Design and Computer Architecture. (2022). Elsevier. https://doi.org/10.1016/c2019-0-00213-0
2. Follow the author's RVFpga course to build an actual RISC-V CPU on an FPGA → https://www.youtube.com/watch?v=ePv3xD3ZmnY
I might add these:
- Computer Architecture, Fifth Edition: A Quantitative Approach - https://dl.acm.org/doi/book/10.5555/1999263
- Computer Organization and Design RISC-V Edition: The Hardware Software Interface - https://dl.acm.org/doi/10.5555/3153875
both by Patterson and Hennessy
Edit: And if you want to get into CPU design and can get a grip on "Advanced Computer Architecture: Parallelism, Scalability, Programmability" by Kai Hwang, then I'd recommend that too. It's super old and some things are probably done differently in newer CPUs, but it's exceptionally good for learning the fundamentals. Very well written. But I think it's hard to find a good (physical) copy.
1. Clone this educational repo https://github.com/yuri-panchul/basics-graphics-music - a set of simple labs for those learning Verilog from scratch. It's written by Yuri Panchul, who worked at Imagination developing GPUs, by the way. :)
2. Obtain one of the dozens of supported FPGA boards and some accessories (keys, LEDs, etc.).
3. Install Yosys and friends.
4. Perform as many labs from the repo as you can, starting from lab01 - DeMorgan.
You can exercise labs while reading Harris&Harris. Once done with the labs and with the book, it's time to start your own project. :)
PS: They have a weekly meetup at HackerMojo, you can participate by Zoom if you are not in the Valley.
1. https://learn.saylor.org/course/CS301
If it is an MCU based on a simple ARM Cortex-M0, M0+, M3 or a RISC-V RV32I core, then you could use an iCE40 or similar FPGA to provide a big acceleration just by using the DSPs and the big SPRAM.
Basically, add the custom compute operations and storage that don't exist in the MCU, operations that would otherwise take many instructions in SW. Also, offloading to the FPGA AI 'co-processor' frees up the MCU to do other things.
The kernel operations in the Tiny GPU project are actually really good examples of things you could efficiently implement in an iCE40UP FPGA device, resulting in substantial acceleration. And using EBRs (block RAM) and/or the SPRAM for block queues would make a nice interface to the MCU.
One could also implement a RISC-V core in the FPGA, thus getting a single chip with a low-latency interface to the AI accelerator. You could even implement the AI accelerator as a set of custom instructions. There are so many possible solutions!
An ice40UP-5K FPGA will set you back 9 EUR in single quantity.
This concept of course scales up to performance and cost levels you talk about. With many possible steps in between.
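To make the "custom compute operations" concrete, here's a behavioral sketch in Python (function name and example values are mine, not from the Tiny GPU project) of the fixed-point multiply-accumulate pattern that maps almost one-to-one onto the iCE40UP's 16x16 DSP blocks:

```python
def mac_dot(xs, ws, acc_bits=32):
    """Fixed-point dot product: the multiply-accumulate (MAC) loop that
    an iCE40UP DSP block can perform at one multiply per cycle."""
    mask = (1 << acc_bits) - 1
    acc = 0
    for x, w in zip(xs, ws):
        acc = (acc + x * w) & mask  # accumulator wraps like a 32-bit register
    return acc

# A 3x3 convolution at one pixel is just a 9-element MAC; here is a
# 4-element example: 1*5 + 2*6 + 3*7 + 4*8 = 70.
print(mac_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

In hardware this loop would be a DSP block plus a small state machine, with EBR/SPRAM feeding the operands, which would be one such MCU-facing "custom operation."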
It could be as simple as using a CNN instead of a V matrix. Yes, this makes the architecture less efficient, but it also makes it easier for an accelerator to speed it up, since CNNs tend to be compute bound.
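A rough back-of-the-envelope for "compute bound" (illustrative numbers, not from any particular model): a conv layer reuses every weight at each spatial position, while a plain matrix-vector product reads each weight exactly once:

```python
def conv_macs_per_weight_byte(h, w, cin, cout, k=3, bytes_per_weight=4):
    """MACs per byte of weight traffic for a k x k conv over an h x w map:
    every weight is reused at all h*w positions."""
    macs = h * w * cout * cin * k * k
    weight_bytes = cout * cin * k * k * bytes_per_weight
    return macs / weight_bytes  # simplifies to h*w / bytes_per_weight

def matvec_macs_per_weight_byte(m, n, bytes_per_weight=4):
    """Matrix-vector product: each of the m*n weights is used once."""
    return (m * n) / (m * n * bytes_per_weight)  # always 1 / bytes_per_weight

# A 32x32 feature map gives 256 MACs per weight byte vs. 0.25 for matvec,
# which is why the conv ends up compute bound and the matvec memory bound.
print(conv_macs_per_weight_byte(32, 32, 16, 16))
print(matvec_macs_per_weight_byte(100, 100))
```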
Was? https://opencores.org/projects?language=VHDL. Or is that not the same but similar?
It's so easy to write "DIV: begin alu_out_reg <= rs / rt; end" in your Verilog, but that one line takes a lotta silicon. And the person simulating this might never see that, if all they do is simulate the Verilog.
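For a sense of scale, here's (as a Python behavioral model, not HDL) roughly what the synthesizer has to unroll for that one `/`: a restoring divider doing one compare-and-subtract stage per result bit, so 32 chained stages for a 32-bit combinational divide:

```python
def restoring_div(dividend, divisor, width=32):
    """Unsigned restoring division: one shift/compare/subtract per bit.
    A combinational '/' in Verilog becomes all `width` stages in silicon."""
    assert divisor != 0
    rem, quot = 0, 0
    for i in range(width - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)  # shift in next dividend bit
        if rem >= divisor:                        # trial subtraction succeeds
            rem -= divisor
            quot |= 1 << i
    return quot, rem

print(restoring_div(100, 7))  # (14, 2), i.e. 100 = 7*14 + 2
```

Real designs usually make this a multi-cycle or pipelined unit instead, which is exactly the work the simulation-only view hides.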
The project stops at simulation; making real hardware out of this requires much more work.
More specifically [1] we could also call CPUs long / deep data dependency processors (LDDPs) and GPUs wide / flat data dependency processors (WDDPs).
[0]: https://en.wikipedia.org/wiki/Amdahl%27s_law
[1]: https://en.wikipedia.org/wiki/Data_dependency
But as you observe, we are stuck in a local optimum where GPUs are optimized for throughput and CPUs for latency sensitive work.
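For reference, Amdahl's law [0] in two lines, with p the parallelizable fraction of the work and s the speedup applied to that fraction:

```python
def amdahl_speedup(p, s):
    """Overall speedup when fraction p of the runtime is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even with an effectively infinite 's' (a very wide GPU), a 5% serial
# remainder caps the whole program at about 20x:
print(round(amdahl_speedup(0.95, 1e9)))  # 20
```

Which is one way to see why the long/deep (latency) and wide/flat (throughput) machines both stay necessary.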
Tensors are just n-dimensional arrays
Then you can run software (firmware) on top of the TPU to make it behave like a GPU.
I'll be honest, I've never heard the AIA acronym used in this way. It seems all acronyms for all processors need to end in PU, for better or for worse.
What is it exactly that sets these units apart from CPUs? Something to do with the parallel nature of the hardware?
...I for my part want to say thanks for the findings! :-)
[Setting: weekend mode]
> In real GPUs, individual threads can branch to different PCs, causing branch divergence where a group of threads initially being processed together has to split out into separate execution.
Whoops. Maybe this person should try programming for a GPU before attempting to build one out of silicon.
Not to mention the whole SIMD that... isn't.
(This is the same person who stapled together other people's circuits to blink an LED and claimed to have built a CPU)
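For anyone wondering what real GPUs actually do on divergence: the warp keeps a single PC plus an execution mask, and runs both sides of the branch serially with the inactive lanes masked off. A toy model in Python (lane values and functions are made up for illustration):

```python
def simt_branch(lanes, cond, then_fn, else_fn):
    """Toy SIMT warp: all lanes share one PC, so a divergent if/else
    executes BOTH paths, each under a per-lane execution mask."""
    mask = [cond(x) for x in lanes]
    out = list(lanes)
    for i, x in enumerate(lanes):      # pass 1: 'then' side, others idle
        if mask[i]:
            out[i] = then_fn(x)
    for i, x in enumerate(lanes):      # pass 2: 'else' side, first group idles
        if not mask[i]:
            out[i] = else_fn(x)
    return out                         # lanes reconverge after both passes

# Four lanes diverge two/two, so the warp pays for both branch bodies:
print(simt_branch([1, 2, 3, 4], lambda x: x % 2 == 0,
                  lambda x: x * 10, lambda x: -x))
```

The cost is that the warp's runtime is the sum of both paths, which is exactly why divergence-free SIMD-style code runs so much better on GPUs.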
https://engineering.purdue.edu/~smidkiff/ece563/slides/GPU.p...