Open source RISC-V GPGPU

(github.com)

207 points · 1ntEgr8 · 4y ago · 58 comments

39 comments · 12 top-level

unsigner · 4y ago · 8 in thread

We should really have another word for “chip that runs OpenCL but has no rasterizer”.

I see the title was edited to call it a “GPGPU”, or “general-purpose GPU”, but that’s not a thing. GPGPU was an early moniker for when people tried to do non-graphics work on GPUs many years ago; it was a word for techniques, never for a specific type of hardware. Plus, it feels to me that “general purpose” should be something more than a GPU, while this is strictly less.

raphlinus · 4y ago

I don't really agree. I think it's completely valid to explore a GPU architecture in which rasterization is done in software, with perhaps a bit of support in the ISA. That's what they've done here, and they do demonstrate running OpenGL ES.

The value of this approach depends on the workload. If it's mostly rasterizing large triangles with simple shaders, then a hardware rasterizer buys you a lot. However, as triangles get smaller, a pure software rasterizer can win (as demonstrated by Nanite). And as you spend more time in shaders, the relative amount of overhead from software rasterization decreases; this was shown in the cudaraster paper[1].

Overall, if we can get simpler hardware with more of a focus on compute power, I think that's a good thing, and I think it's completely fine to call that a GPU.

[1]: https://research.nvidia.com/publication/high-performance-sof...

justsid · 4y ago

This is essentially where Sony was trying to go with their Cell architecture in the PS3. Only at the very end did they realize that they actually needed a GPU that can do rasterization in hardware. In fact, a lot of games actually did graphics workloads on the SPEs to help out the pretty weak GPU. The concept can definitely work, especially if the driver takes care of all the programmable bits and exposes a more classical graphics pipeline to the host.

zbendefy · 4y ago

OpenCL has a category called CL_DEVICE_TYPE_ACCELERATOR for that, so something like 'Accelerator' seems to fit.

avianes · 4y ago

That term is "SIMT architecture."

Modern GPUs (or GPGPUs) are based on the SIMT programming model, which requires an SIMT architecture.

zozbot234 · 4y ago

"SIMT" is not an architecture, it's just a programming model that ultimately boils down to wide SIMD instructions with conditional execution. Add that to a barrel processor that can hide memory latency across a sizeable amount of hardware threads, and you've got the basics of a GPU "core".
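
As a rough illustration of "wide SIMD instructions with conditional execution", here is a minimal Python sketch of how an if/else can run across lanes under an execution mask. The function name `simt_if_else` and the list-of-lanes model are assumptions made for this example; it is not Vortex's implementation or any real GPU ISA.

```python
# Illustrative model of per-lane predication; not any real GPU ISA.
from typing import Callable, List


def simt_if_else(
    lanes: List[int],
    cond: Callable[[int], bool],
    then_op: Callable[[int], int],
    else_op: Callable[[int], int],
) -> List[int]:
    """Run an if/else over all lanes the way a masked SIMD unit would."""
    # Every lane evaluates the condition, producing an execution mask.
    mask = [cond(x) for x in lanes]
    out = list(lanes)

    # Pass 1: the "then" instruction is issued to every lane, but only
    # lanes whose mask bit is set commit a result.
    for i, x in enumerate(lanes):
        if mask[i]:
            out[i] = then_op(x)

    # Pass 2: the mask is inverted and the "else" instruction is issued.
    for i, x in enumerate(lanes):
        if not mask[i]:
            out[i] = else_op(x)

    return out


# Even lanes take the "then" path, odd lanes take the "else" path.
result = simt_if_else(
    [1, 2, 3, 4],
    cond=lambda x: x % 2 == 0,
    then_op=lambda x: x * 10,
    else_op=lambda x: -x,
)
```

Both branches are issued to every lane and the mask only decides which lanes commit, which is why a divergent branch costs roughly the sum of both paths.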

nine_k · 4y ago

Vector processors? Follow the early Cray nomenclature.

avianes · 4y ago

The terminology "vector processor" refers to a completely different type of architecture. Using it for an SIMT architecture would be confusing.

eqvinox · 4y ago

VPU? With network processors being called NPU these days...

akmittal · 4y ago · 3 in thread

It's great to see RISC-V making a lot of progress. A lot of research is coming from China because of US bans, but hopefully this will be good for the whole world.

zucker42 · 4y ago

Which U.S. bans are you talking about? Is there anywhere I can read more about this?

bee_rider · 4y ago

We occasionally ban companies that make HPC parts (Intel, NVIDIA, AMD) from selling to Chinese research centers, generally citing concerns that they could be used for weapons R&D (nuclear weapons simulation for example).

2015: https://spectrum.ieee.org/us-blacklisting-of-chinas-supercom...

2019: https://www.yahoo.com/now/trump-bans-more-chinese-tech-21140...

2021: https://www.bloomberg.com/news/articles/2021-04-08/u-s-adds-...

grawlinson · 4y ago

It'll probably be something like this[0] and this[1]. I think there are more export restrictions than these two examples.

[0]: https://en.wikipedia.org/wiki/Export_of_cryptography_from_th...

[1]: https://edition.cnn.com/2020/12/18/tech/smic-us-sanctions-in...

zackmorris · 4y ago · 3 in thread

I want the opposite of this - a multicore CPU that runs on GPU or FPGA. Vortex looks really cool, but if they jump over a level of abstraction by only offering an OpenCL interface instead of access to the underlying cores, then I'm afraid I'm not interested.

I just need a chip that can run at least 256 streams of execution, each with their own local memory (virtualized to appear contiguous). This would initially be for running something like Docker, but would eventually run a concurrent version of something like GNU Octave (Matlab), or languages like Julia that at least make an attempt to self-parallelize. If there is a way to do this with Vortex, I'm all ears.

I've gone into this at length in my previous comments. The problem is that everyone jumped on the SIMD bandwagon when what we really wanted was MIMD. SIMD limits us to a very narrow niche of problems to solve like neural nets and rasterization. But it prevents us from discovering the emergent behavior of large stochastic networks running stuff like genetic algorithms or the elegant/simple algorithms like ray tracing. That's not handwaving, I'm being very specific here in what I'm saying, and feel that this domination of the market by a handful of profit chasers like Nvidia has set computing back at least 20 years.

JonChesterfield · 4y ago

I think this is available now. The waves/wavefronts on a GPU run independently. Communication between them isn't great, independent is better.

Given chips from a couple of years ago have ~64 compute units, each running ~32 wavefronts, your 256 target looks fine. It's one block of contiguous memory, but using it as 256 separate blocks would work great.

I don't know of a ready-made language targeting the GPU like that.

klelatti · 4y ago

I may be missing something here but what do you mean by a CPU that runs on a GPU?

Also how does "256 streams of execution, each with their own local memory (virtualized to appear contiguous)" differ in practice from one of the recent CPUs with lots of cores - e.g. AMD / AWS Arm?

zackmorris · 4y ago

Well, this all goes back to when I was heavily into C++, assembly and blitters in the mid to late 90s when I was trying to run a shareware game business. I realized almost immediately that the real bottleneck in games is memory bandwidth, not processing power. This was right at the time that Quake III came out and everyone was trying to get a Voodoo2, I think it was? CPUs with FPUs had only gone mainstream maybe 5 years before that, and people were still arguing about Pentium vs 486 DX4. I was on Mac, but I don't think I even had a PowerPC yet.

Then everyone got video cards and CPU performance stopped improving almost overnight. Sure, we got 200 MHz Pentium IIs, and then Intel jumped warp speed into 1 GHz and then 2 GHz and then 3 GHz... but single threaded performance wasn't any faster, and even today is only maybe 3 times faster than it was then, per clock cycle. What really happened is that all of the chip area went to branch prediction and caching.

When chips went from a few million transistors to a billion, I started asking why we couldn't just put dozens or hundreds of the old CPU cores on the new chips. As we all saw though, nobody listened or cared about that. So today we have behemoth chips that still choke when the web browser has a lot of tabs open.

Chips today have maybe 8 or 16 cores, and that's great. But it's 2 orders of magnitude less than the transistor budget could support. Apple's M1 is loosely trying to do what I'm asking. But it's making the mistake of having all of these proprietary/dedicated cores for SIMD stuff. I would scrap all of that, and go with a 2D array of general-purpose cores, each with their own local memories, communicating using web metaphors like content-addressable memory.

In fairness, I think the reason that real multicore CPUs never caught on is that we didn't have the languages to utilize them. But today we have Matlab and various Lisps and higher order methods that auto-parallelize loops by treating them as transformations on arrays. All of our languages should have been auto-parallelized by now anyway. And not with SIMD optimization magic; I mean by statically analyzing code and converting it all first into higher order methods, then optimizing that intermediate code (I-code) so that the block copies are spread over multiple cores and memories. I can't remember the term for this. It's basically divide and conquer though, for example if fork/join scope was limited to a single function by the runtime. Scatter gather and map reduce are other terms for this.

So right now we have to deal with promises and async and other patterns (I consider patterns an anti-pattern) when we could just be using an ordinary language like Javascript or C, auto-parallelized to run on 256+ cores with something like terabytes per second of bandwidth, running many thousands of times faster than computers today, for far less effort because it appears as a single thread of execution. Then OpenCL or OpenGL or anything else could run like any other library above that, for people that prefer a higher-level interface.
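
To make the "loops as transformations on arrays" idea concrete, here is a minimal hand-written sketch of the fork/join shape described above: split an array, map a pure function over the chunks, then reduce the partial results. The names `chunked` and `parallel_sum_of_squares` are hypothetical, and this is manual parallelization rather than the automatic compiler transformation the comment asks for. It uses a thread pool only to show the structure; CPython threads will not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add
from typing import List, Sequence


def chunked(data: Sequence[int], n_chunks: int) -> List[Sequence[int]]:
    """Split data into at most n_chunks contiguous slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk: Sequence[int]) -> int:
    """The pure per-chunk work: no shared state, so chunks are independent."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: Sequence[int], workers: int = 4) -> int:
    # Fork: hand each chunk to a worker (the "map" step).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunked(data, workers)))
    # Join: combine the partial results (the "reduce" step).
    return reduce(add, partials, 0)


total = parallel_sum_of_squares(list(range(10)))
```

The caller sees a single ordinary function call; the fork and join are confined to that one function, which is the scoping the comment describes.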

chalcolithic · 4y ago · 3 in thread

Wow! Just add NaN-boxing support (for JavaScript and possibly other dynamic languages) and it'll be the CPU I dreamed about.

sitkack · 4y ago

For those unfamiliar with NaN-boxing:

https://anniecherkaev.com/the-secret-life-of-nan

> One use is NaN-boxing, which is where you stick all the other non-floating point values in a language + their type information into the payload of NaNs. It’s a beautiful hack.
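
A minimal sketch of the idea in Python, assuming a made-up layout (3 tag bits above a 48-bit payload inside a quiet NaN). Real engines each pick their own layout, so the constants and the helper names `box_int`, `unbox` and `is_boxed` are illustrative only. Values are kept as raw 64-bit patterns because NaN payloads are not guaranteed to survive a round trip through a native float on every platform.

```python
import math
import struct

# Hypothetical layout, chosen for this example only:
#   bits 62..51  exponent all ones plus the quiet bit (a quiet NaN)
#   bits 50..48  type tag (0 is reserved for "a real NaN")
#   bits 47..0   payload
QNAN = 0x7FF8_0000_0000_0000
TAG_SHIFT = 48
TAG_MASK = 0x7
PAYLOAD_MASK = (1 << 48) - 1
TAG_INT = 1  # hypothetical tag meaning "48-bit unsigned integer"


def float_to_bits(value: float) -> int:
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def box_int(value: int) -> int:
    """Hide a small unsigned integer in the payload of a quiet NaN."""
    if not 0 <= value <= PAYLOAD_MASK:
        raise ValueError("value does not fit in the 48-bit payload")
    return QNAN | (TAG_INT << TAG_SHIFT) | value


def is_boxed(bits: int) -> bool:
    """True if the pattern is a quiet NaN carrying a non-zero tag."""
    return (bits & QNAN) == QNAN and ((bits >> TAG_SHIFT) & TAG_MASK) != 0


def unbox(bits: int) -> tuple:
    """Return (tag, payload) from a boxed pattern."""
    return (bits >> TAG_SHIFT) & TAG_MASK, bits & PAYLOAD_MASK


boxed = box_int(42)
```

Ordinary doubles such as 1.5 pass through untouched, while the boxed pattern still reads as NaN to any floating-point code that looks at it.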

nynx · 4y ago

It’s a GPU.

chalcolithic · 4y ago

Yes, and I wanted a GPU-style CPU that could handle all the tasks in the system so no host CPU is necessary.

pabs3 · 4y ago · 2 in thread

OpenCL seems to be kind of dying (e.g. Blender abandoned it). I wonder what is going to replace it.

my123 · 4y ago

CUDA is what ended up replacing it, or rather, OpenCL had always failed to make a dent over the long term.

(with AMD ROCm being a CUDA API clone, without the PTX layer)

DeathArrow · 4y ago

But is there anyone using ROCm in production? Is ROCm up to par with CUDA?

NotCamelCase · 4y ago · 2 in thread

This is an amazing project considering the scope of work required on both sides of the aisle -- HW and SW.

I find the choice of RISC-V pretty interesting for this use case, as it's a fixed-size ISA and there is a significant amount of auxiliary data usually passed from drivers to HW in typical GPU settings, even for GPGPU scenarios alone. If you look at one of their papers, it shows how they pass extra texture parameters via CSRs. I think this might become a bottleneck and a limiting factor in the design for future expansions. I am currently doing similar work (>10x smaller in comparison) on a more limited feature set, so I am really curious how it'll turn out.

zozbot234 · 4y ago

RISC-V is not "fixed size", the encoding has room for larger instructions (48-bit, 64-bit or more).
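
For reference, a small sketch of the length encoding being described, based on the scheme in the RISC-V unprivileged spec where the low bits of the first 16-bit parcel give the instruction length. The function name is hypothetical, and newer spec revisions treat the encodings beyond 32 bits as not frozen, so take the 48-bit and 64-bit cases as the historically documented proposal rather than settled.

```python
def rv_instruction_length(first_parcel: int) -> int:
    """Return the instruction length in bytes, given its first 16-bit parcel."""
    low = first_parcel & 0xFFFF

    # bits[1:0] != 11: 16-bit compressed instruction (C extension).
    if (low & 0b11) != 0b11:
        return 2
    # bits[1:0] == 11 and bits[4:2] != 111: standard 32-bit instruction.
    if (low & 0b11100) != 0b11100:
        return 4
    # bits[5:0] == 011111: 48-bit instruction.
    if (low & 0b111111) == 0b011111:
        return 6
    # bits[6:0] == 0111111: 64-bit instruction.
    if (low & 0b1111111) == 0b0111111:
        return 8
    # Everything else is a longer or reserved encoding; not handled here.
    raise ValueError("longer or reserved instruction-length encoding")


# 0x00000013 is `addi x0, x0, 0` (the canonical NOP), a 32-bit instruction.
nop_length = rv_instruction_length(0x00000013)
# 0x0001 is `c.nop`, its 16-bit compressed counterpart.
c_nop_length = rv_instruction_length(0x0001)
```

A core that only implements a subset such as RV32IMF only ever sees the 4-byte case, which is the sense in which the follow-up reply calls it fixed.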

NotCamelCase · 4y ago

I guess you're referring to variable-length encoding support? It's fixed as in they only implement the RV32IMF subset here. Even then, code density may be a source of bottlenecks along the way.

R0b0t1 · 4y ago · 2 in thread

I've tried looking up the hardware they run on. Anyone have a price?

detaro · 4y ago

Expensive. Exact parts aren't clear, but hundreds of dollars for a single chip and 5k+ for a devkit from a quick look?

But running on FPGA is really only the testing stage for putting it in an ASIC if something like this wants to be competitive in any way.

R0b0t1 · 4y ago

I thought so. Hundreds for the chip isn't insane (depending on how many) but $5k for the dev kit, oof.

throwaway81523 · 4y ago · 2 in thread

A GPGPU in an FPGA. Interesting, but 100x slower than a commodity AMD or NVidia card.

fahadkhan · 4y ago

It's a research project. It's open source. FPGAs are often used for developing hardware. If it gets good enough for someone's use case, they will print the chips.

gumby · 4y ago

Perfect way to prototype hardware.

raphlinus · 4y ago · 1 in thread

This is a research project from Georgia Tech. There's a homepage at [0] and a paper at [1]. It is specialized to run OpenCL, but with a bit of support for the graphics pipeline, mostly a texture fetch instruction. It appears to be fairly vanilla RISC-V overall, with a small number of additional instructions to support GPU. I'm very happy to see this kind of thing, as I think there's a lot of design space to explore, and it's great that some of that is happening in academic spaces.

[0]: https://vortex.cc.gatech.edu/

[1]: https://vortex.cc.gatech.edu/publications/vortex_micro21_fin...

hajile · 4y ago

Intel's Larrabee/Xeon Phi shows that there's a ton of potential here.

Intel's big issue is that x86 is incredibly inefficient. Implementing the base instruction set is very difficult, and trying to speed it up at all starts drastically increasing core size. This means that the ratio of overhead to SIMD is pretty high.

RISC-V excels at tiny implementations and power efficiency. The ratio of SIMD to the rest of the core should be much higher, resulting in better overall efficiency.

The final design (at a high level) seems somewhat similar to AMD's RDNA with a scalar ALU doing the flow control while a very wide SIMD does the bulk of the calculations.

ksec · 4y ago · 1 in thread

Nice. Instead of trying to tackle the CPU space, RISC-V should really be doing more work in the GPGPU space with open source drivers.

Current GPUs are the biggest black box and mystery in modern computing.

sitkack · 4y ago

RISC-V (with no vectors) was the base ISA to support the real goal of making vector processors; it was supposed to be a short side quest. It took much longer than expected, but it was 1000% worth it.

RVV (RISC-V Vector Extension) is the real coup and ultimately what the base ISA is there to support.

https://youtu.be/V7fuE1yXUxk?t=104

https://www.youtube.com/watch?v=oTaOd8qr53U

GPUs might be complex beasts, but ultimately it is lots of FMAs (fused multiply-adds) that do most of our calculations.

https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...
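
A small sketch of what "fused" buys you: the product is not rounded before the add. `fma_reference` is a hypothetical helper that emulates the single rounding with exact rational arithmetic for finite inputs; it is a reference model, not how hardware does it.

```python
from fractions import Fraction


def fma_reference(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to the nearest double.

    Works for finite inputs only; Fraction cannot represent inf or NaN.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = a * b + c               # two roundings: multiply, then add
fused = fma_reference(a, b, c)    # one rounding at the very end
```

Here the separate multiply and add lose the small term entirely and give 0.0, while the fused form keeps it and gives -2**-60.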

bullen · 4y ago

I think there is another project doing a RISC-V GPU: https://www.pixilica.com/graphics

Also, the latest announced RVB-ICE should have an OpenGL ES 3+ capable Vivante GC8000UL GPU (I did not manage to find documentation for this exact version, but all GC8000 parts seem to be): https://www.aliexpress.com/item/1005003395978459.html

Disclaimer: expensive if you don't know whether it's vapourware or how the drivers and Linux work!

d_tr · 4y ago

The two supported FPGA families are a blessing for this kind of project, since they have hardware floating-point units. Unfortunately they are quite expensive, like the Xilinx ones with this feature...
