Hello World on the GPU (2019) (opens in new tab)

(acko.net)

95 pointsthdespou2y ago42 comments

42 comments

19 comments · 5 top-level

As of this year (ish), `int main() {puts("hello, world\n");}` stands a decent chance of running on a GPU and doing the right thing if you compile it with clang. Terminal application style. Should be able to spell it printf shortly, variadic functions turn out to be a bit of a mess.

pjmlp2y ago

CUDA already does printf, and C++20 support, minus modules.

JonChesterfield2y ago

C++20 would be news to me. Do you have a reference? The closest I can find is https://github.com/NVIDIA/cccl which seems to be atomic and bits of algorithm. E.g. can you point to unordered_map that works on the target?

I think some pieces of libc++ work but don't know of any testing or documentation effort to track what parts, nor of any explicit handling in the source tree.

2 more replies

krackers2y ago

What does that do under the hood though? What does it mean to execute puts from a GPU?

JonChesterfield2y ago

Libc on x64 is roughly a bunch of userspace code over syscall which traps into the kernel. Looks like a function that takes six integer registers and writes results to some of those same registers.

Libc on nvptx or amdgpu is a bunch of userspace code over syscall, which is a function that takes eight integers per lane on the GPU. That "syscall" copies those integers to the x64/host/other architecture. You'll find it in a header called rpc.h, the same code compiled on host or GPU. Sometime later a thread on the host reads those integers, does whatever they asked for (e.g. call the host syscall on the next six integers), possibly copies values back.

Puts probably copies the string to the host 7*8 bytes at a time, reassembles it on the host, then passes it to the host implementation of puts. We should be able to kill the copy on some architectures. Some other functions run wholly on the GPU, e.g. sprintf shouldn't talk to the host, but fprintf will need to.

The GPU libc is fun from a design perspective because it can run code on either side of that communication channel as we see fit. E.g. printf floating point handling seems prone to large numbers of registers needed on the GPU at the moment so we may move some work to the host to make the register usage better (higher occupancy).

KeplerBoy2y ago

Do you happen to have a link to these developments?

JonChesterfield2y ago

Documentation is lagging reality a bit, we'll probably fix that around the next llvm release. Some information is at https://libc.llvm.org/gpu/using.html

That GPU libc is mostly intended to bring things like fopen to openmp or cuda, but it turns out GPUs are totally usable as bare metal embedded targets. You can read/write to "host" memory, on that and a thread running on the host you can implement a syscall equivalent (e.g. https://dl.acm.org/doi/10.1145/3458744.3473357), and once you have syscall the doors are wide open. I particularly like mmap from GPU kernels.

2 more replies

hutzlibu2y ago· 6 in thread

"Graphics programming can be intimidating. It involves a fair amount of math, some low-level code, and it's often hard to debug. Nevertheless I'd like to show you how to do a simple "Hello World" on the GPU. You will see that there is in fact nothing to be afraid of."

57 created objects later

"Hm. Damn"

Well .. there is a reason it is usually "hello triangle" on GPU tutorials. Spoiler alert, GPUs ain't easy.

pjmlp2y ago

3D APIs were easier back in the yearly days.

Nowadays using "legacy" APIs is relatively easy, however it requires a background knowledge on how GL became GL 4.6, DX became DX 11 and so.

Modern APIs are super low level, they are designed as GPU APIs for driver writers basically. Since they cut the fat legacy API drivers used to take care for the applications, now everyone has to deal with such complexity directly, or make use of a middleware engine instead.

kimixa2y ago

What's wrong with using middleware engines? That was pretty much the expected case for people who didn't /need/ the level of control exposed by APIs like vulkan or dx12.

It's pretty much what OpenGL "drivers" were doing past the introduction of hardware shaders anyway - acting as a pretty thick middleware translating that to low-level commands, only having the user API being locked into a design from decades ago.

And considering how hard it was to get Khronos to eventually agree on vulkan in the first place (that effectively being a drop from AMD in "Mantle" then only tweaked by the committee), I'm not surprised they haven't standardized a higher-level API. So third party middleware it is.

1 more reply

MaxBarraclough2y ago

> now everyone has to deal with such complexity directly, or make use of a middleware engine instead

You don't really have to though, you can still use the higher-level older graphics APIs. It wouldn't have made much sense for Vulkan to include a high-level graphics API as well, as those APIs already exist and have mature ecosystems.

Similarly, in Windows land, you aren't forced to use D3D12, you can still use D3D11 or even D3D9.

1 more reply

raytopia2y ago

Well if you use glBegin it's pretty easy.

glBegin(GL_TRIANGLES);

  glVertex3f( 0.0f, 1.0f, 0.0f);              
  glVertex3f(-1.0f,-1.0f, 0.0f);              
  glVertex3f( 1.0f,-1.0f, 0.0f);

glEnd();

And there you go you got a triangle.

It's great for beginners because they can see the results very fast and once they want to start having crazy graphical effects or need more performance you can move to shaders.

jacquesm2y ago

The hardware was designed to display triangles, not to do 'Hello world'.

So it's not all that surprising that the one is easier than the other, in a way it is surprising that the other can be done at all. But as CPUs and GPUs converge it's quite possible that NV or another manufacturer eventually slips enough general purpose capacity onto their cards that they function as completely separate systems. And then 'Hello world' will be trivial.

hutzlibu2y ago

Erm yes. A triangle is quite easy .. but here they tried a simple tutorial to actually print "Hello World" .. and surprise, it wasn't easy and in the end just stops.

1 more reply

dragontamer2y ago· 2 in thread

There's a degree of GPU-style going on here, but its not OpenGL or DirectX.

  for y in 0..height {
    for x in 0..width {

      // Get target position
      let tx = x + offset;
      let ty = y;

So this code, in a language I'm not too familiar with, is clearly a GPU concept. Except, this 2-dimensional for-loop is executed in parallel on modern GPUs in the so-called pixel-shader.

A Pixel-shader is all sorts of complications in practice that deserves at least a few days of studying the rendering pipeline to understand. But the tl;dr is that a pixel-shader launches a thread (erm... a SIMD-lane? A... work-item? A shader?) per pixel, and then the device drivers do some magic to group them together.

Like, in the raw hardware, pixel0-0 is going to be rendered at the same time as pixel0-1, pixel0-2, etc. etc. And the values inside of this "for loop" are the code that runs it all.

Sure its SIMD and all kinds of complicated to fully describe what's going on here. But the bulk of GPU-programming (or at least, for pixel shaders), is recognizing the one-thread-per-pixel (erm, SIMD-lane per pixel) approach.

------------------

Anyway, I think this post is... GPU-enough. I'm not sure if this truly executes on a GPU given how the code was written. But I'd give it my stamp of approval as far as "Describing code as if it were being done on a GPU", even if they're cheating for simplicity in many spots.

The #1 most important part is that the "rasterize" routine is written in the embarrassingly parallel mindset. Every pixel "could" in theory, be processed in parallel. (Notice that no pixels have race-conditions or locks, or sequencing needed with each other).

And the #2 part is having the "sequential" CPU-code logically and seamlessly communicate with the "embarrassingly parallel" rasterize routine in a simple, logical, and readable manner. And this post absolutely accomplishes that.

Its harder to write this cleanly than it looks. But having someone show you, as per this post, how it is done helps with the learning process.

pjmlp2y ago

It is a Rust application making use of wgpu, Rust's WebGPU native library.

dragontamer2y ago

Nope.

Pixel shaders in WebGPU / wgpu are written in WGSL. The above 2-dimensional for-loop is _NOT_ a proper pixel shader (but it is written in a "Pixel Shader style", very familiar to any GPU programmer).

2 more replies

raytopia2y ago

Great parody of WebGPU and other low level graphics apis.

runetech2y ago

If nothing else, I am grateful for the introduction to Selah Sue (music that plays when you press, well.. the play symbol in the top animation).

Spectacular vibe! Combined with the fullscreen animation is almost reminiscent of the demo-scene. I enjoyed the rest of the actual web page much more after that.

I salute thee whoever made this. Much appreciated!

j / k navigate · click thread line to collapse

42 comments

19 comments · 5 top-level

JonChesterfield2y ago· 6 in thread

pjmlp2y ago

CUDA already does printf, and C++20 support, minus modules.

JonChesterfield2y ago

I think some pieces of libc++ work but don't know of any testing or documentation effort to track what parts, nor of any explicit handling in the source tree.

2 more replies

krackers2y ago

What does that do under the hood though? What does it mean to execute puts from a GPU?

JonChesterfield2y ago

Libc on x64 is roughly a bunch of userspace code over syscall which traps into the kernel. Looks like a function that takes six integer registers and writes results to some of those same registers.

KeplerBoy2y ago

Do you happen to have a link to these developments?

JonChesterfield2y ago

Documentation is lagging reality a bit, we'll probably fix that around the next llvm release. Some information is at https://libc.llvm.org/gpu/using.html

2 more replies

hutzlibu2y ago· 6 in thread

57 created objects later

"Hm. Damn"

Well .. there is a reason it is usually "hello triangle" on GPU tutorials. Spoiler alert, GPUs ain't easy.

pjmlp2y ago

3D APIs were easier back in the yearly days.

Nowadays using "legacy" APIs is relatively easy, however it requires a background knowledge on how GL became GL 4.6, DX became DX 11 and so.

kimixa2y ago

What's wrong with using middleware engines? That was pretty much the expected case for people who didn't /need/ the level of control exposed by APIs like vulkan or dx12.

1 more reply

MaxBarraclough2y ago

> now everyone has to deal with such complexity directly, or make use of a middleware engine instead

Similarly, in Windows land, you aren't forced to use D3D12, you can still use D3D11 or even D3D9.

1 more reply

raytopia2y ago

Well if you use glBegin it's pretty easy.

glBegin(GL_TRIANGLES);

  glVertex3f( 0.0f, 1.0f, 0.0f);              
  glVertex3f(-1.0f,-1.0f, 0.0f);              
  glVertex3f( 1.0f,-1.0f, 0.0f);

glEnd();

And there you go you got a triangle.

It's great for beginners because they can see the results very fast and once they want to start having crazy graphical effects or need more performance you can move to shaders.

jacquesm2y ago

The hardware was designed to display triangles, not to do 'Hello world'.

hutzlibu2y ago

Erm yes. A triangle is quite easy .. but here they tried a simple tutorial to actually print "Hello World" .. and surprise, it wasn't easy and in the end just stops.

1 more reply

dragontamer2y ago· 2 in thread

There's a degree of GPU-style going on here, but its not OpenGL or DirectX.

  for y in 0..height {
    for x in 0..width {

      // Get target position
      let tx = x + offset;
      let ty = y;

So this code, in a language I'm not too familiar with, is clearly a GPU concept. Except, this 2-dimensional for-loop is executed in parallel on modern GPUs in the so-called pixel-shader.

Like, in the raw hardware, pixel0-0 is going to be rendered at the same time as pixel0-1, pixel0-2, etc. etc. And the values inside of this "for loop" are the code that runs it all.

------------------

Its harder to write this cleanly than it looks. But having someone show you, as per this post, how it is done helps with the learning process.

pjmlp2y ago

It is a Rust application making use of wgpu, Rust's WebGPU native library.

dragontamer2y ago

Nope.

Pixel shaders in WebGPU / wgpu are written in WGSL. The above 2-dimensional for-loop is _NOT_ a proper pixel shader (but it is written in a "Pixel Shader style", very familiar to any GPU programmer).

2 more replies

raytopia2y ago

Great parody of WebGPU and other low level graphics apis.

runetech2y ago

If nothing else, I am grateful for the introduction to Selah Sue (music that plays when you press, well.. the play symbol in the top animation).

Spectacular vibe! Combined with the fullscreen animation is almost reminiscent of the demo-scene. I enjoyed the rest of the actual web page much more after that.

I salute thee whoever made this. Much appreciated!

j / k navigate · click thread line to collapse