These days there are great resources for going deep on this topic. The CUDA MODE community is particularly great, both its lecture video series and its PMPP (Programming Massively Parallel Processors) reading group.
Alternatively, are there any other places that discuss the same topics, including some code? I could only find equivalent discussions with code in the Pyro docs and in volume 2 of Kevin Murphy's book. But those are sparser, since they also cover many other topics.
https://github.com/srush/Triton-puzzles
https://github.com/srush/tensor-puzzles
https://github.com/srush/autodiff-puzzles
https://github.com/srush/transformer-puzzles
https://github.com/srush/GPTworld
https://github.com/srush/LLM-Training-Puzzles
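To give a taste of the style: the tensor puzzles have you rebuild common operations using only arange and broadcasting. Here is my own sketch in that spirit (not a puzzle taken from the repo):

```python
import numpy as np

# Sketch in the tensor-puzzles spirit (my own example, not from the
# repo): build a 2D result from 1D inputs using only broadcasting.
def outer_sum(a, b):
    # (n, 1) + (1, m) broadcasts to (n, m), so result[i, j] == a[i] + b[j]
    return a[:, None] + b[None, :]

print(outer_sum(np.arange(3), np.arange(4)))
```

The fun of the puzzles is that loop-free constraints like this force you into the broadcast-everything mindset that GPU code rewards.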
Not sure if your comment is meant to discourage someone from going through this.
https://gfxcourses.stanford.edu/cs149/fall23/lecture/datapar...
1. instead of buildings, IP cores doing processing steps;
2. instead of belts, wires — which take up far less than one tile, so many can run together along one tile and many can connect to a single IP core; where each wire can move its contents at arbitrary speed (including "stopped") — but where this will have a power-use cost proportional to the wire's speed;
3. an overall goal of optimizing for rocket launches per second per power-usage watt. (Which should overall require minimizing the amount of stuff moving around across the whole base; avoiding pipeline stalls; doing as much parallel batching as possible; etc.)
(Yes, I know Shenzhen I/O exists. It's great for what it does — modelling signals and signal transformations — but it doesn't model individual packets of data as moving along wires with propagation delay, and with the potential for e.g. parallel-line interference given a bad encoding scheme, quantum tunnelling, overclocking or undervolting components, etc. I think a Factorio-variant would actually be much more flexible to implement these aspects.)
https://twitter.com/srush_nlp/status/1719376959572980094
Here is an amazing in-browser implementation in WebGPU
# FILL ME IN (roughly 2 lines)
if local_i < size and local_j < size:
    out[local_i][local_j] = a[local_i][local_j] + 10
Results in a failed assertion: AssertionError: Wrong number of indices
But the test cell beneath it will still pass?

Is the usual strategy to worry less about repeating calculations and just use brute force to tackle the problem?
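Regarding the AssertionError above: my guess (an assumption based on the snippet, not something I've verified against the notebook) is that the puzzle's instrumented arrays only accept a single tuple index, a[local_i, local_j], and reject chained indexing like a[local_i][local_j], which would explain "Wrong number of indices". A plain-numpy sketch of what the corrected fill-in computes:

```python
import numpy as np

# Plain-numpy sketch of the corrected fill-in. The key change from the
# failing snippet is tuple indexing (a[i, j]) rather than chained
# indexing (a[i][j]); the guess is that the puzzle framework's wrapped
# arrays only accept the tuple form.
size = 2
a = np.arange(size * size, dtype=np.float32).reshape(size, size)
out = np.zeros_like(a)

for local_i in range(size):      # stand-ins for the per-thread indices
    for local_j in range(size):
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10
```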
Is there a good resource to read about how to tackle problems in an extremely parallel way?
It is true that you don’t have to worry as much about repeating calculations. I think you’re referring to “rematerialization”, meaning after doing some non-trivial calculation once and using the result, throwing it away and redoing the same calculation again later on the same thread. It’s true this can sometimes be advantageous, mostly because memory use is so expensive. One load or store into VRAM can be as expensive as 10 or sometimes even 100 math instructions, so if your store & load takes 40 cycles, and recomputing something takes 25 cycles of math using registers, then recomputing can be faster.
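The arithmetic above can be restated as a toy cost model. The cycle counts are the illustrative numbers from the comment, not measurements of any real GPU:

```python
# Toy cost model for the rematerialization trade-off: for one extra use
# of a value, compare spilling it (one store + one load through VRAM)
# with recomputing it from registers. Numbers are illustrative only.
STORE_PLUS_LOAD = 40   # VRAM round trip for the intermediate value
RECOMPUTE = 25         # redoing the math using registers

def better_to_recompute(spill=STORE_PLUS_LOAD, recompute=RECOMPUTE):
    return recompute < spill

print(better_to_recompute())  # -> True: rematerialization wins here
```

The same comparison flips once the value is expensive enough to compute relative to the memory round trip, which is why rematerialization is a judgment call rather than a rule.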
I second the sibling recommendation to learn numpy, it’s a different way of thinking than single-threaded functional programming with lists & maps. Try writing some kind of image filter in Python both ways, and get a feel for the performance difference. If you’re familiar with Python, this is a one or two hour exercise. Last time I tried it, my numpy version was ~2 orders of magnitude faster than the lists & maps version.
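As a concrete version of that exercise, here is a trivial brightness filter written both ways (the filter itself is just an example; any per-pixel operation shows the same gap, and the vectorized form is the one that scales):

```python
import numpy as np

# The same per-pixel brightness filter, written with Python lists and
# with vectorized numpy. Both clamp at 255; the numpy version does the
# whole image in a handful of array ops instead of a Python-level loop.
def brighten_lists(img, delta):
    return [[min(p + delta, 255) for p in row] for row in img]

def brighten_numpy(img, delta):
    return np.minimum(img.astype(np.int32) + delta, 255)

img = [[0, 100, 200], [50, 150, 250]]
assert (brighten_numpy(np.array(img), 30)
        == np.array(brighten_lists(img, 30))).all()
```

Timing the two on a realistically sized image (say 4000x3000) is the instructive part; the list version pays Python interpreter overhead per pixel.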
One of the most fun ways to learn SIMD programming, in my humble opinion, is to study the shaders on ShaderToy. ShaderToy makes it super simple to write GPU code and see the result. Some of the tricks people use are very clever, but after studying them for a while and trying a few yourself, you’ll start to see themes emerge about how to organize parallel image computations.
Pools are in dire need of CUDA developers.
Pools have money; if they need CUDA engineers, they are fully capable of hiring them at the industry rate.