These days there are great resources for going deep on this topic. The CUDA MODE community is particularly great, both its lecture video series and its PMPP (Programming Massively Parallel Processors) reading group.
Alternatively, are there any other places that discuss the same topics, including some code? I could only find equivalent discussions with code in the Pyro docs and in volume 2 of Kevin Murphy's book. But those are sparser, since they also cover many other topics.
https://github.com/srush/Triton-puzzles
https://github.com/srush/tensor-puzzles
https://github.com/srush/autodiff-puzzles
https://github.com/srush/transformer-puzzles
https://github.com/srush/GPTworld
https://github.com/srush/LLM-Training-Puzzles
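To give a taste of the style: the tensor puzzles have you rebuild common operations using only arange and broadcasting. Here is my own sketch in that spirit (not a puzzle taken from the repo):

```python
import numpy as np

# Sketch in the tensor-puzzles spirit (my own example, not from the
# repo): build a 2D result from 1D inputs using only broadcasting.
def outer_sum(a, b):
    # (n, 1) + (1, m) broadcasts to (n, m), so result[i, j] == a[i] + b[j]
    return a[:, None] + b[None, :]

print(outer_sum(np.arange(3), np.arange(4)))
```

The fun of the puzzles is that loop-free constraints like this force you into the broadcast-everything mindset that GPU code rewards.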
Not sure if your comment is meant to discourage someone from going through this.
https://gfxcourses.stanford.edu/cs149/fall23/lecture/datapar...
1. instead of buildings, IP cores doing processing steps;
2. instead of belts, wires — which take up far less than one tile, so many can run together along one tile and many can connect to a single IP core; where each wire can move its contents at arbitrary speed (including "stopped") — but where this will have a power-use cost proportional to the wire's speed;
3. an overall goal of optimizing for rocket launches per second per power-usage watt. (Which should overall require minimizing the amount of stuff moving around across the whole base; avoiding pipeline stalls; doing as much parallel batching as possible; etc.)
(Yes, I know Shenzhen I/O exists. It's great for what it does — modelling signals and signal transformations — but it doesn't model individual packets of data as moving along wires with propagation delay, and with the potential for e.g. parallel-line interference given a bad encoding scheme, quantum tunnelling, overclocking or undervolting components, etc. I think a Factorio-variant would actually be much more flexible to implement these aspects.)
https://twitter.com/srush_nlp/status/1719376959572980094
Here is an amazing in-browser implementation in WebGPU
# FILL ME IN (roughly 2 lines)
if local_i < size and local_j < size:
    out[local_i][local_j] = a[local_i][local_j] + 10
Results in a failed assertion: AssertionError: Wrong number of indices
But the test cell beneath it will still pass?

Is the usual strategy to worry less about repeating calculations and just use brute force to tackle the problem?
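Regarding the AssertionError above: my guess (an assumption based on the snippet, not something I've verified against the notebook) is that the puzzle's instrumented arrays only accept a single tuple index, a[local_i, local_j], and reject chained indexing like a[local_i][local_j], which would explain "Wrong number of indices". A plain-numpy sketch of what the corrected fill-in computes:

```python
import numpy as np

# Plain-numpy sketch of the corrected fill-in. The key change from the
# failing snippet is tuple indexing (a[i, j]) rather than chained
# indexing (a[i][j]); the guess is that the puzzle framework's wrapped
# arrays only accept the tuple form.
size = 2
a = np.arange(size * size, dtype=np.float32).reshape(size, size)
out = np.zeros_like(a)

for local_i in range(size):      # stand-ins for the per-thread indices
    for local_j in range(size):
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10
```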
Is there a good resource to read about how to tackle problems in an extremely parallel way?
It is true that you don’t have to worry as much about repeating calculations. I think you’re referring to “rematerialization”, meaning after doing some non-trivial calculation once and using the result, throwing it away and redoing the same calculation again later on the same thread. It’s true this can sometimes be advantageous, mostly because memory use is so expensive. One load or store into VRAM can be as expensive as 10 or sometimes even 100 math instructions, so if your store & load takes 40 cycles, and recomputing something takes 25 cycles of math using registers, then recomputing can be faster.
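The arithmetic above can be restated as a toy cost model. The cycle counts are the illustrative numbers from the comment, not measurements of any real GPU:

```python
# Toy cost model for the rematerialization trade-off: for one extra use
# of a value, compare spilling it (one store + one load through VRAM)
# with recomputing it from registers. Numbers are illustrative only.
STORE_PLUS_LOAD = 40   # VRAM round trip for the intermediate value
RECOMPUTE = 25         # redoing the math using registers

def better_to_recompute(spill=STORE_PLUS_LOAD, recompute=RECOMPUTE):
    return recompute < spill

print(better_to_recompute())  # -> True: rematerialization wins here
```

The same comparison flips once the value is expensive enough to compute relative to the memory round trip, which is why rematerialization is a judgment call rather than a rule.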
I second the sibling recommendation to learn numpy, it’s a different way of thinking than single-threaded functional programming with lists & maps. Try writing some kind of image filter in Python both ways, and get a feel for the performance difference. If you’re familiar with Python, this is a one or two hour exercise. Last time I tried it, my numpy version was ~2 orders of magnitude faster than the lists & maps version.
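As a concrete version of that exercise, here is a trivial brightness filter written both ways (the filter itself is just an example; any per-pixel operation shows the same gap, and the vectorized form is the one that scales):

```python
import numpy as np

# The same per-pixel brightness filter, written with Python lists and
# with vectorized numpy. Both clamp at 255; the numpy version does the
# whole image in a handful of array ops instead of a Python-level loop.
def brighten_lists(img, delta):
    return [[min(p + delta, 255) for p in row] for row in img]

def brighten_numpy(img, delta):
    return np.minimum(img.astype(np.int32) + delta, 255)

img = [[0, 100, 200], [50, 150, 250]]
assert (brighten_numpy(np.array(img), 30)
        == np.array(brighten_lists(img, 30))).all()
```

Timing the two on a realistically sized image (say 4000x3000) is the instructive part; the list version pays Python interpreter overhead per pixel.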
One of the most fun ways to learn SIMD programming, in my humble opinion, is to study the shaders on ShaderToy. ShaderToy makes it super simple to write GPU code and see the result. Some of the tricks people use are very clever, but after studying them for a while and trying a few yourself, you’ll start to see themes emerge about how to organize parallel image computations.
Pools are in dire need of CUDA developers.
Pools have money; if they need CUDA engineers, they are fully capable of hiring them at the industry rate.