An Introduction to GPU Programming in Julia

(nextjournal.com)

290 points · simondanisch · 7y ago · 46 comments

currymj · 7y ago · 5 in thread

While having a Torch-esque GPU ndarray is great, the ability to easily write your own kernels without having to compile gnarly C++ code is what sets Julia apart from competitors IMO. Not sure if there's any other dynamic language offering anything like this.

schmudde · 7y ago

ClojureCL has already been mentioned, but I found the author's interview on the Defn podcast particularly enlightening: https://soundcloud.com/defn-771544745/defn-24-v3

dragandj · 7y ago

ClojureCL and ClojureCUDA have been doing it for 4 years now.

https://clojurecl.uncomplicate.org

https://clojurecuda.uncomplicate.org

KenoFischer · 7y ago

Am I missing something? I was excited about other languages doing the same as Julia, but the first example for ClojureCUDA has

    (def kernel-source
      "extern \"C\"
       __global__ void increment (int n, float *a) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) {
           a[i] = a[i] + 1.0f;
         }
       };")

Is there an example where the kernel is written in Clojure?

1 more reply

pjmlp · 7y ago

"Parallel computations on the GPU with CUDA in Clojure"

https://clojurecuda.uncomplicate.org/

imtringued · 7y ago

The article is about compiling a restricted subset of Julia code to run on a GPU and accelerate your project without leaving the language.

The library you linked doesn't compile Clojure code to run on the GPU. It's basically just an FFI wrapper to pass C/C++ kernels to CUDA.

1 more reply

eigenspace · 7y ago · 5 in thread

It seems kinda weird to tout how great it is that we have CuArrays and CLArrays when CLArrays hasn't been updated for 1.0 and only claims experimental support for 0.6.

Really hoping we see some movement on CLArrays in the near future.

ChrisRackauckas · 7y ago

I want to see it updated too. It was really nice to write/test/debug GPU codes on my cheap laptop without an NVIDIA GPU, and then switch over to my desktop and CUDA with single line changes. Even though CuArrays tends to be quite a bit faster, this convenience cannot be overstated. I didn't realize how much I'd miss it until I was at a conference and couldn't just fiddle around with some GPU integration I had to debug.

I think the reason CLArrays.jl only claims experimental support on v0.6 is that it uses Transpiler.jl, which is quite limited and does a direct translation. CUDAnative.jl is much more advanced and uses LLVM's PTX backend. I think Simon mentioned he wanted to do a more direct compilation route, and that would probably be the change to start calling it experimental? I don't know, I'd like to hear about the future of CLArrays as well. If we have both working strong, boy oh boy I'll be happy.

keldaris · 7y ago

It is clearly mentioned in a separate paragraph which reads:

> For this article I'm going to choose CuArrays, since this article is written for Julia 0.7 / 1.0, which still isn't supported by CLArrays.

eigenspace · 7y ago

My bad, I should really finish reading things before I comment.

boromi · 7y ago

sadly OpenCL seems to always get ignored

pjmlp · 7y ago

They did that to themselves by sticking originally with C, and forcing the whole read text, compile, link process at runtime, instead of the multiple language support from CUDA.

OpenCL has been improved since then, but now it is too late.

IshKebab · 7y ago · 5 in thread

It doesn't really describe the fundamental difference between a GPU and a 4000-core CPU, which is that the GPU has a shared program counter. All the cores must execute the same instruction at each cycle.

twtw · 7y ago

The conceptually independent threads are executed on 32-wide SIMD cores. "All the cores" within a warp/wavefront must execute the same instruction each cycle.

GPU threads got even more independent in NVIDIA's Volta architecture; search for Volta ITS (Independent Thread Scheduling) for details.

why_only_15 · 7y ago

Are you sure? I'm pretty sure I can branch on core-specific measures like (in CUDA) threadIdx and blockIdx.

gpuhacker · 7y ago

It's true that the programming model allows that, but underneath, all threads within the warp will execute the same instructions. However, if there is a branch, some threads can be predicated so their instructions have no effect. This is called warp divergence and can become a performance issue. If possible, branch only on threadIdx in multiples of the warp size, so that every thread in a warp takes the same path. There's a cool slide deck on implementing a parallel sum algorithm that explains this really well.
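
To make the cost of divergence concrete, here is a minimal CPU-side sketch in Python. It is not GPU code and not how any driver schedules work; it only counts, under the simplified model described above, how many branch sides a 32-thread warp has to issue for an if/else. The names `warp_passes` and `total_passes` are made up for this illustration.

```python
WARP_SIZE = 32  # threads per warp, as in the comments above (assumption of this toy model)

def warp_passes(branch_taken):
    """Count the branch sides one warp must issue for a single if/else.

    branch_taken[i] is True if lane i takes the 'if' side. When all lanes
    agree, only one side is issued; otherwise both sides are issued and the
    lanes on the other side are masked off (predicated).
    """
    assert len(branch_taken) == WARP_SIZE
    return len(set(branch_taken))  # 1 = uniform warp, 2 = divergent warp

def total_passes(num_threads, predicate):
    """Sum the passes over all warps for a branch decided by predicate(thread_id)."""
    total = 0
    for start in range(0, num_threads, WARP_SIZE):
        lanes = [predicate(t) for t in range(start, start + WARP_SIZE)]
        total += warp_passes(lanes)
    return total

# Branching on odd/even thread index splits every warp in two.
divergent = total_passes(128, lambda t: t % 2 == 0)

# Branching on a multiple of the warp size keeps each warp uniform.
uniform = total_passes(128, lambda t: (t // WARP_SIZE) % 2 == 0)

print(divergent, uniform)
```

In this model the odd/even branch doubles the issued work for every warp, while the warp-aligned branch costs nothing extra. Real hardware is more subtle, but the ratio shows why the alignment advice matters.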

1 more reply

0-_-0 · 7y ago

Yes but can you do that from Julia without writing CUDA/OpenCL code?

I'm happy that Julia supports GPU programming for simple code, but I don't see how you can run algorithms with inter-thread communication.

2 more replies

llukas · 7y ago

On the NVIDIA Volta architecture, each thread has its own program counter.

daenz · 7y ago · 4 in thread

GPGPU (general-purpose GPU) programming is pretty cool. I wrote a utility to let you do it in JavaScript, in the browser, a while back: https://github.com/amoffat/gpgpu.js

The thing to note about GPU programming is that the vast majority of overhead comes from data transfer. Sometimes it is net faster to do the computation on the CPU if your data set and results are very large, even if the GPU performs each calculation faster on average due to parallelism. To illustrate, look at the benchmarks on gpgpu.js running a simple kernel:

  CPU: 6851.25ms
  GPU Total: 1449.29ms
  GPU Execution: 30.64ms
  GPU IO: 1418.65ms
  Theoretical Speedup: 223.59x
  Actual Speedup: 4.73x

The theoretical speedup excludes data transfer while actual speedup includes it. The longer you can keep your data set on the GPU to do more calculations (avoiding back and forth IO), the bigger your net speed gains are.
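
As a rough check of those figures, the sketch below recomputes both speedups from the timings quoted above and then estimates what happens when several kernels share one round trip of IO. The `amortized_speedup` helper and its assumptions (every kernel costs the same, and the transfer cost is paid once) are illustrative only and are not part of gpgpu.js.

```python
# Timings in milliseconds, copied from the benchmark output above.
cpu_ms = 6851.25
gpu_exec_ms = 30.64
gpu_io_ms = 1418.65

def speedups(cpu_ms, gpu_exec_ms, gpu_io_ms):
    """Return (theoretical, actual) speedup of the GPU run over the CPU run."""
    theoretical = cpu_ms / gpu_exec_ms            # ignores data transfer
    actual = cpu_ms / (gpu_exec_ms + gpu_io_ms)   # includes data transfer
    return theoretical, actual

def amortized_speedup(cpu_ms, gpu_exec_ms, gpu_io_ms, kernels):
    """Speedup when `kernels` equal-cost computations share one IO round trip.

    Assumption for illustration: the data stays on the GPU between kernels,
    so the transfer cost is paid once rather than per kernel.
    """
    return (kernels * cpu_ms) / (kernels * gpu_exec_ms + gpu_io_ms)

theoretical, actual = speedups(cpu_ms, gpu_exec_ms, gpu_io_ms)
print(f"theoretical: {theoretical:.1f}x, actual: {actual:.2f}x")

for kernels in (1, 10, 100):
    gain = amortized_speedup(cpu_ms, gpu_exec_ms, gpu_io_ms, kernels)
    print(f"{kernels:>3} kernels per transfer: {gain:.1f}x")
```

The recomputed values come out at roughly 224x and 4.7x, matching the reported numbers up to rounding, and the amortized figure climbs toward the theoretical one as more work is done per transfer.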

johndough · 7y ago

Note that `gl.finish` is often implemented as non-blocking in WebGL, so measuring query submission time on the host (cpu) does not necessarily reflect execution time on the device (gpu). It is more accurate to use timer queries. For more info see: https://stackoverflow.com/questions/20798294/is-it-possible-...

pjmlp · 7y ago

WebGL compute shaders are being discussed, it might be a possible improvement to your library when they finally get adopted.

johndough · 7y ago

Is there any news on WebGL compute shaders?

I remember OpenCL for Firefox being discussed and discarded in favor of compute shaders about 7 years ago, and then when WebGL 2.0 was finally released 5 years later, compute shaders were not part of it.

Additionally, SIMD.js was developed and then killed again, and support for SIMD in WebAssembly has been delayed, so I don't believe that we'll be able to implement efficient numerical methods in the browser any time soon.

2 more replies

obl · 7y ago

I wonder how they are going to pull that off without security implications.

If you've played with compute shaders (or any of the modern "general purpose" shader stuff, i.e. arbitrary loads and stores etc.) you probably know that it's quite easy to crash drivers. Of course, you generally do so by provoking some form of UB (although not always; their runtimes/compilers are far from bug-free).

But WebGL can't have that, so I don't see how they could pull that off without adding a ton of runtime bounds checks to shaders, like they do for index buffers right now but on the GPU side this time.

Not only would that be bad for performance, but I still would never trust this whole stack to run arbitrary code.

1 more reply

pjmlp · 7y ago · 3 in thread

Love it! So much more fun than being stuck with C derived languages for GPGPU programming.

tanderson92 · 7y ago

You can do GPGPU programming with Fortran, which is definitely not C-derived, predating C by about 15 years.

3rdAccount · 7y ago

I find Fortran pretty pleasant for scientific work (what it was designed for) and take pleasure in knowing it hasn't changed a whole lot over the years (F77, F90, etc.).

pjmlp · 7y ago

Sure, and it is a good alternative that I forgot about, given that numerical computing is its strong point anyway.

However it doesn't win many hearts outside HPC nowadays even with the latest standard revisions, and I was thinking more about mainstream adoption.

There are also Accelerate, CUDA4J, ClojureCUDA, Hybridizer and Alea, but I am not sure about their adoption at large.

Athas · 7y ago · 2 in thread

I'm a bit surprised to see that GPU Mandelbrot is only at best x75 faster than (sequential?) CPU. Does Julia just generate really fast (multicore/vectorized?) CPU code? Does it also count communication costs? Fractal computations like that are extremely GPU friendly because they involve no memory accesses at all, except for writing the final result. I would expect at least two orders of magnitude improvement over a straightforwardly written C implementation.

Athas · 7y ago

I had a suspicion that the problem was that maxiters was set fairly low (16). For the Mandelbrot sets I'm used to, this would result in relatively little computation (over a hundred iterations is more common). To investigate this hunch, I wrote a Futhark program as a proxy for a hand-written GPU implementation (didn't quite have the patience for that this evening):

    import "lib/github.com/diku-dk/complex/complex"

    module c32 = mk_complex f32
    type c32 = c32.complex

    let juliaset (maxiter: i32) (z0: c32) =
      let c = c32.mk (-0.5) 0.75
      let (_, i) = loop (z, i) = (z0, 0) while i < maxiter && c32.mag z <= 4 do
                   (c32.(z * z + c), i+1)
      in i

    let main (n: i32) (maxiter: i32) =
      let N = 2**n
      let (w, h) = (N, N)
      let q = tabulate_2d w h (\i j -> let i = 1 - r32 i*(2/r32 w)
                                       let j = -1.5 + r32 j*(3/r32 h)
                                       in c32.mk i j)
      in map (map (juliaset maxiter)) q
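
For readers who don't know Futhark, the following is a plain-Python transcription of the program above, meant only as a readable sequential reference and not as a benchmark. It assumes that `c32.mag` is the complex magnitude (so `abs` is used here) and that `r32` converts an integer to a float; both are readings of the code above rather than something checked against the library.

```python
def juliaset(maxiter, z0):
    """Escape-time iteration z -> z*z + c, returning the iteration count."""
    c = complex(-0.5, 0.75)
    z, i = z0, 0
    # Same loop condition as the Futhark version: stop at maxiter or once |z| > 4.
    while i < maxiter and abs(z) <= 4:
        z, i = z * z + c, i + 1
    return i

def main(n, maxiter):
    """Evaluate juliaset on a 2**n by 2**n grid, mirroring tabulate_2d above."""
    size = 2 ** n
    w, h = size, size
    return [
        [juliaset(maxiter, complex(1 - i * (2 / w), -1.5 + j * (3 / h)))
         for j in range(h)]
        for i in range(w)
    ]

if __name__ == "__main__":
    grid = main(3, 16)  # tiny 8x8 grid, just to show the shape of the result
    for row in grid:
        print(" ".join(f"{count:2d}" for count in row))
```

Every grid point is computed independently and nothing is read from memory inside the loop, which is why this kind of computation maps so well onto a GPU.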

I don't have the tools or skills to easily generate nice graphs, but I get about a x271 speedup when this code is compiled to OpenCL and run on a Vega 64 GPU, versus when it is compiled to C. The C code runs in 272ms, which is very close to the number reported for Julia here[0] (I assume that the page has a typo and that the N=2²⁴ column actually means N=2¹², because an N=2²⁴ image has 2⁴⁸ pixels and would take up hundreds of TiB in GPU memory even at one byte per pixel. Also, the benchmark code linked only goes up to N=2¹².)

Unfortunately, if I change maxiters to 256, the speedup actually drops, to x182. So much for that theory.

Edit: also tried on an NVIDIA GTX 780Ti, and the results are the same. My new theory is that the Julia code also counts the cost of moving the resulting array back to the host (which can easily be dominant for a kernel that runs for just a millisecond or two).

[0]: https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/res...

ummonk · 7y ago

It does autovectorize quite often.

eghad · 7y ago · 1 in thread

If anyone wants to try out a free GPU using Google Colab/Jupyter (a K80; you might run into RAM allocation issues if you're not one of the lucky users who get to use the full amount), here's a quick guide to get a Julia kernel up and running: https://discourse.julialang.org/t/julia-on-google-colab-free...

simondanisch (OP) · 7y ago

Or just directly edit the article ;) That will also give you a K80, and you can run the code directly inside the article.

maxbrunsfeld · 7y ago

> GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of the neuronal network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort - and makes Flux GPU support work efficiently even for user defined functions. That this works out of the box without coordination between GPUArrays + Flux is a pretty unique property of Julia

Every time I read about Julia, I’m amazed. What a game changing tool.
