(def kernel-source
  "extern \"C\"
   __global__ void increment (int n, float *a) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) {
       a[i] = a[i] + 1.0f;
     }
   };")
Is there an example where the kernel is written in Clojure?
The library you linked doesn't compile Clojure code to run on the GPU. It's basically just an FFI wrapper that passes C/C++ kernels to CUDA.
Really hoping we see some movement on CLArrays in the near future.
I think the reason CLArrays.jl only claims experimental support on v0.6 is because it uses Transpiler.jl which is quite limited and a direct translation. CUDANative.jl is much more advanced and uses LLVM's .ptx backend. I think Simon mentioned he wanted to do a more direct compilation route, and that would probably be the change to start calling it experimental? I don't know, I'd like to hear about the future of CLArrays as well. If we have both working strong, boy oh boy I'll be happy.
> For this article I'm going to choose CuArrays, since this article is written for Julia 0.7 / 1.0, which still isn't supported by CLArrays.
OpenCL has been improved since then, but now it is too late.
GPU threads got even more independent in NVIDIA's Volta architecture; search for "Volta ITS" (Independent Thread Scheduling) for details.
I'm happy that Julia supports GPU programming for simple code, but I don't see how you can run algorithms with inter-thread communication.
The thing to note about GPU programming is that the vast majority of overhead comes from data transfer. Sometimes it is faster overall to do the computation on the CPU if your data set and results are very large, even if the GPU performs each calculation faster on average due to parallelism. To illustrate, look at the benchmarks from gpgpu.js running a simple kernel:
CPU: 6851.25ms
GPU Total: 1449.29ms
GPU Execution: 30.64ms
GPU IO: 1418.65ms
Theoretical Speedup: 223.59x
Actual Speedup: 4.73x
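The two speedup figures can be reproduced from the three timings above. The following is a minimal Python sketch; the function names are made up for illustration, and `amortized_speedup` is a hypothetical model (not something gpgpu.js reports) that assumes the transfer cost is paid once while several equally expensive kernels run on the same data.

```python
# Timings copied from the gpgpu.js benchmark quoted above (milliseconds).
CPU_MS = 6851.25
GPU_EXEC_MS = 30.64
GPU_IO_MS = 1418.65


def speedups(cpu_ms, gpu_exec_ms, gpu_io_ms):
    """Return (theoretical, actual) speedup of the GPU over the CPU."""
    theoretical = cpu_ms / gpu_exec_ms           # ignores host<->device transfer
    actual = cpu_ms / (gpu_exec_ms + gpu_io_ms)  # includes transfer
    return theoretical, actual


def amortized_speedup(cpu_ms, gpu_exec_ms, gpu_io_ms, kernels):
    """Hypothetical model: `kernels` passes over data that is transferred once.

    Assumes the transfer cost is paid a single time and that every pass costs
    the same as the measured one, on both the CPU and the GPU.
    """
    return (kernels * cpu_ms) / (gpu_io_ms + kernels * gpu_exec_ms)


if __name__ == "__main__":
    theoretical, actual = speedups(CPU_MS, GPU_EXEC_MS, GPU_IO_MS)
    print(f"theoretical: {theoretical:.2f}x, actual: {actual:.2f}x")
    for k in (1, 10, 100):
        s = amortized_speedup(CPU_MS, GPU_EXEC_MS, GPU_IO_MS, k)
        print(f"{k:>3} kernels per transfer: {s:.1f}x")
```

Under that assumption the speedup climbs from the "actual" figure toward the "theoretical" one as more work is done per transfer, but never reaches it.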
The theoretical speedup excludes data transfer while actual speedup includes it. The longer you can keep your data set on the GPU to do more calculations (avoiding back and forth IO), the bigger your net speed gains are.
I remember OpenCL for Firefox being discussed and discarded in favor of compute shaders about 7 years ago, and then when WebGL 2.0 was finally released 5 years later, compute shaders were not part of it.
Additionally, SIMD.js was developed and then killed again, and support for SIMD in WebAssembly has been delayed, so I don't believe that we'll be able to implement efficient numerical methods in the browser any time soon.
If you've played with compute shaders (or any of the modern "general purpose" shader stuff, i.e. arbitrary loads and stores, etc.), you probably know that it's quite easy to crash drivers. Of course, you generally do so by provoking some form of UB (although not always; their runtimes and compilers are far from bug-free).
But WebGL can't have that, so I don't see how they could pull that off without adding a ton of runtime bounds checks to shaders, like they do for index buffers right now, but on the GPU side this time.
Not only would that be bad for performance, but I still would never trust this whole stack to run arbitrary code.
However, it doesn't win many hearts outside HPC nowadays, even with the latest standard revisions, and I was thinking more about mainstream adoption.
There are also Accelerate, CUDA4J, ClojureCUDA, Hybridizer and Alea, but I am not sure about their adoption at large.
import "lib/github.com/diku-dk/complex/complex"

module c32 = mk_complex f32
type c32 = c32.complex

let juliaset (maxiter: i32) (z0: c32) =
  let c = c32.mk (-0.5) 0.75
  let (_, i) = loop (z, i) = (z0, 0) while i < maxiter && c32.mag z <= 4 do
    (c32.(z * z + c), i+1)
  in i

let main (n: i32) (maxiter: i32) =
  let N = 2**n
  let (w, h) = (N, N)
  let q = tabulate_2d w h (\i j -> let i = 1 - r32 i*(2/r32 w)
                                   let j = -1.5 + r32 j*(3/r32 h)
                                   in c32.mk i j)
  in map (map (juliaset maxiter)) q
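For readers who don't know Futhark, the program above reads roughly like the following sequential Python. This is only a sketch for following the logic, not a translation anyone benchmarked: it assumes `c32.mag` is the complex magnitude, and it uses Python's double-precision `complex` where the original uses `f32`.

```python
def juliaset(maxiter, z0):
    """Count iterations of z -> z*z + c until |z| > 4 or maxiter is reached."""
    c = complex(-0.5, 0.75)
    z, i = z0, 0
    while i < maxiter and abs(z) <= 4:
        z, i = z * z + c, i + 1
    return i


def main(n, maxiter):
    """Evaluate juliaset on a 2**n by 2**n grid of starting points."""
    size = 2 ** n
    w, h = size, size
    return [
        [juliaset(maxiter, complex(1 - i * (2 / w), -1.5 + j * (3 / h)))
         for j in range(h)]
        for i in range(w)
    ]


if __name__ == "__main__":
    # Small grid only; the pure-Python loop is far too slow for 2**12.
    for row in main(3, 16):
        print(" ".join(f"{v:2d}" for v in row))
```

Every grid point is independent of the others, which is why the nested `map` in the Futhark version parallelizes so well on a GPU.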
I don't have the tools or skills to easily generate nice graphs, but I get about a 271x speedup when this code is compiled to OpenCL and run on a Vega 64 GPU, versus when it is compiled to C. The C code runs in 272ms, which is very close to the number reported for Julia here[0]. (I assume that the page has a typo and that the N=2²⁴ column actually means N=2¹², because an N=2²⁴ image would take up dozens of TiB in GPU memory. Also, the benchmark code linked only goes up to N=2¹².)
Unfortunately, if I change maxiter to 256, the speedup actually drops, to 182x. So much for that theory.
Edit: also tried on an NVIDIA GTX 780Ti, and the results are the same. My new theory is that the Julia code also counts the cost of moving the resulting array back to the host (which can easily be dominant for a kernel that runs for just a millisecond or two).
[0]: https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/res...
Every time I read about Julia, I’m amazed. What a game changing tool.