CUDA and OpenCL run time comparison using Vector Addition

(researchdaily.blogspot.com)

17 points · astroguy · 16y ago · 6 comments

wtallis · 16y ago

What exactly were you trying to accomplish with this post? You haven't particularly enlightened anybody about the relative performance characteristics of OpenCL and CUDA:

The graph shown illustrates nearly identical scaling behavior between OpenCL and CUDA, with OpenCL taking consistently about half a second longer, except at the last data point where it takes a full second longer. In other words, all we're really seeing here is initialization overhead, and without seeing the source code we can't know if this is at all a fair test. It may be that the CUDA runtime is performing more compilation work before the clock starts, or that it is caching the compiled kernels.
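One way to check the "it's all initialization overhead" reading is to fit a straight line to the timings: the intercept approximates the fixed start-up cost and the slope the per-element cost. Below is a minimal sketch in Python. The function name `fit_overhead` and every number in it are hypothetical placeholders, not values read off the article's graph.

```python
def fit_overhead(sizes, times):
    """Least-squares fit of times ~= overhead + per_item * size.

    Returns (overhead, per_item).
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_item = sxy / sxx
    overhead = mean_y - per_item * mean_x
    return overhead, per_item


# Hypothetical timings (seconds): same slope, different fixed cost.
sizes = [1e6, 2e6, 4e6, 8e6]
cuda_times = [0.2 + 1e-7 * n for n in sizes]
opencl_times = [0.7 + 1e-7 * n for n in sizes]

cuda_overhead, cuda_slope = fit_overhead(sizes, cuda_times)
opencl_overhead, opencl_slope = fit_overhead(sizes, opencl_times)

print("overhead gap (s):", opencl_overhead - cuda_overhead)
print("slope gap (s/element):", opencl_slope - cuda_slope)
```

If the slopes match and only the intercepts differ, the two runtimes scale identically and the gap is a one-time cost. A real gap in slope would only show up with more and larger data points.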

Based solely on the last data point, it looks like there might be a widening performance gap for larger jobs, but you don't include any larger test cases. The longest run times listed are 7 second jobs, so clearly it wouldn't have been that time-consuming to run some larger test cases. You could also easily use a logarithmic scale on the time axis. (And please fix the misleading horizontal axis!)

Additionally, you state that your methodology was to run each job 5 times and take the lowest time. As I mentioned above, there could be caching in the runtime or operating system that makes this a poor real-world test. (OpenCL and, AFAIK, CUDA allow the programmer to keep a compiled form of a kernel in memory, so if a program has to run a kernel many times, you can ensure that you aren't needlessly re-compiling your kernels. CUDA may be doing this automatically.) It would be better to report the average times, or even to just show all of them on the graph and have the lines go through the average times.
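To make the difference between these reporting choices concrete, here is a small Python sketch. The helper `summarize` and the five run times are hypothetical; they only illustrate how a slow first run (for example, one that pays for compilation) disappears when only the minimum is reported.

```python
import statistics


def summarize(run_times):
    """Summarize repeated timings of the same job (seconds)."""
    return {
        "first": run_times[0],                          # cold run, may include compilation
        "min": min(run_times),                          # what "take the lowest time" reports
        "mean": statistics.mean(run_times),             # average over all runs
        "warm_mean": statistics.mean(run_times[1:]),    # average excluding the first run
    }


# Hypothetical timings: the first run is slower than the other four.
runs = [1.5, 1.0, 1.0, 1.0, 1.0]
summary = summarize(runs)

for name, value in summary.items():
    print(f"{name}: {value:.2f} s")
```

Reporting the first run and the warm mean side by side shows both the one-time cost and the steady-state cost, which the minimum alone hides.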

And to top it off, the test was simply of doing lots of vector addition. A perfect case of a microbenchmark that doesn't provide useful predictions of real-world performance. Addition is so simple that this is at best a test of memory bandwidth, not computational power. It would have been more useful to test something like an FFT if the intent was to measure efficiency at computation.
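The memory-bandwidth point can be put in numbers. For single-precision c[i] = a[i] + b[i], each element needs two 4-byte loads and one 4-byte store for a single addition, i.e. 1 FLOP per 12 bytes. The Python sketch below works that out; the function names are made up for illustration and the 100 GB/s bandwidth is a placeholder, not a measured or quoted figure for any particular card.

```python
BYTES_PER_FLOAT = 4  # single precision


def vector_add_intensity(bytes_per_element=BYTES_PER_FLOAT):
    """FLOPs per byte of memory traffic for c[i] = a[i] + b[i]."""
    bytes_moved = 3 * bytes_per_element  # load a[i], load b[i], store c[i]
    flops = 1                            # one addition per element
    return flops / bytes_moved


def bandwidth_bound_gflops(intensity, bandwidth_gb_s):
    """Upper bound on GFLOP/s when memory bandwidth is the only limit."""
    return intensity * bandwidth_gb_s


intensity = vector_add_intensity()
# Placeholder bandwidth for illustration only.
ceiling = bandwidth_bound_gflops(intensity, 100.0)

print(f"arithmetic intensity: {intensity:.4f} FLOP/byte")
print(f"bandwidth-bound ceiling at 100 GB/s: {ceiling:.2f} GFLOP/s")
```

At that ratio the arithmetic units spend almost all their time waiting for memory, so the benchmark says little about how well either runtime drives computation.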

Also, any good benchmark includes hardware and software specs. Were these tests run on a CPU or a GPU? What driver versions were involved, and in particular, was either implementation a beta release?

The subjective comments seem like they could be worth discussion, but only with some elaboration first. One can presume that any real-world performance differences between CUDA and OpenCL on the same hardware would be due to the relative immaturity of the OpenCL implementation, or a deliberate choice by Nvidia to make OpenCL look bad. However, given that neither system is particularly entrenched, a debate about the design aspects could be very useful.

It sounds like OpenCL has a better workflow (with the exception of debugging), but that the actual code is uglier. I'd be very interested to see comparisons of the kernel code in each language, and a separate comparison of the set-up code needed to run those kernels. (Although I'm a fan of Andreas Klöckner's Python bindings, which greatly simplify the set-up code.) My impression of OpenCL's kernel language has been that it seems well-designed, so your comment that the kernel code is ugly surprises me. Have I misunderstood CUDA as a low-level framework offering essentially the same kind of access to the hardware? I can't tell yet from what I've read from various sources whether CUDA has an appreciably better design, or whether it is simply benefiting from the biases of people who learned CUDA first and have been using it longer.

jakozaur · 16y ago

In CUDA, kernel-to-PTX compilation is performed when the project is built. In OpenCL it is performed at runtime. That is the "initialization overhead" which causes the results to differ. You could use a cache to avoid it.

So the performance difference is caused by OpenCL's design. In OpenCL it is unknown which hardware the kernel will run on, so compilation is postponed to runtime. I can hardly see any reason why NVIDIA would want to make OpenCL slower.
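The caching idea can be sketched independently of either API: compile once per (kernel source, device) pair and reuse the result afterwards. The Python below is only an illustration of that pattern. `KernelCache` and `fake_compile` are hypothetical names, and `fake_compile` is a stand-in that does no real compilation and calls no OpenCL or CUDA function.

```python
import hashlib


class KernelCache:
    """Compile-once cache keyed by kernel source and target device."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._binaries = {}
        self.compile_count = 0

    def get(self, source, device_name):
        # The device is part of the key because the binary is device-specific.
        key = (hashlib.sha256(source.encode("utf-8")).hexdigest(), device_name)
        if key not in self._binaries:
            self._binaries[key] = self._compile_fn(source, device_name)
            self.compile_count += 1
        return self._binaries[key]


def fake_compile(source, device_name):
    """Stand-in for a runtime compiler; returns a dummy 'binary'."""
    return f"binary<{device_name}:{len(source)} chars>"


KERNEL_SOURCE = (
    "__kernel void vec_add(__global const float *a, "
    "__global const float *b, __global float *c) "
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }"
)

cache = KernelCache(fake_compile)
first = cache.get(KERNEL_SOURCE, "gpu0")   # compiles
second = cache.get(KERNEL_SOURCE, "gpu0")  # served from the cache

print("compiles performed:", cache.compile_count)
```

With a cache like this in place, only the first run of a kernel pays the compilation cost, which is exactly the cost a fair timing comparison has to account for or exclude on both sides.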

There are two major differences between CUDA and OpenCL:
- OpenCL is an industry standard, while CUDA is NVIDIA's platform
- OpenCL is a regular C library; CUDA is, in addition to that, an extension to the C language

CUDA kernel code will be much nicer: it takes fewer lines and looks much better. The cost is that it requires a special compiler (nvcc).

Example CUDA code (run a kernel):
myCudaKernel<<< grid, block, sharedMemorySize >>> (... arguments);

Equivalent in OpenCL:
clSetKernelArg (...); // for each kernel argument!
...
clEnqueueNDRangeKernel (...); // run kernel

andrewcooke · 16y ago

the last part above is not clear.

the code referred to as "kernel code" is actually code that runs on the client to invoke the kernel. opencl uses a library api, while cuda takes a dsl approach. so opencl is more verbose.

as far as i know, the actual kernel code (which describes what happens on the gpu) is pretty similar.

astroguy (OP) · 16y ago

Hi wtallis, Thank you for your comments. Very helpful for my future research articles.

CUDA Driver = CUDART, CUDA Driver Version = 4247589, CUDA Runtime Version = 3.0, Device = Tesla C1060

Please check http://pastebin.com/m73aba293 for source code

astroguy (OP) · 16y ago

For source code please check this http://pastebin.com/m73aba293
