The graph shown illustrates nearly identical scaling behavior between OpenCL and CUDA, with OpenCL taking consistently about half a second longer, except at the last data point where it takes a full second longer. In other words, all we're really seeing here is initialization overhead, and without seeing the source code we can't know if this is at all a fair test. It may be that the CUDA runtime is performing more compilation work before the clock starts, or that it is caching the compiled kernels.
Based solely on the last data point, it looks like there might be a widening performance gap for larger jobs, but you don't include any larger test cases. The longest run times listed are 7-second jobs, so clearly it wouldn't have been that time-consuming to run some larger test cases. You could also easily use a logarithmic scale on the time axis. (And please fix the misleading horizontal axis!)
Additionally, you state that your methodology was to run each job 5 times and take the lowest time. As I mentioned above, there could be caching in the runtime or operating system that makes this a poor real-world test. (OpenCL and, AFAIK, CUDA allow the programmer to keep a compiled form of a kernel in memory, so if a program has to run a kernel many times, you can ensure that you aren't needlessly re-compiling your kernels. CUDA may be doing this automatically.) It would be better to report the average times, or even to just show all of them on the graph and have the lines go through the average times.
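To make the min-versus-average point concrete, here is a minimal C sketch of how the repeated timings for one job size could be summarized. The type and function names (`timing_summary`, `summarize_runs`) are my own, not from the benchmark source, and it assumes the run times are stored in the order they were measured, so that the first entry is the cold run.

```c
#include <stddef.h>

/* Summary of repeated wall-clock timings for one job size, in seconds.
   Names are hypothetical; they do not come from the benchmark's source. */
typedef struct {
    double first;  /* cold run: includes any compile or cache-fill cost */
    double min;    /* best case, which is what "take the lowest time" reports */
    double mean;   /* average over all runs */
    double max;    /* worst case */
} timing_summary;

/* Summarize n >= 1 run times, given in the order they were measured. */
timing_summary summarize_runs(const double *t, size_t n)
{
    timing_summary s = { t[0], t[0], 0.0, t[0] };
    double sum = 0.0;

    for (size_t i = 0; i < n; ++i) {
        if (t[i] < s.min) s.min = t[i];
        if (t[i] > s.max) s.max = t[i];
        sum += t[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

If `first` is much larger than `min`, some caching is going on between runs, and reporting only `min` hides it. Plotting `mean` with `min` and `max` as error bars would show the spread without cluttering the graph.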
And to top it off, the test was simply of doing lots of vector addition. A perfect case of a microbenchmark that doesn't provide useful predictions of real-world performance. Addition is so simple that this is at best a test of memory bandwidth, not computational power. It would have been more useful to test something like an FFT if the intent was to measure efficiency at computation.
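A rough way to see why vector addition is bandwidth-bound is a roofline-style estimate. Single-precision `c[i] = a[i] + b[i]` does 1 floating-point operation per element while moving 12 bytes (two 4-byte reads and one 4-byte write). The sketch below is only an upper-bound model, and the function name `attainable_gflops` is my own.

```c
/* Roofline-style upper bound on throughput, in GFLOP/s.
   A kernel is limited either by the device's peak arithmetic rate or by
   how fast memory can feed it, whichever is lower.
   All inputs are supplied by the caller; nothing here is measured. */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_elem, double bytes_per_elem)
{
    /* Arithmetic intensity: FLOPs performed per byte of memory traffic. */
    double intensity = flops_per_elem / bytes_per_elem;

    /* Throughput ceiling imposed by memory bandwidth alone. */
    double memory_bound = bandwidth_gbs * intensity;

    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With assumed round figures of 900 GFLOP/s peak and 100 GB/s bandwidth (illustrative numbers, not the specs of the card used in the benchmark), `attainable_gflops(900.0, 100.0, 1.0, 12.0)` gives roughly 8.3 GFLOP/s, under 1% of the assumed peak. Under this model the arithmetic units sit almost entirely idle, so the test says little about how well either compiler generates compute code.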
Also, any good benchmark includes hardware and software specs. Were these tests run on a CPU or a GPU? What driver versions were involved, and in particular, was either implementation a beta release?
The subjective comments seem like they could be worth discussion, but only with some elaboration first. One can presume that any real-world performance differences between CUDA and OpenCL on the same hardware would be due to the relative immaturity of the OpenCL implementation, or a deliberate choice by Nvidia to make OpenCL look bad. However, given that neither system is particularly entrenched, a debate about the design aspects could be very useful.
It sounds like OpenCL has a better workflow (with the exception of debugging), but that the actual code is uglier. I'd be very interested to see comparisons of the kernel code in each language, and a separate comparison of the set-up code needed to run those kernels. (Although I'm a fan of Andreas Klöckner's Python bindings, which greatly simplify the set-up code.) My impression of OpenCL's kernel language has been that it seems well-designed, so your comment that the kernel code is ugly surprises me. Have I misunderstood CUDA as a low-level framework offering essentially the same kind of access to the hardware? I can't tell yet from what I've read from various sources whether CUDA has an appreciably better design, or whether it is simply benefiting from the biases of people who learned CUDA first and have been using it longer.
So the performance difference is caused by OpenCL's design. In OpenCL it is not known in advance which hardware the kernel will run on, so compilation is postponed to runtime. I can hardly see any reason why NVIDIA would want to make OpenCL slower.
There are two major differences between CUDA and OpenCL:
- OpenCL is an industry standard, while CUDA is NVIDIA's platform.
- OpenCL is a regular C library; CUDA is that plus an extension to the C language.
CUDA kernel code will be much nicer: it takes fewer lines and looks much better. The cost is that it requires a special compiler (nvcc).
Example CUDA code (run a kernel):
    myCudaKernel<<< grid, block, sharedMemorySize >>> (... arguments);
Equivalent in OpenCL:
    clSetKernelArg (...); // for each kernel argument!
    ...
    clEnqueueNDRangeKernel (...); // run kernel
the code referred to as "kernel code" is actually code that runs on the client to invoke the kernel. opencl uses a library api, while cuda takes a dsl approach. so opencl is more verbose.
as far as i know, the actual kernel code (which describes what happens on the gpu) is pretty similar.
CUDA Driver = CUDART, CUDA Driver Version = 4247589, CUDA Runtime Version = 3.0, Device = Tesla C1060
Please check http://pastebin.com/m73aba293 for source code