outside of simple uses, templates aren't something you'd want in your kernels anyway. regardless, nvidia should be pushing their improvements through to OpenCL by exposing them as extensions, which might then get adopted into the core spec.
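to make the extension route concrete, here's a minimal sketch of how host code can check for a vendor extension before relying on it. it assumes you've already fetched the device's extension string (in real code that comes from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`, which returns a space-separated list); `has_extension` is a helper name i made up for illustration, not part of OpenCL.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not an OpenCL API): returns true if `name` appears as a
// whole token in a space-separated extension list, which is the format OpenCL
// uses for the CL_DEVICE_EXTENSIONS string.
bool has_extension(const std::string& extension_list, const std::string& name) {
    std::istringstream tokens(extension_list);
    std::string token;
    while (tokens >> token) {
        // Compare whole tokens so that a prefix of a longer extension name
        // is not mistaken for a match.
        if (token == name) return true;
    }
    return false;
}
```

comparing whole tokens rather than doing a substring search matters here, since one extension name can be a prefix of another.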
> What bugs me about OpenCL is the intentional vagueness of the specification that gives every implementer the freedom to do whatever they want with the result that performance portability is often difficult to achieve.
well, that flexibility is required for OpenCL to be meaningful. the parts the spec leaves open are exactly where hardware platforms vary; it's what differentiates compute devices. if that vagueness weren't there, we couldn't have things like OpenCL on FPGAs (altera, xilinx).
as for performance portability, perhaps that is an issue, though it depends entirely on the type of problem you're trying to compute. but here's what i don't understand:
you could have picked a proprietary API to do your compute, but say you choose CL. you optimize for your hardware and, what do you know, it's not really that fast on other hardware. but you're entirely overlooking the biggest boon here: your code ran on the other hardware in the first place. getting performant code is now only a matter of optimizing for that piece of hardware.
you could argue that's entirely too complicated, but that's what we've already been doing with our regular C/C++ programs (SSE/AVX/SMP...).
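a rough sketch of what "optimizing for that piece of hardware" can look like in practice: keep one kernel source and pick per-device build options at runtime. everything named here (`Tuning`, `pick_tuning`, `build_options`, the `WG_SIZE`/`VEC_WIDTH` macros) is hypothetical, and the numbers are placeholders rather than measurements. the vendor string is assumed to come from `clGetDeviceInfo` with `CL_DEVICE_VENDOR`, and the resulting string would be passed as the options argument of `clBuildProgram`.

```cpp
#include <string>

// Hypothetical per-device tuning knobs; values below are placeholders,
// not benchmark results.
struct Tuning {
    int work_group_size;
    int vector_width;
};

// Pick tuning parameters from the device vendor string (assumed to be the
// value reported for CL_DEVICE_VENDOR).
Tuning pick_tuning(const std::string& vendor) {
    if (vendor.find("NVIDIA") != std::string::npos) return {256, 1};
    if (vendor.find("Intel") != std::string::npos) return {64, 8};
    return {64, 1};  // conservative fallback for devices we haven't tuned for
}

// Turn the tuning into "-D name=value" preprocessor options so the same
// kernel source can be specialized per device at build time.
std::string build_options(const Tuning& t) {
    return "-D WG_SIZE=" + std::to_string(t.work_group_size) +
           " -D VEC_WIDTH=" + std::to_string(t.vector_width);
}
```

the kernel still runs everywhere via the fallback; tuning a new device means adding one more entry, which is about the same amount of work as adding another SSE/AVX code path to a CPU program.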