Look at it this way:
1. Most software today is I/O bound, and in that case programmers don't (and shouldn't) care about shared-memory parallelism.
2. Most popular programming languages today are based on classic imperative programming, going back to Fortran and co.
3. These classic languages suggest using loops.
4. Loops are inherently not parallelizable. Only in the specific case where there is no loop-carried dependency does a loop become parallelizable.
5. These languages have basically infested everything we do, including compute-bound and memory-bandwidth-bound problems that should now be treated in parallel. (Even the achievable memory bandwidth goes down with sequential execution, for example on Intel sockets usually by a factor of 2.)
6. Since most parallel code therefore gets written as loops, parallelizing it becomes a hard problem. What the compiler vendors usually do is fling [directives](http://www.openacc.org/) [at us](http://openmp.org/wp/).
7. Directives work well until they don't, and then you have no idea why, because the compiler is a black box for most programmers (who can't read assembly-like code).
8. What we should get is a way of saying: Here is some scalar code. I'd like this code to be applied in parallel over domains X, Y, Z etc. Be aware of symbols alpha and beta that are dependent in X, Y, Z, as well as gamma that is dependent in X only.
9. This should be available at the language level, so programmers start thinking in these terms. Only then do we have a reliable way of making use of data parallelism.
10. CUDA and OpenCL are actually pretty close to this, but slightly too low level and generally thought of as hard to program in (which I don't agree with, but that's the image).
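
To make points 4 and 8 a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how any real compiler or framework does it: `running_sum`, `scale`, `kernel` and `apply_over` are hypothetical names made up for this example, the domain is cut down to X and Y, and the thread pool only shows the structure (it won't give a real speedup for pure Python arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def running_sum(a):
    # Loop-carried dependency: iteration i reads out[i - 1], which was
    # written by iteration i - 1, so the iterations cannot simply be
    # executed independently of each other.
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + (out[i - 1] if i > 0 else 0.0)
    return out


def scale(a, factor):
    # No carried dependency: iteration i only touches index i, so the
    # iterations could run in any order, or all at once.
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = factor * a[i]
    return out


def kernel(alpha, beta, gamma):
    # "Scalar code": written for a single point of the domain only.
    return alpha * beta + gamma


def apply_over(kernel, nx, ny, alpha, beta, gamma):
    # Hypothetical helper: applies the scalar kernel over the X-Y domain.
    # Assumption for this sketch: alpha and beta depend on (x, y),
    # gamma depends on x only.
    def at(point):
        x, y = point
        return kernel(alpha[x][y], beta[x][y], gamma[x])

    with ThreadPoolExecutor() as pool:
        # pool.map returns results in input order, so flat[x * ny + y]
        # holds the value for point (x, y).
        flat = list(pool.map(at, product(range(nx), range(ny))))
    return [flat[x * ny:(x + 1) * ny] for x in range(nx)]
```

The point of `apply_over` is that the programmer states the domain and which symbol depends on which dimension, and never writes the loop. Whether the points are then processed by threads, SIMD lanes or GPU blocks is left to the implementation.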
Disclaimer: I've been involved in this problem space for some time, and [this](https://github.com/muellermichel/Hybrid-Fortran) is what has come out of it. It's HPC targeted, but at some point I'd like to make this whole parallel computing thing more generally approachable.