Actor architectures are only useful for task parallelism which no one really knows how to get much out of; definitely not the close-to-linear performance benefits we can get from data parallelism. Task parallelism is much better for when you have to do multiple things at once (more efficient concurrency), not for when you want to make a sequential task faster.
Maybe this will help
http://jlouisramblings.blogspot.com/2011/07/erlangs-parallel...
and
I'm using Erlang and GPU programming each for its area of expertise. FWIW I even use both together via https://github.com/tonyrog/cl
Erlang is great at asynchronous concurrency which happens to be able to run in parallel well because of how the VM is built.
GPU's solve totally different problems
It should also be clear that task parallelism (or concurrency from your perspective) has not had the benefit of billions of engineer-hours focused on improving its performance. It is within recent memory that if you wanted 20+ CPUs at your disposal, you'd have to build a cluster with explicit job management, topologically-optimized communications, and a fair amount of physical redundancy.
As many of the applications requiring low-end clusters tended to involve random numbers or floating point calculations, we also had the annoyance of minor discrepancies such as clock drift affecting the final output. This would present, for example, in a proportional percentage of video frames with conspicuously different coloration.
> It is within recent memory that if you wanted 20+ CPUs at your disposal, you'd have to build a cluster with explicit job management, topologically-optimized communications, and a fair amount of physical redundancy.
You are still thinking about concurrency, not parallelism. Yes, the cluster people had to think this way, they were interested in performance for processing many jobs; no the HPC people who needed performance never thought like this, they were only interested in the performance of one job.
> As many of the applications requiring low-end clusters tended to involve random numbers or floating point calculations, we also had the annoyance of minor discrepancies such as clock drift affecting the final output.
Part of the problem, I think, is that we've been confused for a long time. Our PHBs saw problems (say massive video frame processing) and saw solutions that were completely inappropriate for it (cluster computing). Its only recently that we've realized there are often other/better option (like running MapReduce on that cluster).