Eg you have 100M rows and you want to cluster them by a distance function (naively), running dist(arr[i], arr[j]) is crazy fast, the problem is just that you have so many of them. It is faster to run it on one core than dispatch it from one queue to multiple cores, but best to assign the work ahead of time to n cores and have them crunch the numbers.