Hmm, maybe they should.
In the paper you linked they don't seem to be able to perform a priority queue op in under about 1 us. Even when there are only 15 threads running on an exotic 29-core machine, they can only handle 750e3 ops/second (with no workload on each op). (This is consistent with a benchmark I did on Boost.ASIO a while back.
So the tasks should be sized to take at least (thread cnt)* 9 us to complete to keep the overhead from the priority queue overhead under about 10%. The best cases for the lock free algorithm might let you bring that down to about 3 us in theory. I'm not sure that justifies the additional complexity.
This survey paper http://www.zurich.ibm.com/pdf/sys/adv_messaging/shortConcurr... says a lot of stuff and then at the end Figure 4 shows a simple coarse-locked single-threaded structure providing double the performance when there is actual load involved and significant preemption. Which is the opposite of what I'd expected, which would be the occasional preemption of the thread holding the coarse lock would cause a slowdown.