
0 points · OskarS · 3y ago · 0 comments

Assuming low or no contention, it is easy to imagine a scenario where a mutex vastly outperforms a lock-free queue: if you need to push 1000 things into the queue, it's still just two fences for the mutex (one to lock, one to unlock), but it's now 1000 CASes for the lock-free version.
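
To make the batch case concrete, here is a minimal sketch of a mutex-guarded queue where pushing N items costs one lock and one unlock, however large N is. `LockedQueue` and its method names are hypothetical, made up for illustration rather than taken from any library, and the sketch assumes the uncontended case described above.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical mutex-guarded queue (illustrative names, not a real library API).
template <typename T>
class LockedQueue {
public:
    // One lock acquisition and one release cover the whole batch,
    // no matter how many items it contains.
    void push_batch(const std::vector<T>& items) {
        std::lock_guard<std::mutex> guard(mutex_);
        queue_.insert(queue_.end(), items.begin(), items.end());
    }

    // Pops the oldest item into `out`; returns false if the queue is empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return queue_.size();
    }

private:
    mutable std::mutex mutex_;
    std::deque<T> queue_;  // plain container, written with no thread-safety in mind
};
```

A queue that publishes every element with its own CAS would pay for 1000 atomic read-modify-write operations on the same 1000-item batch.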

Moreover, the point of mutexes is that your data structure can be optimized as if thread safety were not a concern. There are lots of hyper-optimized hash table variants (with all sorts of SIMD nonsense and stuff) that are just not possible to do lock-free. The very "lock-freedomness" of a data structure slows it down enough that, in low-contention scenarios, mutexes would clearly outperform it without the scenario being particularly contrived.


7 comments · 3 top-level

ot · 3y ago · 3 in thread

If you are going to do batch operations, your data structure should be optimized to support them, so you're back to one CAS. The same would apply to the locked scenario, where you probably don't want to copy 1000 elements in the critical section.
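
As one possible illustration of "back to one CAS", the sketch below links a whole batch into a private chain first and then publishes it onto a Treiber-style stack with a single compare-and-swap (retried only if another thread changed the head in between). `BatchStack` is a hypothetical name; this is not the API of the library under discussion, and it uses a stack rather than a queue to keep the example short.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical lock-free stack that publishes a whole batch with one CAS.
class BatchStack {
public:
    ~BatchStack() { take_all(); }

    void push_batch(const std::vector<int>& values) {
        if (values.empty()) return;
        // Build the chain privately; no other thread can see these nodes yet.
        Node* first = nullptr;  // will become the new head
        Node* last = nullptr;   // tail of the private chain
        for (int v : values) {
            Node* n = new Node{v, first};
            if (last == nullptr) last = n;
            first = n;
        }
        // Publish: a single successful CAS makes the entire batch visible.
        Node* old_head = head_.load(std::memory_order_relaxed);
        do {
            last->next = old_head;
        } while (!head_.compare_exchange_weak(old_head, first,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    // Detaches everything with one atomic exchange and returns the values,
    // most recently pushed first.
    std::vector<int> take_all() {
        Node* n = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<int> out;
        while (n != nullptr) {
            out.push_back(n->value);
            Node* next = n->next;
            delete n;
            n = next;
        }
        return out;
    }

private:
    struct Node {
        int value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};
};
```

The consumer side here only detaches the whole list at once, which sidesteps the ABA problem that a per-node lock-free pop would have to handle.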

As for the sufficiently smart optimizations: sure, everything is easy to imagine, but in my experience this has never happened, and I'd be curious to hear practical examples if you have any.

sakras · 3y ago

Here's one I had: I was trying to build a Bloom filter in parallel. Each thread had large-ish batches of hashes it wanted to insert into the filter. Naively, you'd just have each thread iterate through its batches and do __sync_fetch_and_or for each of the hashes (this was a register-blocked Bloom filter, so we only needed to perform one 8-byte OR operation per hash).
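
A rough sketch of that naive strategy, using `std::atomic<std::uint64_t>::fetch_or` as the portable C++ counterpart of `__sync_fetch_and_or`. The class name `AtomicBloom`, the choice of three bits per hash, and the way the hash is split into a word index and a bit mask are all assumptions for illustration; the original filter's layout is not given in the comment.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mask: three 6-bit fields of the hash each pick one bit
// inside a single 64-bit word (the "register-blocked" idea).
inline std::uint64_t bit_mask(std::uint64_t hash) {
    return (1ULL << (hash & 63)) | (1ULL << ((hash >> 6) & 63)) |
           (1ULL << ((hash >> 12) & 63));
}

class AtomicBloom {
public:
    explicit AtomicBloom(std::size_t num_words) : words_(num_words) {}

    // Safe to call from many threads at once: one atomic OR per hash.
    void insert(std::uint64_t hash) {
        words_[index(hash)].fetch_or(bit_mask(hash), std::memory_order_relaxed);
    }

    // May return false positives, never false negatives. With relaxed
    // ordering, call this only after the inserting threads have been joined.
    bool maybe_contains(std::uint64_t hash) const {
        const std::uint64_t mask = bit_mask(hash);
        return (words_[index(hash)].load(std::memory_order_relaxed) & mask) == mask;
    }

private:
    std::size_t index(std::uint64_t hash) const {
        return static_cast<std::size_t>(hash >> 32) % words_.size();
    }

    std::vector<std::atomic<std::uint64_t>> words_;  // zero-initialized
};
```

Every insert is an atomic read-modify-write on shared memory, so threads hitting nearby words contend on cache lines even though no lock is ever taken.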

What ended up being MUCH faster was to partition the filter, and to have a lock per partition. Each thread would attempt to grab a random lock, perform its inserts into that partition, then release the lock and try to grab another random lock that it hasn't grabbed yet. Granted, these locks were just atomic booleans, not std::mutex or anything like that. But I think this illustrates that partitioning+locking can be better for throughput. If you want predictable latency for single inserts, then I'd imagine the __sync_fetch_and_or strategy would work better. Which maybe brings up a broader point that this whole discussion relies a lot on exactly what "faster" means to you.
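
The partition-and-lock scheme could look roughly like the sketch below. Everything named here (`PartitionedBloom`, `insert_batch`, the hash-to-partition mapping, the per-thread `seed`) is hypothetical, and bucketing the hashes by partition up front is an assumption about how the inserts were organized, not something the comment states.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical mask: three bits inside one 64-bit word per hash.
inline std::uint64_t bit_mask(std::uint64_t hash) {
    return (1ULL << (hash & 63)) | (1ULL << ((hash >> 6) & 63)) |
           (1ULL << ((hash >> 12) & 63));
}

class PartitionedBloom {
public:
    PartitionedBloom(std::size_t num_partitions, std::size_t words_per_partition)
        : parts_(num_partitions), words_per_partition_(words_per_partition) {
        for (Partition& p : parts_) p.words.assign(words_per_partition, 0);
    }

    // Intended to be called concurrently, once per thread, with a per-thread seed.
    void insert_batch(const std::vector<std::uint64_t>& hashes, std::uint64_t seed) {
        // Group this thread's hashes by destination partition.
        std::vector<std::vector<std::uint64_t>> buckets(parts_.size());
        for (std::uint64_t h : hashes) buckets[partition_of(h)].push_back(h);

        // Visit the partitions this thread needs in a random order.
        std::vector<std::size_t> pending;
        for (std::size_t p = 0; p < buckets.size(); ++p)
            if (!buckets[p].empty()) pending.push_back(p);
        std::mt19937_64 rng(seed);
        std::shuffle(pending.begin(), pending.end(), rng);

        while (!pending.empty()) {
            for (std::size_t i = 0; i < pending.size();) {
                Partition& part = parts_[pending[i]];
                // Try-lock: if another thread holds it, move on and retry later.
                if (part.locked.exchange(true, std::memory_order_acquire)) {
                    ++i;
                    continue;
                }
                // Inside the lock, plain non-atomic ORs are enough.
                for (std::uint64_t h : buckets[pending[i]])
                    part.words[word_of(h)] |= bit_mask(h);
                part.locked.store(false, std::memory_order_release);
                pending[i] = pending.back();  // done with this partition
                pending.pop_back();
            }
        }
    }

    // Only safe once all inserting threads have finished.
    bool maybe_contains(std::uint64_t hash) const {
        const std::uint64_t mask = bit_mask(hash);
        return (parts_[partition_of(hash)].words[word_of(hash)] & mask) == mask;
    }

private:
    struct Partition {
        std::atomic<bool> locked{false};   // the "atomic boolean" lock
        std::vector<std::uint64_t> words;  // guarded by `locked`
    };

    std::size_t partition_of(std::uint64_t h) const {
        return static_cast<std::size_t>(h >> 48) % parts_.size();
    }
    std::size_t word_of(std::uint64_t h) const {
        return static_cast<std::size_t>((h >> 24) & 0xFFFFFF) % words_per_partition_;
    }

    std::vector<Partition> parts_;
    std::size_t words_per_partition_;
};
```

Each thread pays one atomic exchange and one atomic store per partition instead of one atomic OR per hash, and a thread that finds a partition busy simply works on a different one.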

ot · 3y ago

This seems to me like a parallel accumulation problem. Why not have each thread accumulate a filter on a subset of the data (so no locking involved), and then reduce the results (which is just an OR of all the local accumulations)?
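
A minimal sketch of that accumulate-then-reduce shape. Threading is left out: `build_local` is what each worker would run on its own slice, and `reduce_or` is the final merge, done here on a single thread. The function names and the hash-to-word mapping are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Filter = std::vector<std::uint64_t>;

// Hypothetical mask: three bits inside one 64-bit word per hash.
inline std::uint64_t bit_mask(std::uint64_t hash) {
    return (1ULL << (hash & 63)) | (1ULL << ((hash >> 6) & 63)) |
           (1ULL << ((hash >> 12) & 63));
}

inline std::size_t word_index(std::uint64_t hash, std::size_t num_words) {
    return static_cast<std::size_t>(hash >> 32) % num_words;
}

// Run by each worker on its own subset of the data; touches no shared state.
Filter build_local(const std::vector<std::uint64_t>& hashes, std::size_t num_words) {
    Filter local(num_words, 0);
    for (std::uint64_t h : hashes) local[word_index(h, num_words)] |= bit_mask(h);
    return local;
}

// Run once after all workers finish: OR every local filter into one result.
// Assumes every local filter has exactly `num_words` words.
Filter reduce_or(const std::vector<Filter>& locals, std::size_t num_words) {
    Filter merged(num_words, 0);
    for (const Filter& local : locals)
        for (std::size_t i = 0; i < num_words; ++i) merged[i] |= local[i];
    return merged;
}

bool maybe_contains(const Filter& filter, std::uint64_t hash) {
    const std::uint64_t mask = bit_mask(hash);
    return (filter[word_index(hash, filter.size())] & mask) == mask;
}
```

The insert phase needs no synchronization at all, at the cost of one full-size filter per worker plus the merge pass.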

sakras · 3y ago

Parallel reductions are a heavier-weight form of synchronization than locks. Say we have 64 partitions: we then need to perform 6 levels of tree reduction (log2 of 64), or avoid parallelism completely and perform the reduction on a single thread. Either way it was slower.

The locking strategy very rarely had any reduction in parallelism due to the randomized lock-taking.

There were also other reasons, such as not wanting to replicate the filter per-thread.

T0pH4t · 3y ago · 1 in thread

^ This can definitely be the case in a multi-producer/multi-consumer (MPMC) scenario if CAS loops are involved. Great care has to be taken when writing lock-free MPMC data structures that are more performant than their locked equivalents; they are far more complex. I think it should be called out that most (if not all) of the data structures provided seem to be single-producer/single-consumer, which generally have much simpler designs (and more limitations of use) than MPMC.

T0pH4t · 3y ago

I should also say this applies to MPSC and SPMC. Basically anything other than SPSC.

dnedic · 3y ago

This is why you would use the Ring Buffer or Bipartite Buffer to place 1000 elements at a time, not the queue. Check the documentation for more info.
