Here's one I had: I was trying to build a Bloom filter in parallel. Each thread had large-ish batches of hashes it wanted to insert into the filter. Naively, you'd just have each thread iterate through the batches and do __sync_fetch_and_or for each of the hashes (this was a register-blocked Bloom filter so we only needed to perform one 8-byte or operation per hash).
What ended up being MUCH faster was to partition the filter, and to have a lock per partition. Each thread would attempt to grab a random lock, perform its inserts into that partition, then release the lock and try to grab another random lock that it hasn't grabbed yet. Granted, these locks were just atomic booleans, not std::mutex or anything like that. But I think this illustrates that partitioning+locking can be better for throughput. If you want predictable latency for single inserts, then I'd imagine the __sync_fetch_and_or strategy would work better. Which maybe brings up a broader point that this whole discussion relies a lot on exactly what "faster" means to you.