A collection of lock-free data structures written in standard C++11

(github.com)

158 points · dnedic · 3y ago · 81 comments

bluGill3y ago· 17 in thread

From the FAQ:

> The biggest reason you would want to use a lockfree data structure in such a scenario would be performance. Locking has a non-negligible runtime cost on hosted systems as every lock requires a syscall.

This is misleading. While a lock does have a runtime cost, in some cases that cost is less than all the forced CPU cache synchronization that lock-free code needs to do. With a lock you only have to sync once, after all the operations are done. You need to carefully measure this to see which is more performant for your application.

ot3y ago

This is true in principle and it is good to call it out, but in practice I've never seen a mutex-based data structure beat an equivalent lock-free data structure, even at low contention, unless the latter is extremely contrived.

A mutex transaction generally requires 2 fences, one on lock and one on unlock. The one on unlock would not be strictly necessary in principle (on x86 archs the implicit acquire-release semantics would be enough) but you generally do a CAS anyway to atomically check whether there are any waiters that need a wake-up, which implies a fence.

Good lock-free data structures OTOH require just one CAS (or other fenced RMW) on the shared state.

Besides, at large scale, no matter how small your critical section is, it will be preempted every once in a while, and when you care about tail latency that is visible. Lock-free data structures have more predictable latency characteristics (even better if wait-free).

kerkeslager3y ago

I appreciate your polite tone here. To expand on this at the risk of sounding a bit rude: nobody should listen to anyone who speaks about performance in terms of reasoning about a system instead of profiling it.

Computers are shockingly complex. I can't tell you how many times I've reasoned about a system, run the profiler, and discovered I was completely wrong.

When I was working on an interpreter for a Lisp, I implemented my first cut of scopes (all the variables within a scope and their values) as a naive unsorted list of key/value pairs, thinking I'd optimize later. When I came back to optimize, I reimplemented this as a hashmap, but when I ran my test programs, to my horror, they were all 10x slower. I plugged in a hashmap library used in lots of production systems and got a significant 2x performance gain, which was still slower than looping over an unsorted list of key/value pairs. The fact is, most scopes have <10 variables, and at that size, looping over a list is faster than the constant time of a hashmap. I can reason about why this is, but that's just fitting my reasons to the facts ex-post-facto. Reasoning didn't lead me to the correct answer, observation did.

Returning to parallel data structures, the fact is, I don't know why lock-free structures are faster than mutex-based structures, I just know that they are in every situation where I've profiled them.

Reasoning isn't completely useless--reasoning is how you intuit what you should be profiling. But if you're just reasoning about how two alternatives will perform and not profiling them in real-life production systems you're wasting everyone's time.

OskarS3y ago

Assuming low or no contention, it is easy to imagine a scenario where a mutex vastly outperforms it: if you need to push 1000 things into the queue, it's still just two fences for the mutex but it's now 1000 CASes.

Moreover: the point with mutexes is that your data structure can be optimized assuming no thread-safety. There are lots of, like, hyper-optimized hash table variants (with all sorts of SIMD nonsense and stuff) that are just not possible to do lock-free. The very "lock-freedomness" of the data structure slows it down enough that in low contention scenarios mutexes clearly would outperform them without being particularly contrived.
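The batching argument can be sketched as follows. This is a minimal illustration, not code from the linked library: `LockedQueue`, `push_batch`, and `demo_batch` are hypothetical names. It only shows that a batch of any size pays for one lock/unlock pair, whereas a per-element lock-free push would pay one atomic RMW per element.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical mutex-protected queue, for illustration only.
class LockedQueue {
public:
    // One lock/unlock pair per element.
    void push(int value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(value);
    }

    // One lock/unlock pair for the whole batch, however large it is.
    void push_batch(const std::vector<int>& values) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.insert(items_.end(), values.begin(), values.end());
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::deque<int> items_;  // plain container, free to be optimized for single-threaded use
};

// Usage sketch: 1000 elements cost one lock acquisition instead of 1000.
std::size_t demo_batch() {
    LockedQueue q;
    q.push_batch(std::vector<int>(1000, 7));
    assert(q.size() == 1000);
    return q.size();
}
```

Whether this actually beats a lock-free queue on a given workload is something to measure, as the rest of the thread stresses.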

josephg3y ago

Are there any good benchmarks which demonstrate the performance characteristics you’re talking about? Or case studies where an application moved from mutexes to lock free data structures, and compared the resulting performance?

dragontamer3y ago

Pikus has a number of C++ parallelism performance talks that discuss this issue.

In particular, the "cost of synchronization" is roughly the price of an L3 memory read/write, or ~50 cycles. In contrast, a read/write to L1 cache has about 4 cycles of latency and can be done multiple times per clock tick, and you can go faster still (i.e. register space).

So an unsynchronized "counter++" will be done ~4-billion times a second.

But a "counter++; sync()" will be roughly 50x slower. That is to say, the "sync()" is the "expensive part" to think about.

------------

A lock is two sync() statements. One sync() when you lock, and a 2nd sync() when you unlock. Really, half-sync() since one is an acquire half-sync and the other is a release half-sync.

If your lock-free data structure uses more than two sync() statements, you're _probably_ slower than a dumb lock-based method (!!!). This is because the vast majority of lock()/unlock() calls are uncontested in practice.

So the TL;DR is, use locks because they're so much easier to think about. When you need more speed, measure first, because it turns out that locks are really damn fast in practice and kind of hard to beat.

That being said, there are other reasons to go lock-free. On some occasions, you will be forced to use lock-free code because blocking cannot be tolerated (e.g. lock-free code in interrupt handlers; it's fine if the interrupt takes a bit longer than expected during contention). In this case, even if lock-free is slower, the guarantee of forward progress is what you're going for, rather than performance.

So the study of lock-free data structures is still really useful. But don't always think of them in terms of performance, because these data structures may well be slower than std-library + lock() in many cases.
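Since the advice is to measure first, here is a rough single-threaded timing sketch of the three variants being discussed: a plain increment, an atomic RMW, and an increment under an uncontended mutex. `time_ns`, `Timings`, and `compare_counters` are hypothetical helpers, and the numbers depend entirely on the compiler, flags, and CPU, so no expected timings are given.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Runs `body` `iterations` times and returns the elapsed time in nanoseconds.
template <typename Body>
std::int64_t time_ns(std::uint64_t iterations, Body body) {
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) body();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

struct Timings {
    std::int64_t plain_ns;
    std::int64_t atomic_ns;
    std::int64_t mutex_ns;
};

Timings compare_counters(std::uint64_t iterations) {
    volatile std::uint64_t plain = 0;  // volatile so the loop is not optimized away
    std::atomic<std::uint64_t> atomic_counter{0};
    std::uint64_t guarded = 0;
    std::mutex m;

    Timings t;
    t.plain_ns = time_ns(iterations, [&] { plain = plain + 1; });
    // Sequentially consistent read-modify-write: the "counter++; sync()" case.
    t.atomic_ns = time_ns(iterations, [&] { atomic_counter.fetch_add(1); });
    // Uncontended lock: one acquire on lock, one release on unlock.
    t.mutex_ns = time_ns(iterations, [&] {
        std::lock_guard<std::mutex> guard(m);
        ++guarded;
    });

    assert(plain == iterations);
    assert(atomic_counter.load() == iterations);
    assert(guarded == iterations);
    return t;
}
```

This measures only the uncontended case on one thread; contended behavior and tail latency need a multi-threaded benchmark on the target hardware.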

gpderetta3y ago

A sync, assuming it is your typical memory barrier, is not bound by the L3 latency. You pay (in first approximation) the L3 cost when you touch a contended cache line, whether you are doing a plain write or a full atomic CAS.

Separately, fences and atomic RMWs are slower than plain reads/writes, but that's because of the (partially) serialising effects they have on a CPU pipeline, and has very little to do with L3 (or any memory) latency.

Case in point: A CAS on intel is 20ish cycles, the L3 latency is 30-40 cycles or more. On the other hand you can have multiple L3 misses outstanding, but CAS hardly pipelines.

planede3y ago

Moreover, is "every lock requires a syscall" accurate? Probably depends on target platform and standard library, but my impression was that at low contention it doesn't really require syscalls, at least on Linux and glibc's pthread.

kevincox3y ago

Definitely not. I don't think there are many standard libraries that use syscalls for every lock. It is near universal to attempt a lock in user space (maybe even spin a few times) and only call the kernel if you need to wait. So low-contention locks should very rarely make a syscall.

m4nu3l3y ago

That's my understanding too. In Linux, mutexes are implemented with futexes (fast user-space mutexes). If there is no contention they are guaranteed not to perform a syscall: https://en.wikipedia.org/wiki/Futex

maldev3y ago

You can literally just do an atomic swap on some memory location. Three lines of assembly, like:

        mov rax, 1                 ; load the value to exchange (al is the low byte of rax)
    acquire_lock:
        ; attempt to acquire the lock
        xchg byte [ADDR_LOCK], al  ; atomically swap the lock byte with al
        test al, al                ; was the previous lock value 0 (unlocked)?
        jnz acquire_lock           ; if not, loop until we can acquire the lock

The downside is you want a backoff to sleep the thread so it doesn't spin in a busy loop. But the actual lock code is simple. You can easily have a single attempt be your function "AcquireLock()" and then do

    while (!AcquireLock()) { /* pause thread execution */ }

And I think this is where they get the syscall being needed, since this will normally require a syscall to pause the thread from the scheduler.
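A C++11 counterpart of that assembly might look like the sketch below. It is an illustration under assumptions, not how any particular standard library implements its mutex: `SpinLock` and `demo_spinlock` are hypothetical names, the spin threshold of 64 is arbitrary, and `std::this_thread::yield()` stands in for a real kernel wait such as a futex. The fast path is a single atomic exchange in user space; only the backoff path may involve the scheduler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
public:
    // A single atomic exchange, like the xchg above; true if we took the lock.
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }

    void lock() {
        int spins = 0;
        while (!try_lock()) {
            // Backoff: after a few failed attempts, let the scheduler run another thread.
            if (++spins > 64) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Two threads increment a plain int under the lock; returns the final count.
int demo_spinlock(int per_thread) {
    assert(per_thread >= 0);
    SpinLock lock;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```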

RcouF1uZ4gsC3y ago

For me the biggest benefit of lock free programming is avoiding deadlocks.

Locks are not composable. Unless you are aware of what locks every function in your call tree is using, you can easily end up with a deadlock.
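To make the composability problem concrete: if one code path locks A then B while another locks B then A, the two can deadlock. The sketch below shows the usual mitigation with `std::lock`, which acquires several mutexes using a deadlock-avoidance algorithm. `Account`, `transfer`, and `demo_transfers` are hypothetical names; the catch the comment points at is that this only works when a single function knows about every lock involved.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
    std::mutex m;
    long balance = 0;
};

// Locks both accounts without imposing a caller-visible lock order.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);     // locking the same mutex twice would be undefined
    std::lock(from.m, to.m);  // deadlock-avoiding acquisition of both mutexes
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Opposite-direction transfers run concurrently; returns the combined balance.
long demo_transfers(int rounds) {
    Account a;
    Account b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```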

bcrl3y ago

This is why RCU, aka Read-Copy-Update, exists. RCU avoids the expensive synchronization part by delaying the release of the older version of the data structure until a point at which all CPUs are guaranteed to see the new version of the data. The patents have now expired, so it's worth investigating for people that need to write high performance multithreaded code that hits certain shared data structures really hard.
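The read-copy-update shape can be sketched in portable C++11 as below. This is explicitly not kernel-style RCU: reference counting via `std::shared_ptr` stands in for grace periods, a single writer is assumed, and the `std::atomic_load`/`std::atomic_store` overloads for `shared_ptr` are typically implemented with internal locks (they are also deprecated since C++20). `Snapshot` and `demo_snapshot` are hypothetical names. The point is only the pattern: readers take the current version, the writer copies, modifies the copy, and publishes it, and the old version is released once the last reader drops it.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

class Snapshot {
public:
    using Data = const std::vector<int>;

    Snapshot() : current_(std::make_shared<Data>()) {}

    // Reader: grab the current version; it stays alive while the pointer is held.
    std::shared_ptr<Data> read() const { return std::atomic_load(&current_); }

    // Writer (single writer assumed): copy, update the copy, publish.
    void append(int value) {
        std::shared_ptr<Data> old = std::atomic_load(&current_);
        auto next = std::make_shared<std::vector<int>>(*old);  // read + copy
        next->push_back(value);                                // update the private copy
        std::atomic_store(&current_, std::shared_ptr<Data>(std::move(next)));  // publish
    }

private:
    std::shared_ptr<Data> current_;
};

// A reader holding an old version is unaffected by a later update.
bool demo_snapshot() {
    Snapshot s;
    std::shared_ptr<Snapshot::Data> before = s.read();
    s.append(42);
    std::shared_ptr<Snapshot::Data> after = s.read();
    assert(before->empty());
    return after->size() == 1 && (*after)[0] == 42;
}
```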

dnedicOP3y ago

The linked talk from Herb Sutter says as much, but it is a good suggestion to make that visible upfront, thanks.

Additionally, cacheline alignment of indexes is something that's there to mitigate the false sharing phenomenon and reduce some of the cache synchronization cost.
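Cache-line alignment of the indexes can be sketched like this. The names `PaddedIndexes`, `kCacheLine`, and `index_distance` are hypothetical and not taken from the library, and the 64-byte line size is an assumption that holds on common x86-64 parts but not everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

struct PaddedIndexes {
    // Written by the consumer only.
    alignas(kCacheLine) std::atomic<std::size_t> read_index{0};
    // Written by the producer only; alignas pushes it onto its own cache line.
    alignas(kCacheLine) std::atomic<std::size_t> write_index{0};
};

static_assert(alignof(PaddedIndexes) == kCacheLine, "struct starts on a line boundary");
static_assert(sizeof(PaddedIndexes) >= 2 * kCacheLine, "each index gets its own line");

// Byte distance between the two indexes of one object.
std::size_t index_distance(const PaddedIndexes& p) {
    const auto r = reinterpret_cast<std::uintptr_t>(&p.read_index);
    const auto w = reinterpret_cast<std::uintptr_t>(&p.write_index);
    assert(w > r);  // members are laid out in declaration order
    return static_cast<std::size_t>(w - r);
}
```

Without the alignment, the producer's stores to `write_index` would keep invalidating the line that holds `read_index` in the consumer's cache, and vice versa, even though the two threads never write the same variable.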

ashvardanian3y ago

Both lock-free and mutex-based approaches have their applications. As a general rule of thumb, for 2-4 threads on the same NUMA node, lock-free is faster. Need more cores? Use a proper heavy mutex.

smat3y ago

I assume the author of the library works under a hard real-time constraint. Under such circumstances (an example would be low-latency audio) you cannot tolerate the latency impact of a sporadic syscall.

quietbritishjim3y ago

Perhaps, but that is almost the opposite of what they said: in hard real time you can tolerate a longer average latency in return for a shorter maximum latency. That matches what I would expect from a lock-free data structure. But it doesn't match the (dubious) claim that locks are usually slower.

duped3y ago

You often can tolerate the latency. The problem of locks is they are potentially unbounded and you can miss a deadline.

It's not about performance so much as determinism.

cjensen3y ago· 8 in thread

An issue in C++ is that it only supports atomic changes to the builtin types. For example, you can only CAS a 64-bit value if your largest integer/pointer type is 64 bits.

Good lock free algorithms use double-width instructions like cmpxchg16b which compare 64-bits but swap 128-bits. You can then use the compared 64-bits as a kind of version number to prevent the a-b-a problem.

Using only the built-in atomics is working with a hand tied behind your back. With the wider version, it's trivial to write multi-producer multi-consumer stacks with no limits to the number of objects stored. It's also pretty easy (if you copy the published algorithm) to do the same with queues.
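The version-number trick can also be approximated without a double-width CAS by referring to nodes through 32-bit indexes, so that index plus version fits in one 64-bit atomic. The sketch below is a hypothetical fixed-capacity stack (`VersionedStack` and its members are made-up names), not the unbounded pointer-based design described above, and a 32-bit version counter can in principle wrap around.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class VersionedStack {
public:
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;
    static constexpr std::uint32_t kCapacity = 1024;

    VersionedStack() : head_(pack(kNil, 0)), free_(pack(kNil, 0)) {
        for (std::uint32_t i = 0; i < kCapacity; ++i) push_node(free_, i);
    }

    bool push(int value) {
        const std::uint32_t n = pop_node(free_);
        if (n == kNil) return false;  // node pool exhausted
        nodes_[n].value = value;
        push_node(head_, n);
        return true;
    }

    bool pop(int& out) {
        const std::uint32_t n = pop_node(head_);
        if (n == kNil) return false;  // stack empty
        out = nodes_[n].value;
        push_node(free_, n);
        return true;
    }

private:
    struct Node {
        int value;
        std::atomic<std::uint32_t> next;
    };

    // Low 32 bits: node index. High 32 bits: version, bumped on every update.
    static std::uint64_t pack(std::uint32_t index, std::uint32_t version) {
        return (static_cast<std::uint64_t>(version) << 32) | index;
    }
    static std::uint32_t index_of(std::uint64_t h) { return static_cast<std::uint32_t>(h); }
    static std::uint32_t version_of(std::uint64_t h) { return static_cast<std::uint32_t>(h >> 32); }

    void push_node(std::atomic<std::uint64_t>& list, std::uint32_t n) {
        assert(n < kCapacity);
        std::uint64_t old = list.load(std::memory_order_relaxed);
        do {
            nodes_[n].next.store(index_of(old), std::memory_order_relaxed);
        } while (!list.compare_exchange_weak(old, pack(n, version_of(old) + 1),
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    std::uint32_t pop_node(std::atomic<std::uint64_t>& list) {
        std::uint64_t old = list.load(std::memory_order_acquire);
        while (index_of(old) != kNil) {
            const std::uint32_t next = nodes_[index_of(old)].next.load(std::memory_order_relaxed);
            // If the head was popped and re-pushed in between (A-B-A), the index
            // may match again but the version will not, so this CAS fails.
            if (list.compare_exchange_weak(old, pack(next, version_of(old) + 1),
                                           std::memory_order_acquire,
                                           std::memory_order_acquire)) {
                return index_of(old);
            }
        }
        return kNil;
    }

    Node nodes_[kCapacity];
    std::atomic<std::uint64_t> head_;  // stack of live nodes
    std::atomic<std::uint64_t> free_;  // stack of unused nodes
};
```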

gpderetta3y ago

Actually C++ only requires TriviallyCopyable for std::atomic. The issue with 2CAS is that intel until very recently only provided cmpxchg16b[1] but no 128-bit atomic loads and stores: SSE 128-bit memory operations were not guaranteed to be atomic (and in fact were observed not to be on some AMDs).

So a 128 bit std::atomic on intel was not only suboptimal as the compiler had to use 2cas for load and stores as well, but actually wrong as an atomic load from readonly memory would fault. So at some point the ABI was changed to use the spinlock pool. Not sure if it has changed since.

If you do it "by hand", when you only need a 2cas, a 128 bit load that is not atomic is fine as any tearing will be detected by the CAS and 'fixed', but it is hard for the compiler to optimize generic code.

[1] which actually does full 128bit compare and swap, you are probably confusing it with the Itanium variant.

moonchild3y ago

128-bit aligned loads and stores are guaranteed to be atomic on all intel and amd cpus that support avx. And if your cpu doesn't support avx, it probably doesn't have enough cores that the performance of concurrent data structures matters that much.

ot3y ago

> Not sure if it has changed since.

It hasn't, Clang/GCC emit a cmpxchg16b only if you opt-in with `-mcx16`, which changes the ABI.

aseipp3y ago

The lack of DWCAS as a primitive in general is really annoying, C++ aside. RISC-V's -A extension has no form of it, either; you only get XLEN-sized AMO + LL/SC (no 2*XLEN LL/SC either!)

It's one of those features that when you want it, you really really really want it, and the substitutions are all pretty bad in comparison.

CyberDildonics3y ago

> Good lock free algorithms use double-width instructions like cmpxchg16b which compare 64-bits but swap 128-bits

The instructions should compare 128 bits and swap 128 bits.

I don't know why 'good' algorithms would use these if they don't need to, because 128 bit operations are slower.

Not only that, 128 bit compare and swap doesn't work if it is not 128 bit aligned while 64 bit compare and swap will work even if they aren't 64 bit aligned.

gpderetta3y ago

On x86, any CAS on a misaligned address that crosses a cache line boundary can fault in the best case (if the mis-feature is disabled by the os) or cost thousands of clock cycles on all cores. So it "works" only for small values of "works".

dnedicOP3y ago

True, that would help immensely in creating MPMC data structures, but as these are SPSC there is no problem. Also to clarify, this is only for the indexes, the data members can be anything.

Using these intrinsics or inline assembly would break portability or create situations where platforms have different feature levels, which is not something I intend to do. I want the library to be compatible with everything from a tiny MCU to x86.

hedora3y ago

I've had good luck assuming double-word CAS, portability-wise. Old ARMs have 32 bit pointers, so 64 bit CAS is pretty good. The main problem is that some algorithms go from a bit under 64 bits for a nonce to a bit under 32, which starts to get into "this could hit in practice" territory.

Notbrainiac3y ago· 5 in thread

Every data structure is lock free. Locks are required when you have multiple writers. The article states that it is useful only in certain circumstances: single-consumer, single-producer scenarios. So yeah, within these assumptions you can make something work.

Koshkin3y ago

Note that "lock-free" is a technical term that has a very specific, precise meaning (implying that the user of the structure does not need to use locking explicitly).

dnedicOP3y ago

Even in single producer single consumer scenarios you need locks for multithreaded/interrupt use if you're not properly using atomics and proper fences.

jimktrains23y ago

You may need a lock if you have operations that are not atomic, even with a single writer, as a reader could find an inconsistency.
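What "properly using atomics and proper fences" means for the single-producer single-consumer case can be sketched as a ring buffer with acquire/release indexes. This is a generic illustration with hypothetical names (`SpscRing`, `demo_spsc`), not the library's actual implementation. The release store on an index publishes the slot write that precedes it, and the matching acquire load on the other side makes that write visible before the slot is touched.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "one slot is kept empty to tell full from empty");

public:
    // Call from the producer thread only.
    bool push(const T& item) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t next = (w + 1) % N;
        if (next == read_.load(std::memory_order_acquire)) return false;  // full
        data_[w] = item;
        write_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    // Call from the consumer thread only.
    bool pop(T& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) return false;  // empty
        out = data_[r];
        read_.store((r + 1) % N, std::memory_order_release);  // hand the slot back
        return true;
    }

private:
    T data_[N];
    std::atomic<std::size_t> read_{0};
    std::atomic<std::size_t> write_{0};
};

// One producer sends 1..count, one consumer sums what it receives.
long demo_spsc(int count) {
    assert(count >= 0);
    SpscRing<int, 64> ring;
    long sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) {
            while (!ring.push(i)) std::this_thread::yield();
        }
    });
    std::thread consumer([&] {
        int value = 0;
        int received = 0;
        while (received < count) {
            if (ring.pop(value)) {
                sum += value;
                ++received;
            } else {
                std::this_thread::yield();
            }
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

With plain (non-atomic) indexes the same code would be a data race: the compiler or CPU could reorder the slot write after the index update, and a reader could observe the inconsistency mentioned above.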

ot3y ago

This seems unnecessarily pedantic. Lock-free conventionally implies concurrent, otherwise it's meaningless.

josephcsible3y ago

Lock-free doesn't mean "doesn't have locks". It means "doesn't need locks to be used concurrently".

whiteboardma3y ago· 3 in thread

I've only looked at the queue implementation, but both push and pop contain obvious race conditions; I would highly suggest adding tests that actually use the data structures from multiple threads.

dnedicOP3y ago

Could you elaborate on the alleged race conditions? Any advice on reliably testing the race conditions? The problem with adding those is the fact that they will give lots of false negatives and if you rely on them you have a problem.

gjadi3y ago

You could use TLA+ to model the data structure operations and check the invariant. Checking the invariant with assert is also useful in my limited experience with concurrency.

https://lamport.azurewebsites.net/tla/tla.html

whiteboardma3y ago

Looking at the Push operation defined in queue_impl.hpp, if multiple threads perform concurrent pushes, they might end up writing their element to the same slot in _data, since the current position _w is not incremented atomically.
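The multi-producer hazard described here, and one common way around it, can be sketched as below. This is a hypothetical append-only buffer (`ClaimBuffer`, `demo_claims`), not the library's queue, which elsewhere in the thread is stated to be single-producer by design. Each producer claims a distinct slot with one atomic `fetch_add`, so two threads can never be handed the same index; a separate load followed by a store of the index would allow exactly the overwrite described.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class ClaimBuffer {
public:
    explicit ClaimBuffer(std::size_t capacity) : slots_(capacity, 0), next_(0) {}

    // Safe to call from many threads: the index is claimed atomically.
    bool append(int value) {
        const std::size_t slot = next_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= slots_.size()) return false;  // out of space
        slots_[slot] = value;                     // no other thread owns this slot
        return true;
    }

    // Only meaningful once all writer threads have been joined.
    const std::vector<int>& slots() const { return slots_; }

private:
    std::vector<int> slots_;
    std::atomic<std::size_t> next_;
};

// Counts how many slots were written after all producers finish.
std::size_t demo_claims(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    ClaimBuffer buf(static_cast<std::size_t>(threads) * static_cast<std::size_t>(per_thread));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) buf.append(1);
        });
    }
    for (auto& th : pool) th.join();
    std::size_t filled = 0;
    for (int v : buf.slots()) filled += (v == 1) ? 1 : 0;
    return filled;
}
```

A real MPMC queue also has to publish each filled slot to consumers, which this sketch deliberately leaves out.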

andersced3y ago· 1 in thread

Should be benchmarked against ->

https://github.com/Deaod/spsc_queue

If proven faster OK.. If not.. Well.. back to the drawing board.

I gave it a try -> https://github.com/andersc/fastqueue

Deaod is the kingpin.

jcelerier3y ago

https://max0x7ba.github.io/atomic_queue/html/benchmarks.html for an existing set of benchmarks where this could be added

samsquire3y ago· 1 in thread

I think I lean towards per-thread sharding instead of mutex based or lock free data structures except for lockfree ringbuffers.

You can get embarrassingly parallel performance if you split your data by thread and aggregate periodically.

If you need a consistent view of your entire set of data, that is a slow path with sharding.
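Per-thread sharding can be sketched in C++ (the linked experiments are in Java) as follows. `Shard` and `sharded_count` are hypothetical names, and the 56 bytes of padding assume a 64-byte cache line. Each thread writes only its own shard, so the hot path needs neither locks nor atomics; the consistent view is the aggregation step, which here happens only after the workers are joined.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per thread, padded so neighbouring counters are 64 bytes apart.
struct Shard {
    std::uint64_t count = 0;
    char padding[56];
};

std::uint64_t sharded_count(unsigned threads, std::uint64_t per_thread) {
    assert(threads > 0);
    std::vector<Shard> shards(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&shards, t, per_thread] {
            // Hot path: thread t touches only shards[t].
            for (std::uint64_t i = 0; i < per_thread; ++i) ++shards[t].count;
        });
    }
    for (auto& th : pool) th.join();

    // Slow path: aggregate a consistent total once the writers are done.
    std::uint64_t total = 0;
    for (const Shard& s : shards) total += s.count;
    return total;
}
```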

In my experiments with multithreaded software I simulate a bank where many bank accounts are randomly withdrawn from and deposited to. https://github.com/samsquire/multiversion-concurrency-contro...

  ShardedBank.java
  ShardedBank2.java
  ShardedBankNonRandom.java
  ShardedHashMap.java
  ShardedNonRandomBank2.java
  ShardedTotalOrder.java
  ShardedTotalRandomOrder.java

I get 700 million requests per second over 12 threads due to the sharding of money over accounts. Here I prioritise throughput of transactions per second over balance checks (a consistent view of the data).

I am also experimenting with left-right concurrency control, which trades memory usage for performance. It basically keeps two copies of the data: an active copy that is currently being read and written, and an inactive copy. You switch the copies around periodically to "make changes visible" to that thread.

saagarjha3y ago

This is a great way to structure your code if it’s possible to do so, but this isn’t always the case unfortunately :(

kilotaras3y ago· 1 in thread

- A lot of code won't work for types with no default constructors, but that is at least a compile error

- Using memcpy[0] for arbitrary types is just wrong, see [1]

[0] https://github.com/DNedic/lockfree/blob/main/lockfree/inc/bi...

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p11...

dnedicOP3y ago

This is already noted in the Queue readme, only the Queue constructs the type, the other 2 data structures are meant for PODs only.

I will take a look at adding support for constructing in-place for the other 2 data structures, but at the moment, they are just for PODs.
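One way to turn the "PODs only" note into a compile-time check is a `static_assert` on `std::is_trivially_copyable`, which is the property `memcpy` actually relies on. The sketch below uses hypothetical names (`copy_elements`, `Sample`) and is not the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Copies `count` elements with memcpy, rejecting unsuitable types at compile time.
template <typename T>
void copy_elements(T* dst, const T* src, std::size_t count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, count * sizeof(T));
}

// Hypothetical POD payload that is fine to memcpy.
struct Sample {
    int id;
    float value;
};

static_assert(std::is_trivially_copyable<Sample>::value, "Sample can be memcpy'd");
// std::string owns heap memory, so copy_elements<std::string> would not compile.
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string must not be memcpy'd");
```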

jylam3y ago· 1 in thread

Maybe I don't understand the concept, but aren't lock-free structures "just" delegating the locking mechanism to the CPU via atomic operations? (Although even if that's the case I can understand the speedup.) And if so, why aren't all those lock-free structures the default, instead of using mutexes?

gpderetta3y ago

Atomic operations can't be meaningfully said to perform any locking.

In any case lock-free is defined in terms of progress guarantees.
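A small CAS loop illustrates what the progress guarantee means in practice. `atomic_store_max` and `demo_max` are hypothetical names. The loop retries only when `compare_exchange_strong` fails, and that happens only because another thread changed the value in the meantime, so some thread always completes its operation. A thread that is suspended in the middle of the loop holds nothing that others need, unlike a thread suspended while holding a mutex.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Raises `target` to `candidate` if `candidate` is larger.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int seen = target.load(std::memory_order_relaxed);
    while (seen < candidate &&
           !target.compare_exchange_strong(seen, candidate, std::memory_order_relaxed)) {
        // CAS failed: `seen` now holds the value another thread installed. Retry.
    }
}

// Each thread proposes a different value; the largest one must win.
int demo_max(int threads) {
    assert(threads > 0);
    std::atomic<int> best{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&best, t] { atomic_store_max(best, t * 10); });
    }
    for (auto& th : pool) th.join();
    return best.load();
}
```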

up2isomorphism3y ago

During my entire career as a systems programmer, I have only seen one situation where lock-free was actually justified; 99% of the time it is completely unnecessary and even harmful.
