> The biggest reason you would want to use a lockfree data structure in such a scenario would be performance. Locking has a non-neglegible runtime cost on hosted systems as every lock requires a syscall.
This is misleading. While a lock does have a runtime cost, in some cases that cost is less than all the force CPU cache synchronization calls that lock free needs to do. With a lock you only have to sync once, after all the operations are done. You need to carefully measure this to see which is more performant for your application.
A mutex transaction generally requires 2 fences, one on lock and one on unlock. The one on unlock would not be strictly necessary in principle (on x86 archs the implicit acquire-release semantics would be enough) but you generally do a CAS anyway to atomically check whether there are any waiters that need a wake-up, which implies a fence.
Good lock-free data structures OTOH require just one CAS (or other fenced RMW) on the shared state.
Besides, at large scale, no matter how small your critical section is, it will be preempted every once in a while, and when you care about tail latency that is visible. Lock-free data structures have more predictable latency characteristics (even better if wait-free).
Computers are shockingly complex. I can't tell you how many times I've reasoned about a system, ran the profiler, and discovered I was completely wrong.
When I was working on an interpreter for a Lisp, I implemented my first cut of scopes (all the variables within a scope and their values) as a naive unsorted list of key/value pairs, thinking I'd optimize later. When I came back to optimize, I reimplemented this as a hashmap, but when I ran my test programs, to my horror, they were all 10x slower. I plugged in a hashmap library used in lots of production systems and got a significant 2x performance gain, which was still slower than looping over an unsorted list of key/value pairs. The fact is, most scopes have <10 variables, and at that size, looping over a list is faster than the constant time of a hashmap. I can reason about why this is, but that's just fitting my reasons to the facts ex-post-facto. Reasoning didn't lead me to the correct answer, observation did.
Returning to parallel data structures, the fact is, I don't know why lock-free structures are faster than mutex-based structures, I just know that they are in every situation where I've profiled them.
Reasoning isn't completely useless--reasoning is how you intuit what you should be profiling. But if you're just reasoning about how two alternatives will perform and not profiling them in real-life production systems you're wasting everyone's time.
Moreover: the point with mutexes is that your data structure can be the optimized assuming no thread-safety. There are lots of, like, hyper-optimized hash table variants (with all sorts of SIMD nonsense and stuff) that are just not possible to do lock-free. The very "lock-freedomness" of the datastructure slows it down enough that in low contention scenarios mutexes clearly would outperform them without being particularly contrived.
In particular, the "cost of synchronization" is roughly the price of a L3 memory read/write, or ~50 cycles. In contrast, a read/write to L1 cache is like 4 cycles latency and can be done multiple times per clock tick, and you can go faster (IE: register space).
So an unsynchronized "counter++" will be done ~4-billion times a second.
But a "counter++; sync()" will slow you down 50x slower. That is to say, the "sync()" is the "expensive part" to think about.
------------
A lock is two sync() statements. One sync() when you lock, and a 2nd sync() when you unlock. Really, half-sync() since one is an acquire half-sync and the other is a release half-sync.
If your lock-free data structure uses more than two sync() statements, you're _probably_ slower than a dumb lock-based method (!!!). This is because the vast majority of lock()/unlocks() are uncontested in practice.
So the TL;DR is, use locks because they're so much easier to think about. When you need more speed, measure first, because it turns out that locks are really damn fast in practice and kind of hard to beat.
That being said, there's other reasons to go lock-free. On some occasions, you will be forced to use lock-free code because blocking could not be tolerated. (Ex: lock-free interrupts. Its fine if the interrupt takes a bit longer than expected during contention). So in this case, even if lock-free is slower, the guarantee for forward progress is what you're going for, rather than performance.
So the study of lock-free data-structures is still really useful. But don't always think of for performance reasons, because these data-structures very well will be slower than std-library + lock() in many cases.
Separately fences and atomic RMWs are slower than plain read/writes, but that's because of the (partially) serialising effects they have on a CPU pipleline, and very little todo with L3 (or any memory) latency.
Case in point: A CAS on intel is 20ish cycles, the L3 latency is 30-40 cycles or more. On the other hand you can have multiple L3 misses outstanding, but CAS hardly pipelines.
mov rax, 1 ; load the value to exchange into rax
acquire_lock:
; attempt to acquire the lock
xchg byte [ADDR_LOCK], al ; atomically swap the lock value with rax
test al, al ; test if the original lock value was 0 (unlocked)
jnz acquire_lock ; if it was not, loop until we can acquire the lock
The downside is you want a backoff to sleep the thread so it doesn't go into a loop. But the actual lock code is simple. You can easily have this be your function "AcquireLock()" and then dowhile(!AcquireLock()) { //pause thread execution. }
And I think this is where they get the syscall being needed, since this will normally require a syscall to pause the thread from the scheduler.
Locks are not composable. Unless you are aware of what locks every function in your call tree is using, you can easily end up with a deadlock.
Additionally, cacheline alignment of indexes is something that's there to mitigate the false sharing phenomenom and reduce some of the cache synchronization cost.
It's not about performance so much as determinism.
Good lock free algorithms use double-width instructions like cmpxchg16b which compare 64-bits but swap 128-bits. You can then use the compared 64-bits as a kind of version number to prevent the a-b-a problem.
Using only the built-in atomics is working with a hand tied behind your back. With the wider version, it's trivial to write multi-producer multi-consumer stacks with no limits to the number of objects stored. It's also pretty easy (if you copy the published algorithm) to do the same with queues.
So a 128 bit std::atomic on intel was not only suboptimal as the compiler had to use 2cas for load and stores as well, but actually wrong as an atomic load from readonly memory would fault. So at some point the ABI was changed to use the spinlock pool. Not sure if it has changed since.
If you do it "by hand", when you only need a 2cas, a 128 bit load that is not atomic is fine as any tearing will be detected by the CAS and 'fixed', but it is hard for the compiler to optimize generic code.
[1] which actually does full 128bit compare and swap, you are probably confusing it with the Itanium variant.
It hasn't, Clang/GCC emit a cmpxchg16b only if you opt-in with `-mcx16`, which changes the ABI.
It's one of those features that when you want it, you really really really want it, and the substitutions are all pretty bad in comparison.
The instructions should compare 128 bits and swap 128 bits.
I don't know why 'good' algorithms would use these if they don't need to, because 128 bit operations are slower.
Not only that, 128 bit compare and swap doesn't work if it is not 128 bit aligned while 64 bit compare and swap will work even if they aren't 64 bit aligned.
Using these intrinsics or inline assembly would break portability or create situations where platforms have different feature levels, which is not something I intend to do. I want the library to be compatible with everything from a tiny MCU to x86.
https://github.com/Deaod/spsc_queue
If proven faster OK.. If not.. Well.. back to the drawing board.
I gave it a try -> https://github.com/andersc/fastqueue
Deaod is the kingpin.
You can get embarassingly parallel performance if you split your data by thread and aggregate periodically.
If you need a consistent view of your entire set of data, that is a slow path with sharding.
In my experiments with multithreaded software I simulate a bank where many bankaccounts are randomly withdrawn from and deposited to. https://github.com/samsquire/multiversion-concurrency-contro...
ShardedBank.java
ShardedBank2.java
ShardedBankNonRandom.java
ShardedHashMap.java
ShardedNonRandomBank2.java
ShardedTotalOrder.java
ShardedTotalRandomOrder.java
I get 700 million requests per second over 12 threads due to the sharding of money over accounts. Here I prioritise throughput of transacitons per second over balance checks (a consistent view of the data).I am also experimenting with left-right concurrency control which trades memory usage for performance, it basically keeps two copies of data, one that is currently being read and written to and another inactive copy that is not active. You switch the data structures around periodically to "make changes visible" to that thread.
- Using memcpy[0] for arbitrary types is just wrong, see [1]
[0] https://github.com/DNedic/lockfree/blob/main/lockfree/inc/bi...
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p11...
I will take a look at adding support for constructing in-place for the other 2 data structures, but at the moment, they are just for PODs.
In any case lock-free is defined in term of progress guarantees.