Optimizing a lock-free ring buffer

(david.alvarezrosa.com)

104 points · dalvrosa · 3mo ago · 95 comments

56 comments · 15 top-level

kristianp3mo ago· 11 in thread

This is in C++, other languages have different atomic primitives.

smj-edison3mo ago

Don't most people use C++11 atomics now? You have SeqCst, Release, Acquire, and Relaxed (with Consume deprecated due to the difficulty of implementing it). You can do loads, stores, and exchanges with each ordering type. Zig, Rust, and C all use the same orderings. I guess Java has its own memory model since it's been around a lot longer, but most people have standardized around C++'s design.

Which is a slight shame since Load-Linked/Store-Conditional is pretty cool, but I guess that's limited to ARM anyways, and now they've added extensions for CAS due to speed.

superxpro123mo ago

I've taken an interest in lock-free queues for ultra-low power embedded... think Cortex-m0, or even avr/pic.

Things get interesting when you're working with a CPU that lacks the ldrex/strex instructions that make this all work. I think your only option at that point is to disable/enable interrupts. If anyone has any insights into this constraint, I'd love to hear them.

j_seigh2mo ago

My impression was LL/SC had forward progress issues due to the difficulties of preventing false sharing of the locked memory reservation region. Updates into that region would keep invalidating the lock.

I had a version of atomic* reference counting that used LL/SC on a PPC Mac mini alongside x86 versions using cmpxchg16b. The code used to be on SourceForge before it went to the dark side.

An early posting of the idea before I got around to implementing it. https://groups.google.com/g/comp.programming.threads/c/HZqn5...

* std::shared_ptr and Rust Arc aren't actually atomic. You have to own a reference to make a copy. They are what POSIX calls thread-safe. With atomic reference counting, if you copy a reference, you either get a valid reference or null. Like Java references.

loeg3mo ago

LL/SC is still hinted at in the C++11 model with std::atomic<T>::compare_exchange_weak:

https://en.cppreference.com/w/cpp/atomic/atomic/compare_exch...

jitl3mo ago

Really? Pretty much all atomics I’ve used have load and store for various integer sizes. I wrote a ring buffer in Go that’s very similar to the final design here using similar atomics.

https://pkg.go.dev/sync/atomic#Int64

wat100003mo ago

They generally map directly to concepts in the CPU architecture. On many architectures, load/store instructions are already guaranteed to be atomic as long as the address is properly aligned, so atomic load/store is just a load/store. Non-relaxed ordering may emit a variant load/store instruction or a separate barrier instruction. Compare-exchange will usually emit a compare and swap, or load-linked/store-conditional sequence. Things like atomic add/subtract often map to single instructions, or might be implemented as a compare-exchange in a loop.

The exact syntax and naming will of course differ, but any language that exposes low-level atomics at all is going to provide a pretty similar set of operations.
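
To make the "compare-exchange in a loop" point concrete, here is a minimal C++ sketch of an atomic add built from compare_exchange_weak. The helper name fetch_add_via_cas is made up for illustration; real code would simply call std::atomic<int>::fetch_add.

```cpp
#include <atomic>

// Hypothetical helper (not a standard function): an atomic add built from a
// compare-exchange loop. Returns the previous value, like fetch_add does.
inline int fetch_add_via_cas(std::atomic<int>& counter, int delta) {
    int expected = counter.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously (for example on LL/SC
    // hardware). On failure it reloads `expected` with the current value,
    // so the loop simply retries with fresh data.
    while (!counter.compare_exchange_weak(expected, expected + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
    }
    return expected;
}
```

On targets with a native atomic add the compiler will normally emit that single instruction for fetch_add instead of a loop like this.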

dalvrosaOP3mo ago

Nice one, thanks for sharing. Do you wanna share the ring buffer code itself?

blacklion3mo ago

JVM has almost the same (C++ memory model was modeled after JVM one, with some subtle fixes).

dalvrosaOP3mo ago

Yeah, this is quite specific to C++ (at a syntax level)

amluto3mo ago

Huh? Other languages that compile to machine code and offer control over struct layout and access to the machine’s atomics will work the same way.

Sure, C++ has a particular way of describing atomics in a cross-platform way, but the actual hardware operations are not specific to the language.

dalvrosaOP3mo ago

Yeah, different languages will have different syntaxes and ways of using atomics

But at the hardware level they are all kind of the same

ramon1563mo ago· 4 in thread

Something to add to this; if you're focussing on these low-level optimizations, make sure the device this code runs on is actually tuned.

A lot of people focus on the code and then assume the device in question is only there to run it. There's so much you can tweak. I don't always measure it, but last time I saw at least a 20% improvement in Network throughput just by tweaking a few things on the machine.

hansvm3mo ago

That reminds me of one of the easiest big wins I've had in my career. SystemD was causing issues, so I slapped in Gentoo with the real-time kernel patch. Peak latency (practically speaking, the only core metric we cared about -- some control loop doing a bunch of expensive math and interacting with real hardware) went down 5000x.

That specific advice isn't terribly transferable (you might choose to hack up SystemD or some other components instead, maybe even the problem definition itself), but the general idea of measuring and tuning the system running your code is solid.

kajaktum3mo ago

What do you think is causing the issue? We are having the same kind of problem. We use core isolation, no_hz, and core pinning, but I am still getting interrupted by NMIs.

dalvrosaOP3mo ago

Agreed. For benchmarking I used this <https://github.com/david-alvarez-rosa/CppPlayground/blob/mai...> which relies on GoogleBenchmark and pins producer/consumer threads to dedicated CPU cores

What else could be improved? Would like to learn :)

Maybe using huge pages?

dijit3mo ago

kernel tickrate is a pretty big one, most people don't bother and use what their OS ships with.

Disabling c-states, pinning network interfaces to dedicated cores (and isolating your application from those cores) and `SCHED_FIFO` (chrt -f 99 <prog>) helps a lot.

Transparent hugepages increase latency without you being aware of when it happens, I usually disable that.

Idk, there's a bunch but they all depend on your use case. For example, I always disable hyperthreading because I care more about latency than processing power, and I don't want to steal cache from my workload randomly. But some people have more I/O-bound workloads, and hyperthreading is just a strict improvement in those situations.

erickpintor3mo ago· 4 in thread

Great post!

Would you mind expanding on the correctness guarantees enforced by the atomic semantics used? Are they ensuring two threads can't push to the same slot nor pop the same value from the ring? This type of atomic coordination usually comes from CAS or atomic increment calls, which I'm not seeing, so I'm interested in hearing your take on it.

erickpintor3mo ago

I see you replied on comment below with:

> note that there are only one consumer and one producer

That clarifies things, as you don't need multi-thread coordination on reads or writes if you assume a single producer and a single consumer.

dalvrosaOP3mo ago

Exactly, that's right

dalvrosaOP3mo ago

Thanks! That's not ensured; the optimizations are only valid due to these constraints:

- One single producer thread

- One single consumer thread

- Fixed buffer capacity

So to answer

> Are they ensuring two threads can't push to the same slot nor pop the same value from the ring?

No need for this usecase :)

loeg3mo ago

This is a SPSC queue -- there aren't multiple writers to coordinate, nor readers. It simplifies the design.
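
For readers wondering why no CAS is needed, here is a minimal SPSC sketch under those constraints. It is not the article's code: the class name SpscRing and its members are assumptions for illustration, and it keeps one slot empty to tell "full" from "empty". Each index has exactly one writer, so plain atomic loads and stores are enough.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical sketch, not the article's implementation. Valid only with
// exactly one thread calling push() and exactly one thread calling pop().
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    // Producer only. Returns false when the ring is full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1 == Capacity) ? 0 : head + 1;
        // Acquire pairs with the consumer's release store to tail_, so the
        // slot about to be overwritten has already been read.
        if (next == tail_.load(std::memory_order_acquire)) return false;
        buffer_[head] = value;
        // Release makes the slot write above visible before the new index.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer only. Returns false when the ring is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = buffer_[tail];
        const std::size_t next = (tail + 1 == Capacity) ? 0 : tail + 1;
        tail_.store(next, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

With a second producer, two threads could read the same head and write the same slot, which is exactly the case that would need CAS or an atomic increment.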

jeffbee3mo ago· 4 in thread

It's lock-free because it uses ordered loads and stores, which is also how you implement locks. I find the semantic distinction unconvincing. The post is really about how slow the default STL mutex implementation is.

pjdesno3mo ago

That's what "lock-free" means. You still need to use the hardware mechanisms provided for atomicity.

The whole point of lock-free data structures and algorithms is that sometimes you can do better by using these atomic operations inside your own code, rather than using a one-size-fits-all mutex based on those same atomic operations.

(Note that I say "sometimes". Too many people believe that lock-free structures are always faster; as always, your mileage may vary. In this case it's a huge win, to the point where I would bet it almost always moves the bottleneck to the code actually using the ring buffer.)

jeffbee3mo ago

My point is that the "huge win" is expressed in terms of a bogus and misleading baseline. The article moves immediately from the worst possible lock-based implementation to a pretty bad atomics-based implementation. The final punchline of the article is expressed as a ratio of the bad baseline. To make an honest conclusion, the article should also explore better ways of using the locks.

loeg3mo ago

There are real practical implications of both the producer and consumer mutating the same cache line to take a lock that is fundamentally avoided by this "lock-free" design. It isn't meaningless.

jeffbee3mo ago

That only explains the last stage. In order to steelman the mutex alternative, everything before "further optimization" should have used 2 critical sections. That would give a realistic baseline.

JonChesterfield3mo ago· 4 in thread

It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.

Also doesn't have fences on the store, has extra branches that shouldn't be there, and is written in really stylistically weird c++.

Maybe an llm that likes a different language more, copying a broken implementation off github? Mostly commenting because the initial replies are "best" and "lol", though I sympathise with one of those.

loeg3mo ago

> It's obviously, trivially broken. Stores the index before storing the value, so the other thread reads nonsense whenever the race goes against it.

Are we reading the same code? The stores are clearly after value accesses.

> Also doesn't have fences on the store

?? It uses acquire/release semantics seemingly correctly. Explicit fences are not required.
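
A minimal sketch of the release/acquire guarantee in question, reduced to one plain variable and one atomic flag. The Mailbox type and its member names are hypothetical. In the C++ memory model, writes sequenced before a release store are visible to any thread whose acquire load reads the value written by that store, even when those writes go to non-atomic variables.

```cpp
#include <atomic>

// Hypothetical names. One thread calls publish(), another polls try_consume().
struct Mailbox {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the only atomic

    void publish(int value) {
        payload = value;  // (1) plain write, sequenced before (2)
        // (2) release store: (1) may not be reordered after it
        ready.store(true, std::memory_order_release);
    }

    bool try_consume(int& out) {
        // If this acquire load observes the store from publish(), the write
        // to payload that preceded it is guaranteed to be visible here.
        if (!ready.load(std::memory_order_acquire)) return false;
        out = payload;
        return true;
    }
};
```

The ring buffer follows the same pattern, with the buffer slot in the role of payload and the head index in the role of ready.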

JonChesterfield3mo ago

Push:

    buffer_[head] = value;
    head_.store(next_head, std::memory_order_release);
    return true;

There's no relationship between the two written variables. Stores to the two are independent and can be reordered. The acquire/release applies to the index, not to the unrelated non-atomic buffer located near the index.

dalvrosaOP3mo ago

Sorry, but that's not actually true. There are no data races, the atomics prevent that (note that there are only one consumer and one producer)

Regarding the style, it follows the "almost always auto" idea from Herb Sutter

secondcoming3mo ago

If you enforce that the buffer size is a power of 2 you just use a mask to do the

    if (next_head == buffer.size())
        next_head = 0;

part
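
A minimal sketch of that masking trick, assuming the capacity is a compile-time constant. The helper names is_power_of_two and next_index are made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: true when n has exactly one bit set.
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper: advance an index with wraparound, without a branch.
template <std::size_t Capacity>
constexpr std::size_t next_index(std::size_t index) {
    static_assert(is_power_of_two(Capacity), "Capacity must be a power of two");
    // Equivalent to: ++index; if (index == Capacity) index = 0;
    return (index + 1) & (Capacity - 1);
}
```

For example, next_index<8>(7) wraps to 0 because 8 & 7 is 0.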

dalvrosaOP3mo ago· 3 in thread

From 12M ops/s to 305M ops/s on a lock-free ring buffer.

In this post, I walk you step by step through implementing a single-producer single-consumer queue from scratch.

This pattern is widely used to share data between threads in the lowest-latency environments.

loeg3mo ago

Your blog footer mentions that code samples are GPL unless otherwise noted. You don't seem to note otherwise in the article, so -- do you consider these snippets GPL licensed?

dalvrosaOP3mo ago

Actually I'm not sure. GPL was for source code of the website itself

I guess the code samples inside the post are under https://david.alvarezrosa.com/LICENSE

But feel free to ping me if you need a different license, I'm quite open about it

random33mo ago

Ring buffers never get old. Here’s a useful mention of some of the most extensive technical work by LMAX team over 15 years ago https://martinfowler.com/articles/lmax.html

kevincox3mo ago· 2 in thread

Random idea: If you have a known sentinel value for empty could you avoid the reader needing to read the writer's index? Just try to read, if it is empty the queue is empty, otherwise take the item and put an empty value there. Similarly for writing you can check the value, if it isn't empty the queue is full.

It seems that in this case as you get contention the faster end will slow down (as it is consuming what the other end just read) and this will naturally create a small buffer and run at good speeds.

The hard part is probably that sentinel and ensuring that it can be set/cleared atomically. On Rust you can do `Option<T>` to get a sentinel for any type (and it very often doesn't take any space) but I don't think there is an API to atomically set/clear that flag. (Technically I think this is always possible because the sentinel that Option picks will always be small even if the T is very large, but I don't think there is an API for this.)
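
A rough C++ sketch of the sentinel idea, assuming the payload is a 64-bit integer and that 0 can be reserved as the "empty" marker. SentinelRing and its members are hypothetical names. Each side keeps a private position and only ever touches the slots, so neither reads the other's index.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: 0 is reserved as the "empty" sentinel, so only
// non-zero values can be queued. One producer thread, one consumer thread.
template <std::size_t Capacity>
class SentinelRing {
public:
    SentinelRing() {
        for (auto& slot : slots_) slot.store(kEmpty, std::memory_order_relaxed);
    }

    // Producer only: never reads the consumer's position.
    bool push(std::uint64_t value) {
        if (value == kEmpty) return false;  // sentinel is not a valid payload
        auto& slot = slots_[write_pos_];
        // A non-empty slot means the consumer has not taken it yet: full.
        if (slot.load(std::memory_order_acquire) != kEmpty) return false;
        slot.store(value, std::memory_order_release);
        write_pos_ = (write_pos_ + 1) % Capacity;
        return true;
    }

    // Consumer only: never reads the producer's position.
    bool pop(std::uint64_t& out) {
        auto& slot = slots_[read_pos_];
        const std::uint64_t value = slot.load(std::memory_order_acquire);
        if (value == kEmpty) return false;  // nothing published here yet
        out = value;
        slot.store(kEmpty, std::memory_order_release);  // hand the slot back
        read_pos_ = (read_pos_ + 1) % Capacity;
        return true;
    }

private:
    static constexpr std::uint64_t kEmpty = 0;
    std::array<std::atomic<std::uint64_t>, Capacity> slots_;
    std::size_t write_pos_ = 0;  // touched only by the producer
    std::size_t read_pos_ = 0;   // touched only by the consumer
};
```

This only works directly for payloads that fit in a single atomic word. Larger T would need a separate per-slot flag, which is where the atomic set/clear problem comes in.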

loeg3mo ago

Yeah, or you could put a generation number in each slot adjacent to T and a read will only be valid if the slot's generation number == the last one observed + 1, for example. But ultimately the reader and writer still need to coordinate here, so we're just shifting the coordination cache line from the writer's index to the slot.

kevincox3mo ago

I think the key difference is that they only need to coordinate when the reader and writer are close together. If that slows one end down they naturally spread apart. So you don't lose throughput, only a little latency in the contested case.

kev9462mo ago· 2 in thread

Is it okay for push and pop to have noexcept when copy assignment of T could throw?

loeg2mo ago

I'm not sure C++ provides a more satisfying answer here than "don't use this with a T that throws in copy." (And also, why would you want that?)
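
One possible middle ground, sketched with a hypothetical Slot type rather than OP's class: make the noexcept conditional on T, so the promise is only made when T's copy assignment cannot throw. A static_assert on the same trait would instead reject such types at compile time.

```cpp
#include <string>
#include <type_traits>

// Hypothetical single-slot stand-in for the queue's storage.
template <typename T>
struct Slot {
    T value{};

    // noexcept only when T's copy assignment is itself declared noexcept.
    void set(const T& v) noexcept(std::is_nothrow_copy_assignable<T>::value) {
        value = v;
    }
};
```

With this, Slot<int>::set is noexcept, while Slot<std::string>::set is not, because std::string's copy assignment may allocate and throw.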

kev9462mo ago

I was just wondering, because the functions are noexcept in OP's code

mikhmha3mo ago· 1 in thread

Lock-free ring buffer is my favorite data structure. I remember implementing it in C++ and then using a legitimate implementation in the form of boost:SPSC for prod. The idea is so simple. And then I started thinking about designing some programming language or framework around the concept, only to then stumble upon the idea of "message passing" for concurrency. Which of course led me to learn about Erlang. And then I went down the Erlang rabbit hole. It might have been a mistake...I made more money doing C++.

dalvrosaOP3mo ago

Lol. Funny story :)

pixelpoet3mo ago· 1 in thread

Great article, thanks for sharing. And such a lovely website too :)

dalvrosaOP3mo ago

Thanks for the feedback <3

nitwit0053mo ago· 1 in thread

It would be nice to have an example use case where the technique would show a benefit.

It seems relatively rare to have a single producer and consumer thread, and be worth polling a ring buffer.

ohazi3mo ago

I use my own very similar version of this spsc lock-free ring buffer on almost every embedded project I work on that has to stream any sort of sampled data (e.g. audio). You can even have the consumer end be a DMA into something like a uart or USB peripheral so your microcontroller userspace doesn't have to touch the hardware.

Blackthorn3mo ago· 1 in thread

I had what I thought was a pretty good implementation, but I wasn't aware of the cache line bouncing. Looks like I've got some updates to make.
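
For anyone making the same update, a minimal sketch of the usual fix: give each index its own cache line. The 64-byte line size is an assumption that holds on most x86-64 parts, and PaddedIndices is a made-up name.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size. Where the standard library provides it,
// std::hardware_destructive_interference_size can be used instead.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout: the producer writes head and the consumer writes
// tail, so placing them on separate lines avoids invalidating each other.
struct PaddedIndices {
    alignas(kCacheLine) std::atomic<std::size_t> head{0};
    alignas(kCacheLine) std::atomic<std::size_t> tail{0};
};

static_assert(alignof(PaddedIndices) == kCacheLine, "struct is line-aligned");
static_assert(sizeof(PaddedIndices) == 2 * kCacheLine, "one line per index");
```

Without the alignas, both indices would typically share one line, and every push would force the consumer's core to re-fetch it (and vice versa).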

dalvrosaOP3mo ago

Glad that it helps :)

brcmthrowaway3mo ago· 1 in thread

Is there a C library where I can get these data structures for free?

loeg3mo ago

ConcurrencyKit ck_ring. The SPSC macros are the most similar to this article:

https://github.com/concurrencykit/ck/blob/master/include/ck_...

brcmthrowaway3mo ago· 1 in thread

Random q: What was the first cpu to support atomic instructions?

jeffbee3mo ago

I don't know but the IBM 360 and the DEC PDP-10 both had them. Those are the earliest systems I ever saw.

sanufar3mo ago· 1 in thread

Super fun, def gonna try this on my own time later

dalvrosaOP3mo ago

Feel free to share your findings
