An Introduction to Lock-Free Programming (opens in new tab)

(preshing.com)

128 pointspreshing14y ago20 comments

20 comments

15 comments · 6 top-level

haberman14y ago· 4 in thread

Nice article, though I think that any intro-level material on lock-free programming should always include a "don't try this at home for anything important" warning. Until you have some experience with this stuff you will almost certainly make mistakes, but these mistakes might only manifest themselves as crashes in extremely rare circumstances.

I wrote my first lock-free code in 2004 based on reading some papers by Maged Michael from IBM. I wrote a lock-free FIFO in PowerPC assembly, and was convinced it was safe and robust. When I emailed Maged about it, he pointed out that if a thread was suspended on one specific instruction and some specific memory was unmapped before it could run again, the program could crash. I was amazed; I had thought hard about this algorithm, but had completely missed that possibility.

Some other specific notes about the article:

> Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free.

The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm

> Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done.

One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem (http://en.wikipedia.org/wiki/ABA_problem). In practice this doesn't matter that much since x86 doesn't support LL/SC, but I just think it's an interesting factoid to know.

> For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but the x86/64 family of processors from Intel and AMD cannot.

(I've edited my reply here since my original assertion was incorrect). It's true that x86/64 won't reorder stores (see http://en.wikipedia.org/wiki/Memory_ordering for details) but it will reorder loads, so memory barriers are still required in some situations. However I believe that the atomic instructions ("lock cmpxchg", and "lock xadd") imply full barriers on x86.

preshingOP14y ago

I totally agree with you about the dangers of lock-free programming, and that every programmer will certainly make mistakes. I feel strongly enough about it that I actually wrote several paragraphs about the importance of stress testing lock-free algorithms, which I eventually decided to break off into a separate post... I might get around to publishing that, we'll see. We're not the first to feel this way :)

> The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm

It's true, and I had a hard time reconciling that Wikipedia page with how others had been describing lock-free programming, which is why I spent so much of the post on semantics.

> One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem

Good point. Though I suspect if you access other cache lines between the LL & SC, it may invalidate the reservation.

> I don't think this is true about x86/64.

It is true about x86/64, at least for the most common, non-SSE instructions and normal, non-write-combined memory. I'll update the post to be more specific. See volume 3, section 8.2.2 of the Intel 64 and IA-32 Developer's Manuals: (http://www.intel.com/content/www/us/en/processors/architectu...)

- Reads are not reordered with other reads. - Writes are not reordered with older reads. - Writes to memory are not reordered with other writes (with a few exceptions for special instructions)

ajross14y ago

> It is true about x86/64, at least for the most common, non-SSE instructions and normal, non-write-combined memory

To embellish further and avoid jargon for those who didn't parse this, it basically means "It is true except for the interfaces which explicitly document where it's not.".

Intel CPUs have had a "uncached/write-combining" mode for years (originally specified through special MTRR registers, then in the Page Attribute Table, and most recently in the "non-temporal" MOV instructions in SSE) which officially relaxes the requirements for store ordering. The reason for having two sets of semantics is performance: when dealing with uncached memory (typically memory used by another device, e.g. a video framebuffer) combining a bunch of writes to the same DRAM burst unit (or set of such units) is much faster than doing each word-sized I/O individually.

haberman14y ago

> It is true about x86/64

Yes, sorry I misspoke initially about stores being reordered, but it does still appear that stores can be reordered after loads, see: http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fen... . (Your Intel manual reference is surely a more authoritative source but I'm not able to do a deep dive into it at the moment).

1 more reply

scott_s14y ago

To support your first point, I implemented Maged's lock-free allocator so we could compare against it (http://www.scott-a-s.com/files/michael.tar.gz), and I never got it working correctly on Itanium. I spent weeks trying to figure out what was wrong. I eventually concluded that I did not understand the Itanium architecture and the algorithms used in the allocator well enough to be confident I could fix it - and that it wasn't worth improving my understanding of either. (I asked Maged if he had any experience with his allocator on Itanium, and he did not.)

tptacek14y ago· 4 in thread

The idea that made this click for me 10 years or so ago was that distributed systems express concurrent processes but can't reasonably use locks; model your threads/processes/whatever as if they were nodes in a distributed system, and use distributed system techniques to ensure things converge.

Retric14y ago

That's not exactly a shared memory model. A better example would be a static look-up table for fixed game state information and a bunch of AI worker threads.

With the correct protocol you can even write to that shared memory. For example a sudoku solver where each tread looks for contradictions and then overwrites the shared states with their findings. This works because there is no contradictions, dirty reads are ok, and there is no contention with all threads writing 0's and nobody writing 1's. You can go even further where any thread can flip the error flag if a contradiction shows up or solution flag if a solution is found, but again it works because nothing clears those flags.

PS: The article's author uses a less strict defintion of lock free where it's ok to use locks for individual atomic operations like counters. But, that still creates performance issues at scale.

preshingOP14y ago

Actually, I thought it was pretty clear in the article that it's not ok to use locks (mutexes) under the definition given. I tried pretty hard to use the most widely accepted definition of lock-free, even exchanging e-mails with Maurice Herlihy about it.

1 more reply

scott_s14y ago

The article's author uses a less strict defintion of lock free where it's ok to use locks for individual atomic operations like counters.

The author's definition does not allow one to use locks to make individual operations atomic. A process could always crash after obtaining the lock, but before releasing it, causing the entire application to become stuck.

strlen14y ago

Lock free algorithms still use primitives like memory barriers. A memory barrier in a shared memory system is expensive, but not nearly as difficult as a memory barrier in a distributed system.

Additionally, distributed systems not only experience failures of individual components, but they're often in a state where you can't reliably detect failures (e.g., tell a failed node apart from a node that's slow or a node that you can't temporarily talk to while other nodes can).

It's also possible to think of it as a continuum: for example, failure detection is simpler in a distributed system connected via Infiniband than in a system connected via commodity ethernet; locality matters in NUMA CPUs and so on. The idea that a commodity multi-core machine has some properties of distributed system is not without merit, but it isn't particularly useful or actionable as a working model.

rdtsc14y ago· 1 in thread

There is also a niche research area of lock-free data structures and algorithms.

On a more practical side, check out http://www.liblfds.org/ -- it is a library of lock free data structures in C. (I am not the author). I have successfully used this library in some realtime projects.

sbahra14y ago

http://concurrencykit.org/ is also available, used in several production systems.

sbahra14y ago

A while back I linked to the following: http://concurrencykit.org/presentations/lockfree_introductio...

Please make sure to navigate all the way down, before navigating to the right. Press the space bar to see the full layout of the presentation.

rossjudson14y ago

Very good overview. I've built lock-free hash maps, and this article covers the areas you need to be aware of. I'd also suggest looking up false sharing, and also making sure that you really understand exactly how the memory barriers are operating.

cheatercheater14y ago

Make sure to check out pi calculus for a different model of concurrency without locks: http://mainisusuallyafunction.blogspot.com/2011/09/lambda-to...

j / k navigate · click thread line to collapse

20 comments

15 comments · 6 top-level

haberman14y ago· 4 in thread

Some other specific notes about the article:

> Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free.

The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm

> Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done.

> For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but the x86/64 family of processors from Intel and AMD cannot.

preshingOP14y ago

It's true, and I had a hard time reconciling that Wikipedia page with how others had been describing lock-free programming, which is why I spent so much of the post on semantics.

> One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem

Good point. Though I suspect if you access other cache lines between the LL & SC, it may invalidate the reservation.

> I don't think this is true about x86/64.

- Reads are not reordered with other reads. - Writes are not reordered with older reads. - Writes to memory are not reordered with other writes (with a few exceptions for special instructions)

ajross14y ago

> It is true about x86/64, at least for the most common, non-SSE instructions and normal, non-write-combined memory

To embellish further and avoid jargon for those who didn't parse this, it basically means "It is true except for the interfaces which explicitly document where it's not.".

haberman14y ago

> It is true about x86/64

1 more reply

scott_s14y ago

tptacek14y ago· 4 in thread

Retric14y ago

That's not exactly a shared memory model. A better example would be a static look-up table for fixed game state information and a bunch of AI worker threads.

PS: The article's author uses a less strict defintion of lock free where it's ok to use locks for individual atomic operations like counters. But, that still creates performance issues at scale.

preshingOP14y ago

1 more reply

scott_s14y ago

The article's author uses a less strict defintion of lock free where it's ok to use locks for individual atomic operations like counters.

strlen14y ago

Lock free algorithms still use primitives like memory barriers. A memory barrier in a shared memory system is expensive, but not nearly as difficult as a memory barrier in a distributed system.

rdtsc14y ago· 1 in thread

There is also a niche research area of lock-free data structures and algorithms.

sbahra14y ago

http://concurrencykit.org/ is also available, used in several production systems.

sbahra14y ago

A while back I linked to the following: http://concurrencykit.org/presentations/lockfree_introductio...

Please make sure to navigate all the way down, before navigating to the right. Press the space bar to see the full layout of the presentation.

rossjudson14y ago

cheatercheater14y ago

Make sure to check out pi calculus for a different model of concurrency without locks: http://mainisusuallyafunction.blogspot.com/2011/09/lambda-to...

j / k navigate · click thread line to collapse