I wrote my first lock-free code in 2004 based on reading some papers by Maged Michael from IBM. I wrote a lock-free FIFO in PowerPC assembly, and was convinced it was safe and robust. When I emailed Maged about it, he pointed out that if a thread was suspended on one specific instruction and some specific memory was unmapped before it could run again, the program could crash. I was amazed; I had thought hard about this algorithm, but had completely missed that possibility.
Some other specific notes about the article:
> Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free.
The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm
> Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done.
One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem (http://en.wikipedia.org/wiki/ABA_problem). In practice this doesn't matter that much since x86 doesn't support LL/SC, but I just think it's an interesting factoid to know.
> For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but the x86/64 family of processors from Intel and AMD cannot.
(I've edited my reply here since my original assertion was incorrect). It's true that x86/64 won't reorder stores (see http://en.wikipedia.org/wiki/Memory_ordering for details) but it will reorder loads, so memory barriers are still required in some situations. However I believe that the atomic instructions ("lock cmpxchg", and "lock xadd") imply full barriers on x86.
> The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm
It's true, and I had a hard time reconciling that Wikipedia page with how others had been describing lock-free programming, which is why I spent so much of the post on semantics.
> One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem
Good point. Though I suspect if you access other cache lines between the LL & SC, it may invalidate the reservation.
> I don't think this is true about x86/64.
It is true about x86/64, at least for the most common, non-SSE instructions and normal, non-write-combined memory. I'll update the post to be more specific. See volume 3, section 8.2.2 of the Intel 64 and IA-32 Developer's Manuals: (http://www.intel.com/content/www/us/en/processors/architectu...)
- Reads are not reordered with other reads. - Writes are not reordered with older reads. - Writes to memory are not reordered with other writes (with a few exceptions for special instructions)
To embellish further and avoid jargon for those who didn't parse this, it basically means "It is true except for the interfaces which explicitly document where it's not.".
Intel CPUs have had a "uncached/write-combining" mode for years (originally specified through special MTRR registers, then in the Page Attribute Table, and most recently in the "non-temporal" MOV instructions in SSE) which officially relaxes the requirements for store ordering. The reason for having two sets of semantics is performance: when dealing with uncached memory (typically memory used by another device, e.g. a video framebuffer) combining a bunch of writes to the same DRAM burst unit (or set of such units) is much faster than doing each word-sized I/O individually.
Yes, sorry I misspoke initially about stores being reordered, but it does still appear that stores can be reordered after loads, see: http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fen... . (Your Intel manual reference is surely a more authoritative source but I'm not able to do a deep dive into it at the moment).
With the correct protocol you can even write to that shared memory. For example a sudoku solver where each tread looks for contradictions and then overwrites the shared states with their findings. This works because there is no contradictions, dirty reads are ok, and there is no contention with all threads writing 0's and nobody writing 1's. You can go even further where any thread can flip the error flag if a contradiction shows up or solution flag if a solution is found, but again it works because nothing clears those flags.
PS: The article's author uses a less strict defintion of lock free where it's ok to use locks for individual atomic operations like counters. But, that still creates performance issues at scale.
The author's definition does not allow one to use locks to make individual operations atomic. A process could always crash after obtaining the lock, but before releasing it, causing the entire application to become stuck.
Additionally, distributed systems not only experience failures of individual components, but they're often in a state where you can't reliably detect failures (e.g., tell a failed node apart from a node that's slow or a node that you can't temporarily talk to while other nodes can).
It's also possible to think of it as a continuum: for example, failure detection is simpler in a distributed system connected via Infiniband than in a system connected via commodity ethernet; locality matters in NUMA CPUs and so on. The idea that a commodity multi-core machine has some properties of distributed system is not without merit, but it isn't particularly useful or actionable as a working model.
On a more practical side, check out http://www.liblfds.org/ -- it is a library of lock free data structures in C. (I am not the author). I have successfully used this library in some realtime projects.
Please make sure to navigate all the way down, before navigating to the right. Press the space bar to see the full layout of the presentation.