"And there you have it! During this run, a memory reordering was detected approximately once every 6600 iterations. When I tested in Ubuntu on a Core 2 Duo E6300, the occurrences were even more rare. One begins to appreciate how subtle timing bugs can creep undetected into lock-free code."
If you were shipping software to millions of people, you could quite plausibly never see a specific race condition yourself, even though it ends up occurring on many tens of thousands of machines.
And then when you start to consider interactions between infrequent errors...
For example, every few months, OSX kernel panics on my MacBook Air, with no obvious trigger. The backtrace always implicates the AHCI driver for the SATA SSD. Still, Apple evidently don't receive enough crash reports for this particular bug to actually fix it, as it's happened since I first bought the Air in November 2010.
I have no idea whether it's a threading bug like the one in the article (I'm not about to run my system in single-core mode for months just to find out) or maybe a race condition with DMA or just a simple logic error that only rarely applies. It might even be specific to the exact SSD model and revision I ended up with in my device. But that's pretty much irrelevant - there certainly hasn't been a widespread outcry about it.
Personally, as a developer, I would definitely want to fix that kind of bug in anything I'd built. But tracking it down might take weeks of costly developer time. So Apple's bug triage probably marked this one "WONTFIX" after deciding it only happened on hardware they no longer sell, so fixing it wasn't going to have a positive ROI.
(FWIW, if it sounds like I have a personal vendetta against the AHCI driver, that's because I do. ;-) It behaves erratically in ways unrelated to this crash, but that weirdness doesn't cause kernel panics, just extra work for developers of other drivers.)
You "fix" this the same way you fix any race: you get your synchronization right. The instructions used to implement the mutex (e.g. lock cmpxchg) are "serializing", which means that all memory operations issued before them in the instruction stream will be completed and committed before the instruction issues.
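For what it's worth, here is a rough C++ sketch of what "get your synchronization right" could look like for the article's two-thread store/load experiment. It is my own illustration, not the article's code: the function name `count_reorderings_with_mutex` is made up, and spawning fresh threads for every trial is a simplification. Each thread does its store and its load while holding the same `std::mutex`, so whichever thread takes the lock second is guaranteed to see the first thread's store, and the "both read 0" outcome cannot happen.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper: repeats the store-then-load experiment with both
// threads holding one mutex, and counts how often both threads read 0.
int count_reorderings_with_mutex(int iterations) {
    assert(iterations >= 0);
    int detected = 0;
    for (int i = 0; i < iterations; ++i) {
        int x = 0, y = 0;      // shared variables, protected by m
        int r1 = -1, r2 = -1;  // what each thread observed
        std::mutex m;

        std::thread t1([&] {
            std::lock_guard<std::mutex> lock(m);
            x = 1;
            r1 = y;
        });
        std::thread t2([&] {
            std::lock_guard<std::mutex> lock(m);
            y = 1;
            r2 = x;
        });
        t1.join();
        t2.join();

        // Without synchronization, r1 == 0 && r2 == 0 is a legal outcome.
        if (r1 == 0 && r2 == 0) ++detected;
    }
    return detected;
}
```

Calling `count_reorderings_with_mutex(500)` should always return 0, at the cost of serializing the two critical sections.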
This is because all of the cores on the Cell processor are in-order (http://en.wikipedia.org/wiki/Out-of-order_execution#In-order...). That fact is well documented: http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
And for the record, the "correct" thing to do is to always use the appropriate memory and instruction fences. Reasoning about where these should go can be subtle.
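As a concrete (and hedged) illustration of where a fence goes in this particular case, here is a minimal C++11 sketch of the same store-then-load pattern. The names `both_read_zero` and `count_reorderings` are hypothetical, and creating new threads for each trial is a simplification; the article starts both threads together with semaphores, so the unfenced version here will show reorderings far less often, if at all. The point is only the placement: a sequentially consistent fence between the store and the load in both threads rules out the case where both threads read 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one trial of the two-thread experiment.
// Returns true if both threads read 0 (a store->load reordering was visible).
bool both_read_zero(bool use_fence) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        // Full fence: the store above may not be reordered with the load below.
        if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: counts how many trials observed the reordering.
int count_reorderings(int iterations, bool use_fence) {
    assert(iterations >= 0);
    int detected = 0;
    for (int i = 0; i < iterations; ++i)
        if (both_read_zero(use_fence)) ++detected;
    return detected;
}
```

With `use_fence` set to true the count should always be 0; with it set to false the count may or may not be nonzero, depending on hardware and timing.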
This is a great explanation: http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07...
Am I incorrect? I can't see how multiple processors could share registers without chaos.
edit: after writing lock-free code years ago, and having very subtle bugs like the ones described here, I don't think it is a good idea anymore, unless you are in a very specific scenario.
Locks can be very slow (relatively). At my last job, using lock-free code was necessary for performance in many cases.
Two processors, running in parallel, execute the following machine code
Processors don't "run in parallel." Among other things, the OS could be scheduling the threads on each CPU in any way it wants. I just can't think about the false hypothetical that the processors "run in parallel."
I'm in the process of going through the article more carefully to see if this erroneous way of thinking propagated. It seems like the author knows his stuff, though, so I'm guessing not.
However, one of the reasons I was uncomfortable with the setup (besides thread scheduling) is that I personally don't make any assumptions about cache coherency. To me, it could take thread 1 an arbitrary amount of time to see an update made to memory by thread 2, unless some primitive (like a semaphore... with "acquire and release semantics") is used.
Right? I mean, couldn't that account for what is happening just as much as instruction reordering? In theory, I'm fairly certain this is true. In practice, it depends on the specifics of cache coherency in x86-64; if anyone can comment on that, I'd appreciate it.
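To make "acquire and release semantics" a bit more concrete, here is a small C++11 sketch of the usual message-passing pattern. It is only an illustration under my own assumptions (the function name `pass_message` is made up, and a spin-wait stands in for a semaphore): the writer publishes a plain variable and then sets a flag with release ordering, and the reader waits on the flag with acquire ordering. Once the reader sees the flag, the language guarantees it also sees the data written before it. This says nothing about how long the flag takes to become visible, only about ordering once it is.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hands `value` from one thread to another using a
// release store and an acquire load on a flag, then returns what was read.
int pass_message(int value) {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // publication flag
    int received = -1;

    std::thread consumer([&] {
        // Acquire: once this load returns true, the write to payload is visible.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        received = payload;
    });
    std::thread producer([&] {
        payload = value;
        // Release: everything written above is published before the flag.
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    assert(received == value);  // guaranteed by the release/acquire pairing
    return received;
}
```

For example, `pass_message(42)` returns 42 on every run, regardless of how the OS schedules the two threads.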
/FA[scu] configure assembly listing
(this is from CL 15.00.30729.207 from the WDK 7.1)
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6...
I'm seeing it happen roughly once every 1600 iterations with their makefile, and once every 2000 with -march=native. Very interesting.