Memory Reordering Caught in the Act

(preshing.com)

137 points · jenhsun · 14y ago · 32 comments

24 comments · 9 top-level

reitzensteinm · 14y ago · 5 in thread

Probably the most important broad point of the article:

"And there you have it! During this run, a memory reordering was detected approximately once every 6600 iterations. When I tested in Ubuntu on a Core 2 Duo E6300, the occurrences were even more rare. One begins to appreciate how subtle timing bugs can creep undetected into lock-free code."

If you were shipping software to millions of people, you could reasonably never see a specific race condition error that ends up occurring on many tens of thousands of machines.

And then when you start to consider interactions between infrequent errors...

bediger4000 · 14y ago

But now every programmer is supposed to take advantage of the multi-core processors needed to run modern operating systems, aren't they? This is going to lead to a number of prominent disasters.

pmjordan · 14y ago

I don't know about "disasters". For better or worse, lots of software ships with severe bugs that are evidently rare enough never to be fixed but do happen.

For example, every few months, OSX kernel panics on my MacBook Air, with no obvious trigger. The backtrace always implicates the AHCI driver for the SATA SSD. Still, Apple evidently don't receive enough crash reports for this particular bug to actually fix it, as it's happened since I first bought the Air in November 2010.

I have no idea whether it's a threading bug like the one in the article (I'm not about to run my system in single-core mode for months just to find out) or maybe a race condition with DMA or just a simple logic error that only rarely applies. It might even be specific to the exact SSD model and revision I ended up with in my device. But that's pretty much irrelevant - there certainly hasn't been a widespread outcry about it.

Personally, as a developer, I would definitely want to fix that kind of bug in anything I'd built. But tracking it down might take weeks of costly developer time. So Apple's bug triage probably marked this one "WONTFIX" after deciding it only happened on hardware they no longer sell, so fixing it wasn't going to have a positive ROI.

(FWIW, if it sounds like I have a personal vendetta against the AHCI driver, that's because I do. ;-) It behaves erratically in ways unrelated to this crash, but that weirdness doesn't cause kernel panics, just extra work for developers of other drivers.)

ajross · 14y ago

Not really. Memory ordering and cache coherency issues are endemic to race conditions, but race conditions are bad bugs even in the absence of fun tricks by the caches.

You "fix" this the same way you fix any race: you get your synchronization right. The instructions used to implement the mutex (e.g. lock cmpxchg on x86) act as full memory barriers, which means that all memory operations issued before them in the instruction stream are completed and globally visible before any later memory operation takes effect.
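As a rough illustration of "getting your synchronization right", here is a minimal sketch of the article's store-then-load experiment with both threads wrapped in a std::mutex. The names (m, X, Y, r1, r2, both_zero_with_mutex) are made up for this example and are not taken from the article's source. Because the two critical sections cannot overlap, whichever thread runs second must see the first thread's store, so the "both reads returned 0" outcome cannot occur.

```cpp
#include <mutex>
#include <thread>

// All names here are hypothetical and chosen only for illustration.
std::mutex m;
int X = 0, Y = 0;    // shared locations, only touched while holding m
int r1 = 0, r2 = 0;  // what each thread read from the other's location

// Runs one round of the experiment with both threads serialized by the mutex.
// Returns true if both threads read 0, which the lock makes impossible.
bool both_zero_with_mutex() {
    X = 0;
    Y = 0;
    std::thread t1([] {
        std::lock_guard<std::mutex> guard(m);
        X = 1;   // store to our own location
        r1 = Y;  // then load the other thread's location
    });
    std::thread t2([] {
        std::lock_guard<std::mutex> guard(m);
        Y = 1;
        r2 = X;
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}
```

Calling both_zero_with_mutex() in a loop should never return true. The cost is that the two threads no longer run their store/load pairs concurrently at all.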

scott_s · 14y ago

But not every programmer is going to implement their own lock-free algorithms (http://en.wikipedia.org/wiki/Non-blocking_algorithm). Experts write such algorithms, and other programmers will build their applications on top of those libraries. And even more programmers will use abstractions that aren't even libraries, but language-level abstractions.

keeperofdakeys · 14y ago

Most programs don't use anywhere near enough CPU to make parallel algorithms pay off. So you may as well keep it simple, especially if you aren't hitting the limits of your current algorithms. Simplicity means less code, which means fewer bugs.

scott_s · 14y ago · 4 in thread

On a related note, I compiled and ran this sample on Playstation 3, and no memory reordering was detected. This suggests that the two hardware threads inside the PPU effectively act as a single processor, with very fine-grained hardware scheduling.

This is because all of the cores on the Cell processor are in-order (http://en.wikipedia.org/wiki/Out-of-order_execution#In-order...). That fact is well documented: http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf

And for the record, the "correct" thing to do is to always use the appropriate memory and instruction fences. Reasoning about where these should go can be subtle.
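To make the fence idea concrete, here is a hedged sketch in portable C++11 rather than the article's own code. The names (X, Y, r1, r2, store_fence_load, both_zero_once) are invented for this example. A sequentially consistent fence sits between each thread's store and its load, and under the C++ memory model that is enough to guarantee that at least one of the two loads observes the other thread's store.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, used only for this illustration.
std::atomic<int> X{0}, Y{0};
int r1 = 0, r2 = 0;

// Store to our own location, fence, then load the other thread's location.
// The seq_cst fence keeps the load from taking effect ahead of the store.
void store_fence_load(std::atomic<int>& mine, std::atomic<int>& theirs, int& out) {
    mine.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    out = theirs.load(std::memory_order_relaxed);
}

// Runs one round; returns true only if both threads read 0.
bool both_zero_once() {
    X.store(0);
    Y.store(0);
    std::thread t1([] { store_fence_load(X, Y, r1); });
    std::thread t2([] { store_fence_load(Y, X, r2); });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}
```

The subtle part is placement: the fence has to sit between the store and the load in both threads. Fencing only one side still leaves the other thread free to reorder.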

preshing · 14y ago

Actually, it's not because the PPU executes in-order, which has only to do with the way the CPU orders instructions internally and says nothing about memory ordering. For example, the Xbox 360 has in-order processors too, yet you can observe memory reordering all over the place. Both consoles use the PowerPC architecture, which is well-known to provide weak memory ordering.

pron · 14y ago

Correct. Memory operation re-ordering can occur due to the way the cache-coherence mechanism is implemented. If messages on the bus between cores are not ordered, memory operations can effectively be reordered even if each core never performs any reordering (e.g. due to cache misses).

This is a great explanation: http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07...

Deregibus · 14y ago

The Xbox 360 and PS3 processors are very similar; one main difference is that the PS3 (ignoring SPUs) is single-core with two hardware threads, while the 360 is triple-core with two hardware threads per core. I would imagine that if you locked the affinity of the two threads to the two hardware threads of one of the 360's cores, you would see the same behavior. IIRC, the two hardware threads on those CPUs are essentially just duplicate sets of registers that each have their own instruction stream, but execute the instructions in a single shared pipeline (the benefit being that you have more instructions available to fill the rather long CPU pipeline and are less likely to waste cycles while stalled on memory fetches). As far as the actual executing instructions are concerned, there would be a single stream of instructions and thus no opportunity for the memory ordering effects that you might see with two distinct cores/processors.

maximilianburke · 14y ago

You can most definitely witness memory reordering on the PS3 when you have code on the PPU and SPUs operating on the same memory.

tocomment · 14y ago · 2 in thread

I hope this isn't too dumb, but I always imagined each processor having its own set of registers.

Am I incorrect? I can't see how multiple processors can share registers without chaos?

HeXetic · 14y ago

The registers aren't shared here, but the main memory is. The example has each thread write '1' into a [shared, main] memory location and then read the other thread's memory location into a register. If both CPUs' registers are 0, that means that both reads occurred before both writes.
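The experiment described above can be sketched roughly as follows. This is not the article's code: the names (X, Y, r1, r2, store_then_load, count_reorderings) are hypothetical, and the harness is deliberately simplified by spawning two fresh threads per iteration. Relaxed atomics stand in for the plain stores and loads so that the program has no data race in the C++ sense while still permitting the reordering.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, chosen only for this sketch.
std::atomic<int> X{0}, Y{0};  // the two shared memory locations
int r1 = 0, r2 = 0;           // stand-ins for each CPU's register

// Store 1 to our own location, then load the other thread's location.
// Relaxed ordering asks for atomicity only, so the compiler and the CPU
// are both allowed to let the load take effect before the store.
void store_then_load(std::atomic<int>& mine, std::atomic<int>& theirs, int& out) {
    mine.store(1, std::memory_order_relaxed);
    out = theirs.load(std::memory_order_relaxed);
}

// Counts the iterations in which both threads read 0, meaning both loads
// took effect before either store became visible.
int count_reorderings(int iterations) {
    int detected = 0;
    for (int i = 0; i < iterations; ++i) {
        X.store(0);
        Y.store(0);
        std::thread t1([] { store_then_load(X, Y, r1); });
        std::thread t2([] { store_then_load(Y, X, r2); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++detected;
    }
    return detected;
}
```

Because the threads are created one after the other, they rarely overlap closely enough for the reordering to show up, so a count of zero from this sketch proves nothing. A more serious harness would keep two long-lived threads and release them at the same moment on every iteration.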

astrange · 14y ago

Of course they have different registers, but they sure don't have different memory.

loboman · 14y ago · 1 in thread

So what's the surprise here? This is exactly why you use locks in the code: you don't know what the compiler or the CPU is going to do with your code, and you don't want to know all the details of that. The solution is not to know all those asm details, but to use tools like locks correctly.

edit: after writing lock-free code years ago, and having very subtle bugs like the ones described here, I don't think it is a good idea anymore, unless you are in a very specific scenario.

AlexandrB · 14y ago

> ...unless you are in a very specific scenario.

Locks can be very slow (relatively). At my last job using lock-free code was necessary for performance in many cases.

javert · 14y ago · 1 in thread

This whole article was problematic to me because it starts out with:

> Two processors, running in parallel, execute the following machine code

Processors don't "run in parallel." Among other things, the OS could be scheduling the threads on each CPU in any way it wants. I just can't think about the false hypothetical that the processors "run in parallel."

I'm in the process of going through the article more carefully to see if this erroneous way of thinking propagated. It seems like the author knows his stuff, though, so I'm guessing not.

javert · 14y ago

I didn't find any problems.

However, one of the reasons I was uncomfortable with the setup (besides thread scheduling) is that I personally don't make any assumptions about cache coherency. To me, it could take thread 1 an arbitrary amount of time to see an update made to memory by thread 2, unless some primitive (like a semaphore... with "acquire and release semantics") is used.

Right? I mean, couldn't that account for what is happening just as much as instruction reordering? In theory, I'm fairly certain this is true. In practice, it depends on the specifics of cache coherency in x86-64; if anyone can comment on that, I'd appreciate it.
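For readers unfamiliar with "acquire and release semantics", here is a minimal sketch in C++11 atomics, with hypothetical names (payload, ready, publish_and_read) that do not come from the article. It does not answer the x86-64 question. It only shows the guarantee such a primitive provides: once the consumer's acquire load sees the flag set by the producer's release store, everything the producer wrote before that store is visible too.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

// The producer writes the payload, then sets the flag with release semantics.
// The consumer spins on the flag with acquire semantics, then reads the payload.
int publish_and_read() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread producer([] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait until the flag is published
        }
        seen = payload;  // guaranteed to observe 42 here
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Note what is and isn't promised: the consumer may spin for an arbitrary time before it sees the flag, but it can never see the flag without also seeing the payload.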

alpb · 14y ago · 1 in thread

This is a very nice post. I didn't know -S existed before.

malkia · 14y ago

On the MSVC side the option is -FA[scu]

/FA[scu] configure assembly listing

(this is from CL 15.00.30729.207 from the WDK 7.1)

mrushton14 · 14y ago · 1 in thread

I do embedded Linux development and I see bugs related to this stuff frequently. What can make things trickier is that different architectures have different ordering guarantees. Does anyone know of any CPU architectures that intentionally sacrifice performance for a simpler, less bug-prone model? I'm thinking of markets like aviation, where bugs like this seem particularly scary.

duskwuff · 14y ago

There's no need for a separate architecture to avoid concurrency issues; a simpler solution is to simply use a uniprocessor system. (And be careful with interrupts.)

adobriyan · 14y ago

Somewhat more systematic document:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6...

simcop2387 · 14y ago

Apparently this happens a lot more for me on an AMD Bulldozer.

I'm seeing it about once every 1600 iterations with their makefile and once every 2000 with -march=native. Very interesting.
