I have a feeling that we will see a sharp rise in stories like this, now that ARM finds itself in more places that were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.
[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, and certainly don't invent your own. They were neither systems-level nor high-performance programmers, so they had that luxury.
You made me wonder, because I definitely remember using Peterson's Algorithm, so I went back to my slides, and it turns out: I first showed the problem with x86, then indeed added an MFENCE at the right place, and then showed how that was not enough for ARM. So the point back then was to show how weaker memory models can bite you with the example of x86, and then to show how it can still bite you on ARM with its even weaker model (ARMv7 at that time; C11 atomics aren't mentioned yet either, but their old OS-specific support is).
Makes me wonder if it's really a good idea in most cases to use, for example, the Rust parking_lot crate, which reimplements mutexes and RW locks. Besides the speed boost for uncontended RW locks, particularly on x64 Linux, what I really like about parking_lot is that a write lock can be downgraded to a read lock without letting go of the lock. But maybe I'd be better off sticking with the tried-and-true OS-provided lock implementations and finding another way to work around the lack of a downgrade option.
You do need both for the problem to happen: Without shared memory, there’s nothing to exploit. And with a single core only, you get time-sliced multithreading, which orders all operations.
My point is, that combination was a lot rarer in ARM land before people started doing serious server or desktop computing with those chips.
These outnumber x86+Cortex-A by probably a factor of 1,000.
I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to retarget an LLVM compiler to aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.
Lock-free doesn't mean that there is no synchronization; it is a way of synchronizing memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource: all read access is valid, and of any number of simultaneous write accesses, at least one succeeds (which is not true for some other schemes, like network exponential backoff).
Even on x86 the most common instruction you use is LOCK cmpxchg.
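To make that concrete, here is what a compare-and-swap loop looks like at the C++ level (a minimal sketch; `fetch_increment` is my name, not from any particular codebase). On x86, compare_exchange_weak is what typically compiles down to LOCK CMPXCHG:

```cpp
#include <atomic>

// Lock-free increment via compare-and-swap: of any number of
// concurrent callers, at least one CAS succeeds each round, so the
// system as a whole always makes progress.
int fetch_increment(std::atomic<int>& counter) {
    int old = counter.load(std::memory_order_relaxed);
    // On x86 this loop body typically compiles to LOCK CMPXCHG.
    // On failure, compare_exchange_weak refreshes 'old' with the
    // current value, so we just retry.
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
    }
    return old;  // value observed before the increment
}
```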
Usually, memory reordering is purely an artifact of the way CPUs access their private L1 caches.
People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.
Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.
That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.
(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)
One other comment: the explanation in [1] is slightly incorrect:
> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.
We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot we made this a unique_ptr), we're wrapping them in another unique_ptr. That works because we're passing a temporary (well, a prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.
[0] https://github.com/stong/how-to-exploit-a-double-free#0-inte...
[1] https://github.com/stong/how-to-exploit-a-double-free#2-rece...
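The double-wrap pattern described above, in miniature (`Result`, `pop_result`, and `collect_one` are my placeholder names, not the article's actual code):

```cpp
#include <memory>
#include <vector>

struct Result { int value; };

// Stand-in for popping from the results queue: it already returns a
// unique_ptr, i.e. a prvalue.
std::unique_ptr<Result> pop_result() {
    return std::make_unique<Result>(Result{42});
}

std::vector<std::unique_ptr<Result>> collect_one() {
    std::vector<std::unique_ptr<Result>> results;
    // Wrapping the prvalue in another unique_ptr looks like it might
    // need the (deleted) copy constructor, but guaranteed copy elision
    // (C++17) constructs it in place -- redundant, yet well-formed.
    results.push_back(std::unique_ptr<Result>(pop_result()));
    return results;
}
```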
Indeed. It's not safe under x86 either.
As a naive practitioner of modern C++, I'd love it if you could elaborate on this.
Using unique_ptr/make_unique() or shared_ptr/make_shared() automates lifetime management (obviates the need for a manual 'delete') and makes the ownership policy explicit. They also have appropriately defined copying behavior. For example:
    struct Foo {
        // lots of stuff here ...
    };

    struct A {
        Foo* f = new Foo;
        ~A() { delete f; }
    };

    struct B {
        std::unique_ptr<Foo> f = std::make_unique<Foo>();
        // no need to define a dtor; the default dtor is fine
    };
For the destructor and the default constructor, compilers will generate basically identical code for both A and B above. If you try to copy a B, the compiler won't let you, because unique_ptr isn't copyable. However, it won't stop you from copying an A, even though as written (using the default copy ctor) that's almost certainly a mistake and will likely result in a double free in ~A(). unique_ptr forces you to think about your dependencies and when objects can / should be cleaned up.
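For completeness, the manual fix for A is the rule of three: delete (or write) the copy operations yourself, which is exactly what unique_ptr gives you for free. A sketch (Foo trimmed to a stub):

```cpp
#include <memory>

struct Foo { int x = 0; };

struct A {
    Foo* f = new Foo;
    ~A() { delete f; }
    // Without these, the compiler-generated copy duplicates the raw
    // pointer, so two A's own the same Foo and the second ~A()
    // double-frees. Rule of three: delete copying explicitly.
    A() = default;
    A(const A&) = delete;
    A& operator=(const A&) = delete;
};

struct B {
    std::unique_ptr<Foo> f = std::make_unique<Foo>();
    // Nothing else needed: unique_ptr already deletes copying, so
    // `B b2 = b1;` fails at compile time instead of at runtime.
};
```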
Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
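A sketch of what "the correct memory barriers on head and tail" means in C++ terms (a minimal bounded SPSC queue of my own naming, not the article's code). The release store on an index publishes the slot write; the matching acquire load guarantees the slot contents are visible before use:

```cpp
#include <atomic>
#include <cstddef>

// Single-producer single-consumer ring buffer. With plain (or
// volatile) ints for head/tail instead of these atomics, both the
// compiler and a weakly ordered CPU like ARM are free to reorder the
// index update relative to the slot write.
template <typename T, size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<size_t> head_{0};  // written only by the consumer
    std::atomic<size_t> tail_{0};  // written only by the producer
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if ((t + 1) % N == head_.load(std::memory_order_acquire))
            return false;  // full
        buf_[t] = v;       // write the slot first...
        tail_.store((t + 1) % N, std::memory_order_release);  // ...then publish
        return true;
    }
    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;  // empty
        out = buf_[h];     // read the slot before releasing it
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```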
Another "correct" use of volatile is as a hack to prevent compilers from optimizing away certain code. It's pretty rare to need that, and often you can just use a lower optimization level (like the usual -O2), but sometimes you need -O3 / -Ofast or something, and a strategically placed volatile to keep everything working.
A classic example is the Kahan summation algorithm. At -O2 it's fine. At -O3 or higher it silently defeats the algorithm while appearing to work (you get a sum, but without the error compensation). Defining the working vars as volatile makes it work again. This is noted in the Wikipedia pseudocode with the comment "// Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!"
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Of course -O3 might not be any faster anyway but that's another topic.
Do you have anything to actually support this statement or did you just assume “overly aggressive optimizing compilers” and “O3” are somehow linked?
Generally optimization levels may find more opportunities to exploit UB, but they do not change the semantics of the language, and all languages I’m familiar with define floating point as a non-associative operation because it’s not when you’re working with finite precision.
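Non-associativity is easy to demonstrate, no compiler flags involved (a two-line sketch; `fp_assoc_holds` is my name for it):

```cpp
// With finite precision, (a + b) + c and a + (b + c) can differ:
// adding the small term to the big one first loses it entirely,
// while summing the small terms together first preserves them.
bool fp_assoc_holds(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}
```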
TLDR: Don't use volatile unless you really know what you're doing, and unless you know C/C++ really well, you probably do not. If anyone tells you to throw in a volatile to "make things work", it's most likely cargo-culting bad advice (not always, but probably).
See, for example, the implementation of smp_store_release and smp_load_acquire in the Linux kernel [1] (barrier() is just a compiler barrier, and {READ,WRITE}_ONCE are casts to volatile).
Volatile only prevents reordering of volatile accesses (and I/O), not all loads and stores.
[1] https://elixir.bootlin.com/linux/latest/source/tools/arch/x8...
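Roughly, those x86 kernel macros boil down to this (a simplified C++ rendering of the pattern using GCC-style inline asm, not the kernel's actual code; on ARM these would need real barrier instructions):

```cpp
// On x86, the hardware already gives release/acquire ordering for
// plain loads and stores, so the only thing to defeat is the
// *compiler*: barrier() stops it from reordering across this point,
// and the volatile cast (the WRITE_ONCE/READ_ONCE part) stops it
// from tearing, caching, or eliminating the access itself.
#define barrier() asm volatile("" ::: "memory")

template <typename T>
void smp_store_release(T* p, T v) {
    barrier();                         // don't sink earlier stores below this
    *static_cast<volatile T*>(p) = v;  // WRITE_ONCE
}

template <typename T>
T smp_load_acquire(const T* p) {
    T v = *static_cast<const volatile T*>(p);  // READ_ONCE
    barrier();                         // don't hoist later loads above this
    return v;
}
```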
> You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
I am not sure about this. From my understanding, on x86, given the absence of compiler reordering, processor reordering should not cause a problem for a single-reader-single-writer FIFO. Normally I just use atomics but I think in this specific instance it should still be okay anyways. Obviously it will not work on ARM.
From my testing if you compile the code on x86 with clang or gcc, the resulting binary is not vulnerable.
[1] see the linux kernel implementation of load acquire and store release on x86 for example.
But yes, `volatile` for what should be atomics is a clear code smell. I made quite a loud noise when I read "the code quality looks excellent" in the article.
One side of the queue is a peripheral like a serial port that needs to be fed/drained like clockwork to avoid losing data or glitching (e.g. via interrupts or DMA), and the other side is usually software running on the main thread, that wants to be able to work at its own pace and also go to sleep sometimes.
An SPSC queue fits this use-case nicely. James Munns has a fancy one written in Rust [1], and I have a ~100 line C template [2].
[1] https://github.com/jamesmunns/bbqueue
[2] https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d70...
They have to be tweaked (by adding memory barriers) when ordered execution isn't guaranteed. TFA is about an exploit based on code that hasn't added the required memory barriers.
https://www.reitzen.com/post/temporal-fuzzing-01/ https://www.reitzen.com/post/temporal-fuzzing-02/
Next step are some lock free queues, although I haven't gotten around to publishing them!
Anyways, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization and it blew me away with how fast it can catch most bugs like this.
For example, Rust doesn't have any way to know that your chosen lock-free algorithm relies on Acquire-release semantics to perform as intended, and so if you write safe Rust to implement it with Relaxed ordering, it will compile, and run, and on x86-64 it will even work just fine because the cheap behaviour on x86-64 has Acquire-release semantics anyway. But on ARM your program doesn't work because ARM really does have a Relaxed mode and without Acquire-release what you've got is not the clever lock-free algorithm you intended after all.
However, if you don't even understand what Ordering is, and just try to implement the naive algorithm in Rust without Atomic operations that take an Ordering, Rust won't compile your program at all because it could race. So this way you are at least confronted with the fact that it's time to learn about Ordering if you want to implement this algorithm and if you pick Relaxed you can keep the resulting (safe) mess you made.
And atomics force you to specify an ordering on every access, which helps both the writer (forced to think about which ordering they need) and reviewer (by communicating intent).
Other tooling, like Jepsen, will interact with your program at a higher level.
Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
SCNR
There may be a typo in section 3:
> It will happily retire instruction 6 before instruction 5.
If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").
tl;dr: just use std::atomic.
[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...
The purpose of abolishing volatile isn't so much to reinforce that it's not intended for this sort of threading nonsense (indeed on Windows the MSVC guarantees mean it almost is intended for this sort of nonsense) but to make it explicit that "volatile variables" were never really a thing anyway by abolishing the volatile qualifier on variables.
The thing your hardware can actually do is almost exactly: https://doc.rust-lang.org/core/ptr/fn.read_volatile.html and https://doc.rust-lang.org/core/ptr/fn.write_volatile.html
And sure enough that's equivalent to what is proposed for C++ although not in just this one paper.
With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing. It means nothing, it's gibberish, but you can do it and people do. They presumably thought it meant something and since it doesn't they were wrong. So, deprecate this and maybe they'll go read up on the subject.
You can also in C++ declare things that clearly aren't in the least bit volatile, as volatile anyway. C++ has volatile member variables, volatile member functions, volatile parameters... Any code that seems to rely on this probably doesn't do what the people who wrote it thought it does, run away.
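The gibberish becomes visible if you desugar it: on a volatile lvalue, ++ is one volatile read plus one volatile write, i.e. two separate accesses, not one operation (a sketch; `read_volatile`/`write_volatile` here are hypothetical C++ analogues of the Rust functions linked above):

```cpp
// Hypothetical C++ versions of Rust's ptr::read_volatile /
// ptr::write_volatile, which make the individual accesses explicit.
template <typename T>
T read_volatile(const T* p) { return *static_cast<const volatile T*>(p); }

template <typename T>
void write_volatile(T* p, T v) { *static_cast<volatile T*>(p) = v; }

// `volatile int i; i++;` is exactly this: a volatile load, a plain
// add, a volatile store -- two accesses that the hardware can
// interleave with other writers, which is why "i++ is one operation"
// intuitions fail.
int increment(int* i) {
    int tmp = read_volatile(i);
    write_volatile(i, tmp + 1);
    return tmp;  // value before the increment, like postfix ++
}
```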
It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow don't understand that. Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'. Does that also not make any sense to you? If it does, can you explain why you believe they are different?
Volatile member functions and parameters make no sense, but volatile member variables most certainly do. And there is considerable pushback in the C++ community because this is a significant loss of compatibility with various C-headers used frequently in embedded applications. I wouldn't be surprised if the deprecated features will be reinstated in the language in the end.
I do sort of miss having a basic volatile (although I can write my own type somewhat effectively) just for benchmarking's sake sometimes.
https://news.ycombinator.com/item?id=28731534
https://mobile.twitter.com/ErrataRob/status/1331735383193903...
1) Emulating an ISA includes emulating its memory model. As saagarjha says, this means that Rosetta 2 must (and does) correctly implement total store ordering.
2) There are various ways to implement this. For emulators that include a binary translation layer (that is, that translate x86 opcodes into a sequence of ARM opcodes), one route is to generate the appropriate ARM memory barriers as part of the translation. Even with optimization to reduce the number of necessary barriers, though, this is expensive. Instead, as mmwelt mentions, Apple took an unusual route here. The Apple Silicon MMU can be configured on a per-page basis to use either the relaxed ARM memory model or the TSO x86 memory model. There is a performance cost at the hardware level for using TSO, and there is a cost in silicon area for supporting both; but from the point of view of Rosetta 2, all it has to do is mark x86-accessed pages as TSO and the hardware takes care of the details, no software memory barriers needed.
> You can also select multi-core settings, as shown here... These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core.
https://news.ycombinator.com/item?id=28732273
zamadatix interprets this as Microsoft saying that, by default, Windows on ARM runs x86 apps without x86 TSO, and turns on extra memory barriers via per-app compatibility settings. But if an app needs TSO and isn't in Windows's database, it will crash or silently corrupt data.
Because on x86 it is, no special barriers or instructions necessary.
    mov [shared_data], 1
    mov [release_flag], 1
That said, heuristics are used to speed it up. I would recommend not sharing values in the stack between threads for synchronisation for example.