I have a feeling that we will see a sharp rise in stories like this, now that ARM finds itself in more places that were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.
[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, and certainly don't invent your own. They were neither systems-level nor high-performance programmers, so they had that luxury.
You made me wonder, because I definitely remember using Peterson's Algorithm, so I went back to my slides, and it turns out: I first showed the problem with x86, then indeed added an MFENCE at the right place, and then showed how that was not enough for ARM. So the point back then was to show how weaker memory models can bite you with the example of x86, and then to show how it can still bite you on ARM with its even weaker model (ARMv7 at that time; C11 atomics aren't mentioned yet either, but their old OS-specific support is).
Makes me wonder if it's really a good idea in most cases to use, for example, the Rust parking_lot crate, which reimplements mutexes and RW locks. Besides the speed boost for uncontended RW locks, particularly on x64 Linux, what I really like about parking_lot is that a write lock can be downgraded to a read lock without letting go of the lock. But maybe I'd be better off sticking with the tried-and-true OS-provided lock implementations and finding another way to work around the lack of a downgrade option.
You do need both for the problem to happen: Without shared memory, there’s nothing to exploit. And with a single core only, you get time-sliced multithreading, which orders all operations.
My point is, that combination was a lot rarer in ARM land before people started doing serious server or desktop computing with those chips.
These outnumber x86+Cortex-A by probably a factor of 1,000.
I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to retarget an LLVM compiler to aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.
Lock-free doesn't mean that there is no synchronization; it is a way of synchronizing memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource: all read access is valid, and of any number of simultaneous write accesses, at least one succeeds (which is not true for some other schemes, like network exponential backoff).
Even on x86 the most common instruction you use is LOCK cmpxchg.
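To make that concrete, here is what a compare-and-swap loop looks like at the C++ level (a minimal sketch; `fetch_increment` is my name, not from any particular codebase). On x86, compare_exchange_weak is what typically compiles down to LOCK CMPXCHG:

```cpp
#include <atomic>

// Lock-free increment via compare-and-swap: of any number of
// concurrent callers, at least one CAS succeeds each round, so the
// system as a whole always makes progress.
int fetch_increment(std::atomic<int>& counter) {
    int old = counter.load(std::memory_order_relaxed);
    // On x86 this loop body typically compiles to LOCK CMPXCHG.
    // On failure, compare_exchange_weak refreshes 'old' with the
    // current value, so we just retry.
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
    }
    return old;  // value observed before the increment
}
```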
Usually, memory reordering is purely an artifact of the way CPUs access their private L1 caches.
People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.
Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.
That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.
(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)
One other comment: the explanation in [1] is slightly incorrect:
> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.
We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot we made this a unique_ptr), we're wrapping them in another unique_ptr. That works because we're passing a temporary (well, a prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.
[0] https://github.com/stong/how-to-exploit-a-double-free#0-inte...
[1] https://github.com/stong/how-to-exploit-a-double-free#2-rece...
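The double-wrap pattern described above, in miniature (`Result`, `pop_result`, and `collect_one` are my placeholder names, not the article's actual code):

```cpp
#include <memory>
#include <vector>

struct Result { int value; };

// Stand-in for popping from the results queue: it already returns a
// unique_ptr, i.e. a prvalue.
std::unique_ptr<Result> pop_result() {
    return std::make_unique<Result>(Result{42});
}

std::vector<std::unique_ptr<Result>> collect_one() {
    std::vector<std::unique_ptr<Result>> results;
    // Wrapping the prvalue in another unique_ptr looks like it might
    // need the (deleted) copy constructor, but guaranteed copy elision
    // (C++17) constructs it in place -- redundant, yet well-formed.
    results.push_back(std::unique_ptr<Result>(pop_result()));
    return results;
}
```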
Indeed. It's not safe under x86 either.
As a naive practitioner of modern C++, I'd love it if you could elaborate on this.
Using unique_ptr/make_unique() or shared_ptr/make_shared() automates lifetime management (obviates the need for a manual 'delete') and makes the ownership policy explicit. They also have appropriately defined copying behavior. For example:
    struct Foo {
        // lots of stuff here ...
    };

    struct A {
        Foo* f = new Foo;
        ~A() { delete f; }
    };

    struct B {
        std::unique_ptr<Foo> f = std::make_unique<Foo>();
        // no need to define a dtor; the default dtor is fine
    };
For the destructor and the default constructor, compilers will generate basically identical code for both A and B above. If you try to copy a B, the compiler won't let you, because unique_ptr isn't copyable. However, it won't stop you from copying an A, even though as written (using the default copy ctor) that's almost certainly a mistake and will likely result in a double free in ~A(). unique_ptr forces you to think about your dependencies and when objects can / should be cleaned up.
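For completeness, the manual fix for A is the rule of three: delete (or write) the copy operations yourself, which is exactly what unique_ptr gives you for free. A sketch (Foo trimmed to a stub):

```cpp
#include <memory>

struct Foo { int x = 0; };

struct A {
    Foo* f = new Foo;
    ~A() { delete f; }
    // Without these, the compiler-generated copy duplicates the raw
    // pointer, so two A's own the same Foo and the second ~A()
    // double-frees. Rule of three: delete copying explicitly.
    A() = default;
    A(const A&) = delete;
    A& operator=(const A&) = delete;
};

struct B {
    std::unique_ptr<Foo> f = std::make_unique<Foo>();
    // Nothing else needed: unique_ptr already deletes copying, so
    // `B b2 = b1;` fails at compile time instead of at runtime.
};
```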
Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
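A sketch of what "the correct memory barriers on head and tail" means in C++ terms (a minimal bounded SPSC queue of my own naming, not the article's code). The release store on an index publishes the slot write; the matching acquire load guarantees the slot contents are visible before use:

```cpp
#include <atomic>
#include <cstddef>

// Single-producer single-consumer ring buffer. With plain (or
// volatile) ints for head/tail instead of these atomics, both the
// compiler and a weakly ordered CPU like ARM are free to reorder the
// index update relative to the slot write.
template <typename T, size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<size_t> head_{0};  // written only by the consumer
    std::atomic<size_t> tail_{0};  // written only by the producer
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if ((t + 1) % N == head_.load(std::memory_order_acquire))
            return false;  // full
        buf_[t] = v;       // write the slot first...
        tail_.store((t + 1) % N, std::memory_order_release);  // ...then publish
        return true;
    }
    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;  // empty
        out = buf_[h];     // read the slot before releasing it
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```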
Another "correct" use of volatile is as a hack to prevent compilers from optimizing away certain code. It's pretty rare to need that, and often you can just use a lower optimization level (like the usual -O2), but sometimes you need -O3 / -Ofast or something, and a strategically placed volatile to keep everything working.
A classic example is the Kahan summation algorithm. At -O2 it's fine. At -O3 or higher it silently defeats the algorithm while appearing to work (you get a sum, but without the error compensation). Defining the working vars as volatile makes it work again. This is noted in the Wikipedia pseudocode with the comment "// Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!"
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Of course -O3 might not be any faster anyway but that's another topic.
Do you have anything to actually support this statement or did you just assume “overly aggressive optimizing compilers” and “O3” are somehow linked?
Generally optimization levels may find more opportunities to exploit UB, but they do not change the semantics of the language, and all languages I’m familiar with define floating point as a non-associative operation because it’s not when you’re working with finite precision.
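Non-associativity is easy to demonstrate, no compiler flags involved (a two-line sketch; `fp_assoc_holds` is my name for it):

```cpp
// With finite precision, (a + b) + c and a + (b + c) can differ:
// adding the small term to the big one first loses it entirely,
// while summing the small terms together first preserves them.
bool fp_assoc_holds(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}
```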
TLDR: Don't use volatile unless you really know what you're doing, and unless you know C/C++ really well, you probably do not. If anyone tells you to throw in a volatile to "make things work", it's most likely cargo-culting bad advice (not always, but probably).
See, for example, the implementation of smp_store_release and smp_load_acquire in the Linux kernel [1] (barrier() is just a compiler barrier, and {READ,WRITE}_ONCE are casts to volatile).
Volatile only prevents reordering of volatile accesses (and I/O), not all loads and stores.
[1] https://elixir.bootlin.com/linux/latest/source/tools/arch/x8...
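Roughly, those x86 kernel macros boil down to this (a simplified C++ rendering of the pattern using GCC-style inline asm, not the kernel's actual code; on ARM these would need real barrier instructions):

```cpp
// On x86, the hardware already gives release/acquire ordering for
// plain loads and stores, so the only thing to defeat is the
// *compiler*: barrier() stops it from reordering across this point,
// and the volatile cast (the WRITE_ONCE/READ_ONCE part) stops it
// from tearing, caching, or eliminating the access itself.
#define barrier() asm volatile("" ::: "memory")

template <typename T>
void smp_store_release(T* p, T v) {
    barrier();                         // don't sink earlier stores below this
    *static_cast<volatile T*>(p) = v;  // WRITE_ONCE
}

template <typename T>
T smp_load_acquire(const T* p) {
    T v = *static_cast<const volatile T*>(p);  // READ_ONCE
    barrier();                         // don't hoist later loads above this
    return v;
}
```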
> You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
I am not sure about this. From my understanding, on x86, given the absence of compiler reordering, processor reordering should not cause a problem for a single-reader-single-writer FIFO. Normally I just use atomics but I think in this specific instance it should still be okay anyways. Obviously it will not work on ARM.
From my testing if you compile the code on x86 with clang or gcc, the resulting binary is not vulnerable.
[1] see the linux kernel implementation of load acquire and store release on x86 for example.
But yes, `volatile` for what should be atomics is a clear code smell. I made quite a loud noise when I read "the code quality looks excellent" in the article.
One side of the queue is a peripheral like a serial port that needs to be fed/drained like clockwork to avoid losing data or glitching (e.g. via interrupts or DMA), and the other side is usually software running on the main thread, that wants to be able to work at its own pace and also go to sleep sometimes.
An SPSC queue fits this use-case nicely. James Munns has a fancy one written in Rust [1], and I have a ~100 line C template [2].
[1] https://github.com/jamesmunns/bbqueue
[2] https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d70...
They have to be tweaked (by adding memory barriers) when ordered execution isn't guaranteed. TFA is about an exploit based on code that hasn't added the required memory barriers.
https://www.reitzen.com/post/temporal-fuzzing-01/ https://www.reitzen.com/post/temporal-fuzzing-02/
Next step are some lock free queues, although I haven't gotten around to publishing them!
Anyways, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization and it blew me away with how fast it can catch most bugs like this.
For example, Rust doesn't have any way to know that your chosen lock-free algorithm relies on Acquire-release semantics to perform as intended, and so if you write safe Rust to implement it with Relaxed ordering, it will compile, and run, and on x86-64 it will even work just fine because the cheap behaviour on x86-64 has Acquire-release semantics anyway. But on ARM your program doesn't work because ARM really does have a Relaxed mode and without Acquire-release what you've got is not the clever lock-free algorithm you intended after all.
However, if you don't even understand what Ordering is, and just try to implement the naive algorithm in Rust without Atomic operations that take an Ordering, Rust won't compile your program at all because it could race. So this way you are at least confronted with the fact that it's time to learn about Ordering if you want to implement this algorithm and if you pick Relaxed you can keep the resulting (safe) mess you made.
And atomics force you to specify an ordering on every access, which helps both the writer (forced to think about which ordering they need) and reviewer (by communicating intent).
Other tooling, like Jepsen, will interact with your program at a higher level.
Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
SCNR
There may be a typo in section 3:
> It will happily retire instruction 6 before instruction 5.
If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").
tl;dr: just use std::atomic.
[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...
The purpose of abolishing volatile isn't so much to reinforce that it's not intended for this sort of threading nonsense (indeed on Windows the MSVC guarantees mean it almost is intended for this sort of nonsense) but to make it explicit that "volatile variables" were never really a thing anyway by abolishing the volatile qualifier on variables.
The thing your hardware can actually do is almost exactly: https://doc.rust-lang.org/core/ptr/fn.read_volatile.html and https://doc.rust-lang.org/core/ptr/fn.write_volatile.html
And sure enough that's equivalent to what is proposed for C++ although not in just this one paper.
With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing. It means nothing, it's gibberish, but you can do it and people do. They presumably thought it meant something and since it doesn't they were wrong. So, deprecate this and maybe they'll go read up on the subject.
You can also in C++ declare things that clearly aren't in the least bit volatile, as volatile anyway. C++ has volatile member variables, volatile member functions, volatile parameters... Any code that seems to rely on this probably doesn't do what the people who wrote it thought it does, run away.
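The gibberish becomes visible if you desugar it: on a volatile lvalue, ++ is one volatile read plus one volatile write, i.e. two separate accesses, not one operation (a sketch; `read_volatile`/`write_volatile` here are hypothetical C++ analogues of the Rust functions linked above):

```cpp
// Hypothetical C++ versions of Rust's ptr::read_volatile /
// ptr::write_volatile, which make the individual accesses explicit.
template <typename T>
T read_volatile(const T* p) { return *static_cast<const volatile T*>(p); }

template <typename T>
void write_volatile(T* p, T v) { *static_cast<volatile T*>(p) = v; }

// `volatile int i; i++;` is exactly this: a volatile load, a plain
// add, a volatile store -- two accesses that the hardware can
// interleave with other writers, which is why "i++ is one operation"
// intuitions fail.
int increment(int* i) {
    int tmp = read_volatile(i);
    write_volatile(i, tmp + 1);
    return tmp;  // value before the increment, like postfix ++
}
```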
It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow don't understand that. Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'. Does that also not make any sense to you? If it does, can you explain why you believe they are different?
Volatile member functions and parameters make no sense, but volatile member variables most certainly do. And there is considerable pushback in the C++ community because this is a significant loss of compatibility with various C-headers used frequently in embedded applications. I wouldn't be surprised if the deprecated features will be reinstated in the language in the end.
I do sort of miss having a basic volatile (although I can write my own type somewhat effectively) just for benchmarking's sake sometimes.
https://news.ycombinator.com/item?id=28731534
https://mobile.twitter.com/ErrataRob/status/1331735383193903...
1) Emulating an ISA includes emulating its memory model. As saagarjha says, this means that Rosetta 2 must (and does) correctly implement total store ordering.
2) There are various ways to implement this. For emulators that include a binary translation layer (that is, that translate x86 opcodes into a sequence of ARM opcodes), one route is to generate the appropriate ARM memory barriers as part of the translation. Even with optimization to reduce the number of necessary barriers, though, this is expensive. Instead, as mmwelt mentions, Apple took an unusual route here. The Apple Silicon MMU can be configured on a per-page basis to use either the relaxed ARM memory model or the TSO x86 memory model. There is a performance cost at the hardware level for using TSO, and there is a cost in silicon area for supporting both; but from the point of view of Rosetta 2, all it has to do is mark x86-accessed pages as TSO and the hardware takes care of the details, no software memory barriers needed.
> You can also select multi-core settings, as shown here... These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core.
https://news.ycombinator.com/item?id=28732273
zamadatix interprets this as Microsoft saying that, by default, Windows on ARM runs x86 apps without x86 TSO, and turns on extra memory barriers via per-app compatibility settings. But if an app needs TSO and isn't in Windows's database, it will crash or silently corrupt data.
Because on x86 it is, no special barriers or instructions necessary.
    mov [shared_data], 1
    mov [release_flag], 1
That said, heuristics are used to speed it up. I would recommend not sharing values in the stack between threads for synchronisation for example.