From Rust to reality: The hidden journey of fetch_max

(questdb.com)

259 points · bluestreak · 7mo ago · 59 comments

jerrinot 7mo ago

Hi, author here. My superpower is spending unreasonable amounts of time researching things with no practical purpose. Occasionally I blog about it - as a warning to others.

trws 7mo ago

I liked the article. I saw your PS that we added it to the working draft for C++26; we also made it part of OpenMP as of 5.0, I think. It’s sometimes a hardware atomic, like on Arm, but what made the case was that it’s common to implement it sub-optimally even on x86 or LL-SC architectures. Often the generic CAS loop gets used, like in your lambda example, but it lacks an early cutout since you can ignore any input value that’s on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can’t matter. It can also benefit from using slightly different memory orders than the default on architectures like ppc64. It’s a surprisingly useful op to support that way.

If this kind of thing floats your boat, you might be interested in the non-reading variants of these as well. Mostly for things like add, max, etc but some recent architectures actually offer alternate operations to skip the read-back. The paper calls them “atomic reduction operations” https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...

anematode 7mo ago

Curious: even with hardware atomics, wouldn't it be a good idea to first perform a non-atomic load to check whether the store might be necessary (which would require the cache line to be locked), then only run the atomic max if it might change the value?


SkiFire13 7mo ago

> but it lacks an early cutout since you can ignore any input value that’s on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can’t matter.

I believe this is a bit trickier than that: you would also need at least some kind of atomic barrier to preserve the ordering semantics of the successful update case.
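The early cutout quoted above can be sketched in Rust roughly as follows. This is a minimal illustration, not the standard library's implementation: the function name `fetch_max_early_exit` is made up for this example, and it deliberately uses only `Relaxed` ordering, so the ordering concern raised in the reply does not arise. With a stronger ordering, the early-return path (which performs only a load) would need extra care to give the same guarantees as a successful update.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper (not a std API): a CAS-loop max that bails out
/// as soon as the observed value shows the update cannot matter.
/// Returns the value seen before any update, like `fetch_max` does.
fn fetch_max_early_exit(cell: &AtomicU64, candidate: u64) -> u64 {
    // Cheap atomic read first: no store, no read-modify-write.
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        if candidate <= current {
            // The stored value is already at least as large: nothing to write.
            return current;
        }
        match cell.compare_exchange_weak(
            current,
            candidate,
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            Ok(previous) => return previous,
            // A failed CAS hands back the freshly observed value, so the
            // next iteration re-checks the early-exit condition for free.
            Err(observed) => current = observed,
        }
    }
}
```

Calling it with a candidate smaller than the stored value returns after the initial load without attempting any write.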

Ethee 7mo ago

It's these kinds of posts that I appreciate reading the most, so thank you for sharing!

owls-on-wires 7mo ago

“…no practical purpose” Nonsense, I learned something about compilation today. Thank you for sharing.

ajayka 7mo ago

Great article! Did you end up hiring the candidate?!

xarope 7mo ago

looks around room, heads nodding.

Ah, a magician. welcome.

michalsustr 7mo ago

Thank you for sharing, loved the article!

tux3 7mo ago

This blog sent me into a memory models rabbit hole again. Each time I end up feeling like I'm finally starting to get it, only for a 6-line litmus test with 4 loads and 2 stores to send me crashing back down.

It makes me feel a little better reading about the history of memory models in CPUs. If this stuff wasn't intuitive to Intel either, I'm at least in good company in being confused (https://research.swtch.com/hwmm#path_to_x86-tso)

I actually knew about fetch_max from "implementing" the corresponding instruction (risc-v amomax), but I haven't done any of the fun parts yet since my soft-CPU still only has a single core.

jamesmunns 7mo ago

If you haven't seen it, Mara Bos' "Rust Atomics and Locks"[0] is an excellent book on this topic, even if you aren't particularly interested in Rust.

[0]: https://marabos.nl/atomics/

tux3 7mo ago

Thank you, it looks lovely!

Arnavion 7mo ago

>Hold on. This wasn't a wrapper around a loop pattern - this was a first-class atomic operation, sitting right there next to fetch_add and fetch_or. Java doesn't have this. C++ doesn't have this. How could Rust just... have this?

C++26 (work-in-progress) does have std::atomic<T>::fetch_max. Not implemented in any toolchains yet, though.

https://en.cppreference.com/w/cpp/atomic/atomic/fetch_max

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p04...

bilkow 7mo ago

That info is included later in the article:

> PS: After conducting this journey I learned that C++26 adds fetch_max too!

orlp 7mo ago

Aarch64 does indeed have a proper atomic max, but even on x86-64 you can get a wait-free atomic max as long as you only need to support integers below 64. In that case you can simply do a `lock or` with 1 << i, and the highest set bit is your maximum. You can even support larger ranges by using multiple registers, e.g. four 64-bit registers for a u8 maximum.

In most cases it's even better to just store a maximum per thread separately and loop over all threads once to compute the current maximum if you really need it.

jerrinot 7mo ago

That’s a neat trick, albeit with limited applicability given the very narrow range. Thanks for sharing!
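The bitmask trick discussed above can be sketched in Rust as follows. The `BitmaskMax` type and its method names are invented for this illustration. The sketch assumes values in the range 0..=63 and relies on `fetch_or`, which on x86-64 is typically lowered to a `lock or` when its return value is unused; the exact instruction depends on the compiler and target.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical wrapper (name invented for this sketch): tracks the
/// maximum of values in 0..=63 as a set of bits in one 64-bit word.
struct BitmaskMax(AtomicU64);

impl BitmaskMax {
    fn new() -> Self {
        BitmaskMax(AtomicU64::new(0))
    }

    /// Record a value by setting its bit. A single atomic OR, no retry loop.
    fn record(&self, value: u32) {
        assert!(value < 64, "this sketch only supports values 0..=63");
        self.0.fetch_or(1u64 << value, Ordering::Relaxed);
    }

    /// The maximum recorded so far is the index of the highest set bit.
    fn max(&self) -> Option<u32> {
        let bits = self.0.load(Ordering::Relaxed);
        if bits == 0 {
            None
        } else {
            Some(63 - bits.leading_zeros())
        }
    }
}
```

Because every writer performs exactly one atomic OR, no thread ever retries, which is where the wait-free property comes from.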

markcjeffrey 7mo ago

Related: fetch_max is an instance of what the following SPAA 2013 paper calls an atomic "priority update" or atomic "write-with-max". This type of atomic operation can have much lower contention than its counterparts like atomic increment.

https://doi.org/10.1145/2486159.2486189 https://jshun.csail.mit.edu/contention.pdf

Jweb_Guru 7mo ago

One of the most practically important papers out there, I wish it were better known (but fortunately I think the "right" people know about it).

yshui 7mo ago

That's a cool find. I wonder if LLVM also does the reverse, where it pattern-matches handwritten CAS loops and transforms them into native ARM64 instructions.

jerrinot 7mo ago

That's a very good question. A proper compiler engineer would know, but I will do my best to find something and report back.

Edit: I could not find any pass that pattern-matches CAS loops in order to replace them. The closest thing I could find is this pass: https://github.com/llvm/llvm-project/blob/06fb26c3a4ede66755... I reckon one could write a similar pass to recognize CAS idioms, but its usefulness would probably be rather limited and not worth the effort/risks.

tialaramex 7mo ago

The term of art for this technique is "idiom recognition" and it's proper ancient, like, APL compilers did have some idiom recognition 50+ years ago.

An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction, C provides neither intrinsics (like Rust) nor a dedicated "popcount" feature, so you can't write that but it's obviously what you want here and yup an optimising C compiler will do that.

However, LLVM is dealing with an IR generated by other compiler folk so I think it probably has less use for idiom recognition. Clang would do the recognition and lower to the same LLVM IR as Rust does for its intrinsic population count core::intrinsics::ctpop so the LLVM backend doesn't need to spot this. I might be wrong, but I think that's how it works.

toth 7mo ago

> An example you'll see in say a modern C compiler is that if you write the obvious loop to calculate how many bits are set in an int, the actual machine code on a brand new CPU should be a single population count instruction, C provides neither intrinsics (like Rust) nor a dedicated "popcount" feature, so you can't write that but it's obviously what you want here and yup an optimising C compiler will do that.

C compilers definitely have intrinsics for this; in GCC, for instance, it is `__builtin_popcount`.

And apparently C has even had standard support for it since C23: it's `stdc_count_ones` [1], and in C++ you have `std::popcount` [2].

[1] https://en.cppreference.com/w/c/numeric/bit_manip.html [2] https://en.cppreference.com/w/cpp/numeric/popcount.html
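For comparison, here is what the two styles discussed above look like in Rust. The function names are made up for this sketch. The first function is one common form of the "obvious loop" (Kernighan's variant, which clears the lowest set bit on each pass); the second calls the built-in `count_ones` method. Whether an optimizer actually turns the loop into a single population-count instruction depends on the compiler version, the optimization level, and the target features, so treat that as something to check on Godbolt rather than a guarantee.

```rust
/// Hypothetical name: a hand-written bit-counting loop, the kind of
/// idiom an optimizer may recognize.
fn popcount_loop(mut x: u32) -> u32 {
    let mut count = 0;
    while x != 0 {
        // Clears the lowest set bit; x is non-zero here, so x - 1 cannot underflow.
        x &= x - 1;
        count += 1;
    }
    count
}

/// Hypothetical name: the same result via the built-in method, which
/// needs no idiom recognition at all.
fn popcount_builtin(x: u32) -> u32 {
    x.count_ones()
}
```

Both functions return the same result for every input; the difference is only in how much work is left to the optimizer.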


Arnavion 7mo ago

I checked Godbolt, with RISC-V instead of ARM since I'm more familiar with that, and it doesn't look like it.

https://gcc.godbolt.org/z/b5s4WjnTG

(amomax is the atomic fetch-max instruction. lr and sc are load-reserved and store-conditional instructions; sc is like a regular store except it only succeeds if the address was not modified since the previous lr that accessed it. IOW the assembly is basically one-to-one with the C source.)

TuxSH 7mo ago

Somewhat related: I find it annoying that C++ doesn't have fetch_update and that Rust's fetch_update doesn't support LL/SC.

Rust fetch_update uses the lowest common denominator, CAS, regardless of platform: https://godbolt.org/z/ncssGnsfx (see the call __aarch64_cas8_acq_rel). In hot loops this can mean double-digit perf loss.
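For readers who have not used it, a small sketch of `fetch_update` follows. The `add_capped` helper and its parameters are invented for this example. The closure receives the current value and returns `Some(new)` to attempt a store or `None` to give up; the method retries the closure whenever the underlying compare-exchange fails, which is the CAS-based behaviour described above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: add `delta` to the counter unless the result
/// would overflow or exceed `cap`.
/// Ok(previous) means the update was stored; Err(current) means the
/// closure declined and nothing was written.
fn add_capped(cell: &AtomicU64, delta: u64, cap: u64) -> Result<u64, u64> {
    cell.fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
        // checked_add yields None on overflow, which aborts the update.
        let next = current.checked_add(delta)?;
        if next <= cap {
            Some(next)
        } else {
            None
        }
    })
}
```

The closure may run several times under contention, so it should be cheap and free of side effects.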

gpderetta 7mo ago

It is very hard to support LL/SC in generalized user code, as the specific rules of what causes an LL lease to fail are generally non-portable (possibly not even within an architecture).

It could be implemented with a CAS fallback of course, but that seems like a performance trap.

You could add the logic to the compiler to detect which specific code sequences are LL/SC safe, but at that point just providing built-ins for the most common operations is simpler.

minedwiz 7mo ago

Did he get the job?

brcmthrowaway 7mo ago

Wasn't a culture fit

delifue 7mo ago

When reading, I expected it to mention that having each thread maintain a thread-local max and periodically sync it to a global atomic can improve performance.

jerrinot 7mo ago

I expect candidates to suggest similar optimisations, but I felt it was unnecessary for the article itself.
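The thread-local optimisation mentioned above might look like the following sketch. The `parallel_max` function and its chunked input are assumptions made for this example. For simplicity each worker publishes its local maximum once, at the end of its chunk; a long-running worker would publish periodically instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical helper: each worker computes a local maximum with plain
/// (non-atomic) operations and touches the shared atomic only once.
fn parallel_max(chunks: &[Vec<u64>]) -> u64 {
    let global = AtomicU64::new(0);
    thread::scope(|s| {
        for chunk in chunks {
            let global = &global;
            s.spawn(move || {
                // No contention here: the local max lives in this thread only.
                let local = chunk.iter().copied().max().unwrap_or(0);
                // One atomic operation per worker instead of one per element.
                global.fetch_max(local, Ordering::Relaxed);
            });
        }
    });
    // All workers have been joined when the scope ends.
    global.load(Ordering::Relaxed)
}
```

The trade-off is staleness: between publications the global value can lag behind the true maximum, which is acceptable for statistics but not for every use.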

vips7L 7mo ago

Fun read! Makes me realize I should probably go reread Java Concurrency in Practice.

ShroudedNight 7mo ago

Was this compiled at O0? The generated code looks unnecessarily long-winded - at the very least I would expect the match jump table to get culled to only the Relaxed implementation.

ambicapter 7mo ago

> Note we did not ask rustc to optimize the code. If we did, the compiler would generate more efficient assembly: No spills to the stack, fewer jumps, no dispatch on memory ordering, etc. But I wanted to keep the output as close to the original IR as possible to make it easier to follow.

RTFA

ShroudedNight 7mo ago

I did; however, that callout did, admittedly, slip past me.


IshKebab 7mo ago

Yeah this comes from ARM and AXI, which has atomic max (and min, add, set, clear and xor). I assume ARM has all the corresponding instructions. RISC-V also has all of these in Zaamo.

MountainTheme12 7mo ago

Only slightly related, but GPUs also have such instructions (exposed as InterlockedMax in HLSL and atomicMax in GLSL and CUDA).

anematode 7mo ago

Great article :)
