Wild performance tricks (opens in new tab)

(davidlattimore.github.io)

138 pointstbillington7mo ago82 comments

82 comments

I’d strongly caution against many of those “performance tricks.” Spawning an asynchronous task on a separate thread, often with a heap-allocated handle, solely to deallocate a local object is a dubious pattern — especially given how typical allocators behave under the hood.

I frequently encounter use-cases akin to the “Sharded Vec Writer” idea, and I agree it can be valuable. But if performance is a genuine requirement, the implementation needs to be very different. I once attempted to build a general-purpose trait for performing parallel in-place updates of a Vec<T>, and found it extremely difficult to express cleanly in Rust without degenerating into unsafe or brittle abstractions.

sweetjuly7mo ago

> especially given how typical allocators behave under the hood.

To say more about it: nearly any modern high performance allocator will maintain a local (private) cache of freed chunks.

This is useful, for example, if you're allocating and deallocating about the same amount of memory/chunk size over and over again since it means you can avoid entering the global part of the allocator (which generally requires locking, etc.).

If you make an allocation while the cache is empty, you have to go to the global allocator to refill your cache (usually with several chunks). Similarly, if you free and find your local cache is full, you will need to return some memory to the global allocator (usually you drain several chunks from your cache at once so that you don't hit this condition constantly).

If you are almost always allocating on one thread and deallocating on another, you end up increasing contention in the allocator as you will (likely) end up filling/draining from the global allocator far more often than if you kept in on just one CPU. Depending on your specific application, maybe this performance loss is inconsequential compared to the value of not having to actually call free on some critical path, but it's a choice you should think carefully about and profile for.

pjmlp7mo ago

This is exactly how C++/WinRT works, because people praising reference counting as GC algorithm often forget about possible stack overflows (if the destructor/Drop trait is badly written), or stop-the-world pauses when there is a domino effect of a reference reaching zero in a graph or tree structure.

So in C++/WinRT, which is basically the current C++ projection for COM and WinRT components, the framework moves the objects into a background thread before deletion, as such that those issues don't affect the performance of the main execution thread.

And given it is done by the same team, I would bet Rust/Windows-rs has the same optimization in place for COM/WinRT components.

malkia7mo ago

Some allocators may even "hold" on to the freed (from another thread) memory, until the original thread deletes it (which is not the case here), or that thread dies, and then they go on "gc"-ing it.

quotemstr7mo ago

> Even if it were stable, it only works with slices of primitive types, so we’d have to lose our newtypes (SymbolId etc).

That's weird. I'd expect it to work with _any_ type, primitive or not, newtype or not, with a sufficiently simple memory layout, the rough equivalent of what C++ calls a "standard-layout type" or (formerly) a "POD".

I don't like magical stdlibs and I don't like user types being less powerful than built-in ones.

Clever workaround doing a no-op transformation of the whole vector though! Very nearly zero-cost.

> It would be possible to ensure that the proper Vec was restored for use-cases where that was important, however it would add extra complexity and might be enough to convince me that it’d be better to just use transmute.

Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.

> The optimisation in the Rust standard library that allows reuse of the heap allocation will only actually work if the size and alignment of T and U are the same

Shouldn't it work when T and U are the same size and T has stricter alignment requirements than U but not exactly the same alignment? In this situation, any U would be properly aligned because T is even more aligned.

aw16211077mo ago

> I'd expect it to work with _any_ type, primitive or not, newtype or not, with a sufficiently simple memory layout, the rough equivalent of what C++ calls a "standard-layout type" or (formerly) a "POD".

This might be related in part to the fact that Rust chose to create specific AtomicU8/AtomicU16/etc. types instead of going for Atomic<T> like in C++. The reasoning for forgoing the latter is [0]:

> However the consensus was that having unsupported atomic types either fail at monomorphization time or fall back to lock-based implementations was undesirable.

That doesn't mean that one couldn't hypothetically try to write from_mut_slice<T> where T is a transparent newtype over one of the supported atomics, but I'm not sure whether that function signature is expressible at the moment. Maybe if/when safe transmutes land, since from_mut_slice is basically just doing a transmute?

> Shouldn't it work when T and U are the same size and T has stricter alignment requirements than U but not exactly the same alignment? In this situation, any U would be properly aligned because T is even more aligned.

I think this optimization does what you say? A quick skim of the source code [1] seems to show that the alignments don't have to exactly match:

    //! # Layout constraints
    //! <snip>
    //! Alignments of `T` must be the same or larger than `U`. Since alignments are always a power
    //! of two _larger_ implies _is a multiple of_.

And later:

    const fn in_place_collectible<DEST, SRC>(
        step_merge: Option<NonZeroUsize>,
        step_expand: Option<NonZeroUsize>,
    ) -> bool {
        if const { SRC::IS_ZST || DEST::IS_ZST || mem::align_of::<SRC>() < mem::align_of::<DEST>() } {
            return false;
        }
        // Other code that deals with non-alignment conditions
    }

[0]: https://github.com/Amanieu/rfcs/blob/more_atomic_types/text/...

[1]: https://github.com/rust-lang/rust/blob/c58a5da7d48ff3887afe4...

quotemstr7mo ago

> I think this optimization does what you say?

Cool. Thanks for checking! I guess the article should be tweaked a bit --- it states that the alignment has to match exactly.

0x1ceb00da7mo ago

> Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.

Not really. Panics are supposed to be used in super exceptional situations, where the only course of action is to abort the whole unit of work you're doing and throw away all the resources. However you do have to be careful in critical code because things like integer overflow can also raise a panic.

quotemstr7mo ago

> be careful in critical code because things like integer overflow can also raise a panic

So you can basically panic anywhere. I understand people have looked at no-panic markers (like C++ noexcept) but the proposals haven't gone anywhere. Consequently, you need to maintain the basic exception safety guarantee [1] at all times. In safe Rust, the compiler enforces this level of safety in most cases on its own, but there are situations in which you can temporarily violate program invariants and panic before being able to restore them. (A classic example is debiting from one bank account before crediting to another. If you panic in the middle, the money is lost.)

If you want that bank code to be robust against panics, you need to use something like https://docs.rs/scopeguard/latest/scopeguard/

In unsafe Rust, you basically have the same burden of exception safety that C++ creates, except your job as an unsafe Rust programmer is harder than a C++ programmer's because Rust doesn't have a noexcept. Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.

Most Rust programmers don't think much about panics, so I assume most Rust programs are full of latent bugs of this sort. That's why I usually recommend panic=abort.

[1] https://en.wikipedia.org/wiki/Exception_safety#Classificatio...

AnyTimeTraveler7mo ago

Rust's number types have functions like "wrapping_add" or "overflowing_add", which do not panic when overflowing and instead explicitly wrap around or return a result that must be checked.

You can easily write code that does not contain any possible panic points, if you want.

1 more reply

0x1ceb00da7mo ago

> If you panic in the middle, the money is lost.

Wouldn't a relational database help deal with this?

> Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.

Unsafe blocks usually contain very low level code, so you can understand what your code does very accurately. If the unsafe code calls a dependency which calls 150 other dependencies transitively, yeah, that's going to be pretty bad.

1 more reply

vlovich1237mo ago

> However you do have to be careful in critical code because things like integer overflow can also raise a panic.

This is incorrect. Only in debug builds does it raise a panic. In release Rust has to make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement. Only in debug will signed overflow panic. Unsigned math never panics - it’s always going to overflow 2’s complement.

quotemstr7mo ago

> Only in debug builds does it raise a panic.

Correctness in debug builds is important, isn't it?

That said, panic on integer overflow in debug builds is unfortunate behavior. Overflow should cause an abort, not a panic.

> make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement

In C++, signed overflow is undefined behavior, not wraparound. This property is useful to the optimizer for things like inferring loop bounds. The optimizer has less flexibility in equivalent Rust code.

2 more replies

0x1ceb00da7mo ago

You can enable overflow panics in release build, so if you're a library, you have to play it safe because you don't know how people will build your library.

emtel7mo ago

I read this a few weeks ago, and was inspired to do some experiments of my own on compiler explorer, and found the following interesting things:

  // This compiles down to a memmove() call
  let my_vec: Vec<_> = my_vec.into_iter().skip(n).collect();

  // this results in significantly smaller machine code than `v.retain(f)`
  v.into_iter().filter(f).collect();

This was all with -C opt-level=2. I only looked at generated code size, didn't have time to benchmark any of these.

ComputerGuru7mo ago

I don't like relying on (release-only) llvm optimizations for a number of reasons, but primarily a) they break between releases, more often than you'd think, b) they're part of the reason why debug builds of rust software are so much slower (at runtime) than release builds, c) they're much harder to verify (and very opaque).

For non-performance-sensitive code, sure, go ahead and rely on the rust compiler to compile away the allocation of a whole new vector of a different type to convert from T to AtomicT, but where the performance matters, for my money I would go with the transmute 100% of the time (assuming the underlying type was decorated with #[transparent], though it would be nice if we could statically assert that). It'll perform better in debug mode, it's obvious what you are doing, it's guaranteed not to break in a minor rustc update, and it'll work with &mut [T] instead of an owned Vec<T> (which is a big one).

mjmas7mo ago

The particular optimisation for non-copying Vec -> IntoIter -> Vec transform is actually hard coded in the standard library as a special case of collecting an Iterator into a Vec. It doesn't rely on the backend for this.

Though this optimisation is treated as an implementation detail [1].

[1]: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...

ComputerGuru7mo ago

Thanks for pointing that out, TIL!

vlovich1237mo ago

> Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.

I liked most of the tricks but this one seems pointless. This is no different than transmute as accessing the borrower requires an assume_init which I believe is technically UB when called on an uninit. Unless the point is that you’re going to be working with Owned but want to just transmute the Vec safely.

Overall I like the into_iter/collect trick to avoid unsafe. It was also most of the article, just various ways to apply this trick in different scenarios. Very neat!

ComputerGuru7mo ago

You misunderstood the purpose of that trick. The vector is not going to be accessed again, the idea is to move it to another thread so it can be dropped in parallel (never accessed).

Strilanc7mo ago

Every one of these "performance tricks" is describing how to convince rust's borrow checker that you're allowed to do a thing. It's more like "performance permission slips".

oleganza7mo ago

You don't have to play this game - you can always write within unsafe { ... } like in plain old C or C++. But people do choose to play this game because it helps them to write code that is also correct, where "correct" has an old-school meaning of "actually doing what it is supposed to do and not doing what it's not supposed to".

ManlyBread7mo ago

That just makes it seem like there's no point in using this language in the first place.

3 more replies

Ar-Curunir7mo ago

This is an issue that you would face in any language with strong typing. It only rears its head in Rust because Rust tries to give you both low-level control and strong types.

For example, in something like Go (which has a weaker type system than Rust), you wouldn't think twice about, paying for the re-allocation in buffer-reuse example.

Of course, in something like C or C++ you could do these things via simple pointer casts, but then you run the risk of violating some undefined behaviour.

jstimpfle7mo ago

In C I wouldn't use such a fluffy high-level approach in the first place. I wouldn't use contiguous unbounded vec-slices. And no, I wouldn't attempt trickery with overwriting input buffers. That's a bad inflexible approach that will bite at the next refactor. Instead, I would first make sure there's a way to cheaply allocate fixed size buffers (like 4 K buffers or whatever) and stream into those. Memory should be used in a allocate/write-once/release fashion whenever possible. This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.

4 more replies

jandrewrogers7mo ago

> in something like C or C++ you could do these things via simple pointer casts

No you don't. You explicitly start a new object lifetime at the address, either of the same type or a different type. There are standard mechanisms for this.

Developers that can't be bothered to do things correctly is why languages like Rust exist.

Ar-Curunir7mo ago

And that is safer... how?

1 more reply

jadbox7mo ago

When I read articles like this, I just relish how much Go, Zig, and Bun make my life to much easier in terms of solving performance issues with reasonable trade-offs.

pjmlp7mo ago

It is more of a culture thing, most compiled languages have been fast enough for quite some time.

People using systems languages more often than not go down the rabbit hole of performance tuning, many times without a profiler, because still isn't the amount of ms that is supposed to be.

In reality unless one is writing an OS component, rendering engine, some kind of real time constrained code, or server code for "Webscale", the performance is more than enough for 99% of the use cases, in any modern compiler.

kibwen7mo ago

...Except that Rust is thread-safe, so expressing your algorithm in terms that the borrow checker accepts makes safe parallelism possible, as shown in the example using Rayon to trivially parallelize an operation. This is the whole point of Rust, and to say that C and C++ fail at thread-safety would be the understatement of the century.

jstimpfle7mo ago

Memory safe doesn't imply memory efficient and reasonable. Wrapping everything in Arc is the opposite of simple & easy.

kibwen7mo ago

If you're writing C code that shares memory between threads without some sort of synchronization primitive, then as your doctor I'm afraid I'm going to have to ask you to sit down, because I have some very bad news for you.

1 more reply

jstimpfle7mo ago

Yup -- yet another article only solving language level problems instead of teaching something about real constraints (i.e. hardware performance characteristics). Booooring. This kind of article is why I still haven't mustered the energy to get up to date with Rust. I'm still writing C (or C-in-C++) and having fun, most of the time feeling like I'm solving actual technical problems.

Ar-Curunir7mo ago

This was an article distilled from a talk at a Rust developers conference. Onbviously it’s going to make most sense to rust devs, and will seem unnecessary to non-Rust devs.

the-smug-one7mo ago

The rayon thing is neat.

Cheetahlee017mo ago

Just want to know some hacking tricks

0x1ceb00da7mo ago

> It’d be reasonable to think that this will have a runtime cost, however it doesn’t. The reason is that the Rust standard library has a nice optimisation in it that when we consume a Vec and collect the result into a new Vec, in many circumstances, the heap allocation of the original Vec can be reused. This applies in this case. But what even with the heap allocation being reused, we’re still looping over all the elements to transform them right? Because the in-memory representation of an AtomicSymbolId is identical to that of a SymbolId, our loop becomes a no-op and is optimised away.

Those optimisations that this code relies on are literally undefined behaviour. The compiler doesn't guarantee it's gonna apply those optimisations. So your code might suddenly become super slow and you'll have to go digging in to see why. Is this undefined behaviour better than just having an unsafe block? I'm not so sure. The unsafe code will be easier to read and you won't need any comments or a blog to explain why we're doing voodoo stuff because the logic of the code will explain its intentions.

steveklabnik7mo ago

> Those optimisations that this code relies on are literally undefined behaviour.

You cannot get undefined behavior in Rust without an unsafe block.

> The compiler doesn't guarantee it's gonna apply those optimisations.

This is a different concept than UB.

However, for the "heap allocation can be re-used", Rust does talk about this: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...

It cannot guarantee it for arbitrary iterators, but the map().collect() re-use is well known, and the machinery is there to do this, so while other implementations may not, rustc always will.

Basically, it is implementation-defined behavior. (If it were C/C++ it would be 'unspecified behavior' because rustc does not document exactly when it does this, but this is a very fine nitpick and not language Rust currently uses, though I'd argue it should.)

> So your code might suddenly become super slow and you'll have to go digging in to see why.

That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.

0x1ceb00da7mo ago

> That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.

But benchmarks won't tell us which optimisation suddenly stopped working. This looks so similar to the argument against UB to me. Something breaks, but you don't know what, where, and why.

1 more reply

stouset7mo ago

You're misusing the term "undefined behavior". You can certainly say that these kinds of performance optimizations aren't guaranteed.

1 more reply

j / k navigate · click thread line to collapse

82 comments

ashvardanian7mo ago

sweetjuly7mo ago

> especially given how typical allocators behave under the hood.

To say more about it: nearly any modern high performance allocator will maintain a local (private) cache of freed chunks.

pjmlp7mo ago

And given it is done by the same team, I would bet Rust/Windows-rs has the same optimization in place for COM/WinRT components.

malkia7mo ago

Some allocators may even "hold" on to the freed (from another thread) memory, until the original thread deletes it (which is not the case here), or that thread dies, and then they go on "gc"-ing it.

quotemstr7mo ago

> Even if it were stable, it only works with slices of primitive types, so we’d have to lose our newtypes (SymbolId etc).

I don't like magical stdlibs and I don't like user types being less powerful than built-in ones.

Clever workaround doing a no-op transformation of the whole vector though! Very nearly zero-cost.

Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.

> The optimisation in the Rust standard library that allows reuse of the heap allocation will only actually work if the size and alignment of T and U are the same

aw16211077mo ago

This might be related in part to the fact that Rust chose to create specific AtomicU8/AtomicU16/etc. types instead of going for Atomic<T> like in C++. The reasoning for forgoing the latter is [0]:

> However the consensus was that having unsupported atomic types either fail at monomorphization time or fall back to lock-based implementations was undesirable.

I think this optimization does what you say? A quick skim of the source code [1] seems to show that the alignments don't have to exactly match:

    //! # Layout constraints
    //! <snip>
    //! Alignments of `T` must be the same or larger than `U`. Since alignments are always a power
    //! of two _larger_ implies _is a multiple of_.

And later:

    const fn in_place_collectible<DEST, SRC>(
        step_merge: Option<NonZeroUsize>,
        step_expand: Option<NonZeroUsize>,
    ) -> bool {
        if const { SRC::IS_ZST || DEST::IS_ZST || mem::align_of::<SRC>() < mem::align_of::<DEST>() } {
            return false;
        }
        // Other code that deals with non-alignment conditions
    }

[0]: https://github.com/Amanieu/rfcs/blob/more_atomic_types/text/...

[1]: https://github.com/rust-lang/rust/blob/c58a5da7d48ff3887afe4...

quotemstr7mo ago

> I think this optimization does what you say?

Cool. Thanks for checking! I guess the article should be tweaked a bit --- it states that the alignment has to match exactly.

0x1ceb00da7mo ago

> Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.

quotemstr7mo ago

> be careful in critical code because things like integer overflow can also raise a panic

If you want that bank code to be robust against panics, you need to use something like https://docs.rs/scopeguard/latest/scopeguard/

Most Rust programmers don't think much about panics, so I assume most Rust programs are full of latent bugs of this sort. That's why I usually recommend panic=abort.

[1] https://en.wikipedia.org/wiki/Exception_safety#Classificatio...

AnyTimeTraveler7mo ago

Rust's number types have functions like "wrapping_add" or "overflowing_add", which do not panic when overflowing and instead explicitly wrap around or return a result that must be checked.

You can easily write code that does not contain any possible panic points, if you want.

1 more reply

0x1ceb00da7mo ago

> If you panic in the middle, the money is lost.

Wouldn't a relational database help deal with this?

> Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.

1 more reply

vlovich1237mo ago

> However you do have to be careful in critical code because things like integer overflow can also raise a panic.

quotemstr7mo ago

> Only in debug builds does it raise a panic.

Correctness in debug builds is important, isn't it?

That said, panic on integer overflow in debug builds is unfortunate behavior. Overflow should cause an abort, not a panic.

> make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement

2 more replies

0x1ceb00da7mo ago

You can enable overflow panics in release build, so if you're a library, you have to play it safe because you don't know how people will build your library.

emtel7mo ago

I read this a few weeks ago, and was inspired to do some experiments of my own on compiler explorer, and found the following interesting things:

  // This compiles down to a memmove() call
  let my_vec: Vec<_> = my_vec.into_iter().skip(n).collect();

  // this results in significantly smaller machine code than `v.retain(f)`
  v.into_iter().filter(f).collect();

This was all with -C opt-level=2. I only looked at generated code size, didn't have time to benchmark any of these.

ComputerGuru7mo ago

mjmas7mo ago

Though this optimisation is treated as an implementation detail [1].

[1]: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...

ComputerGuru7mo ago

Thanks for pointing that out, TIL!

vlovich1237mo ago

> Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.

Overall I like the into_iter/collect trick to avoid unsafe. It was also most of the article, just various ways to apply this trick in different scenarios. Very neat!

ComputerGuru7mo ago

You misunderstood the purpose of that trick. The vector is not going to be accessed again, the idea is to move it to another thread so it can be dropped in parallel (never accessed).

Strilanc7mo ago

Every one of these "performance tricks" is describing how to convince rust's borrow checker that you're allowed to do a thing. It's more like "performance permission slips".

oleganza7mo ago

ManlyBread7mo ago

That just makes it seem like there's no point in using this language in the first place.

3 more replies

Ar-Curunir7mo ago

This is an issue that you would face in any language with strong typing. It only rears its head in Rust because Rust tries to give you both low-level control and strong types.

For example, in something like Go (which has a weaker type system than Rust), you wouldn't think twice about, paying for the re-allocation in buffer-reuse example.

Of course, in something like C or C++ you could do these things via simple pointer casts, but then you run the risk of violating some undefined behaviour.

jstimpfle7mo ago

4 more replies

jandrewrogers7mo ago

> in something like C or C++ you could do these things via simple pointer casts

No you don't. You explicitly start a new object lifetime at the address, either of the same type or a different type. There are standard mechanisms for this.

Developers that can't be bothered to do things correctly is why languages like Rust exist.

Ar-Curunir7mo ago

And that is safer... how?

1 more reply

jadbox7mo ago

When I read articles like this, I just relish how much Go, Zig, and Bun make my life to much easier in terms of solving performance issues with reasonable trade-offs.

pjmlp7mo ago

It is more of a culture thing, most compiled languages have been fast enough for quite some time.

People using systems languages more often than not go down the rabbit hole of performance tuning, many times without a profiler, because still isn't the amount of ms that is supposed to be.

kibwen7mo ago

jstimpfle7mo ago

Memory safe doesn't imply memory efficient and reasonable. Wrapping everything in Arc is the opposite of simple & easy.

kibwen7mo ago

1 more reply

jstimpfle7mo ago

Ar-Curunir7mo ago

This was an article distilled from a talk at a Rust developers conference. Onbviously it’s going to make most sense to rust devs, and will seem unnecessary to non-Rust devs.

the-smug-one7mo ago

The rayon thing is neat.

Cheetahlee017mo ago

Just want to know some hacking tricks

0x1ceb00da7mo ago

steveklabnik7mo ago

> Those optimisations that this code relies on are literally undefined behaviour.

You cannot get undefined behavior in Rust without an unsafe block.

> The compiler doesn't guarantee it's gonna apply those optimisations.

This is a different concept than UB.

However, for the "heap allocation can be re-used", Rust does talk about this: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...

It cannot guarantee it for arbitrary iterators, but the map().collect() re-use is well known, and the machinery is there to do this, so while other implementations may not, rustc always will.

> So your code might suddenly become super slow and you'll have to go digging in to see why.

That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.

0x1ceb00da7mo ago

> That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.

But benchmarks won't tell us which optimisation suddenly stopped working. This looks so similar to the argument against UB to me. Something breaks, but you don't know what, where, and why.

1 more reply

stouset7mo ago

You're misusing the term "undefined behavior". You can certainly say that these kinds of performance optimizations aren't guaranteed.

1 more reply

j / k navigate · click thread line to collapse