* The example code is obviously contrived. The real gist is that massive deallocations in the UI thread cause lag, which the example code proves. That very thing can easily happen in the real world.
* I didn't see any difference on my machine between a debug build and a release build.
* The example is preforming 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.
* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.
* As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short lived its not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.
This reminds me of the exploding ultimate GC technique [1]:
> on-board software for a missile...chief software engineer said "Of course it leaks". ... They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.
[1]: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...
[0]: https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...
And then someone tries to use your compiler as a service (code analysis, change triggered compiler) and it's a dead end
If I allocate just enough memory, but not too much, then pauses for defragmentation of free space may be costed to the code that calls me.
A solution to this that I’ve seen in soft real time systems is to amortize cleanups across all allocations. Every allocation performs n steps of a cleanup process prior to receiving a block of memory. In which case most of the bad actors have to pay part of the cost of memory overhead.
Might be good for Rust to try something in that general realm, or in the cleanup side may be easier to tack on. On free, set a ceiling for operations and queue what is left. That would at least peak shave.
In theory, you could also use a memory pool in Rust but I think the standard library uses malloc without some way of overriding this behaviour.
Edit: this will also break any code that relies on Drop being called for clean up, but that is already a "suspect"/incorrect pattern because there are no assurances that it will ever run.
#[global_allocator]
static GLOBAL: MyAllocator = MyAllocator;I was thinking of the following code, where I believe the assignment to y is actually free. Though apparently this isn't called a "move".
let x = <<large owned type like [char; 1000]>>;
let y = x;
More info: https://doc.rust-lang.org/rust-by-example/scope/move.htmlI.e. a `HashMap` struct, or `Vec` struct don't directly contain the data.
For example the `Vec` is defined internally as something similar to:
`struct Vec<T> { data: *mut [T], capacity: usize, len: usize, marker: PhantomData<T> }`
(Slightly simplified, not actual Vec type).
So a move of a Vec copies at most 3 usize (24 bytes on 64bit systems), similar thinks apply for a HashMap.
Additionally the copy can often be elided through compiler optimizations.
As a interesting side note a new empty Vec/HashMap will not actually allocate any memory, only once elements get added it will start doing so. This is why it crates vec's of vecs of length 1. Or else it wouldn't need to do "number of element" free calls.
However, these copies can often be elided by optimizations.
In Clang, this flag is `-Xclang -disable-free`. Not from a Jedi...
For example, I had the "million strings" problem once, literally millions. The solution was to put every string into a single large buffer and the pointers in another buffer. Not only I could deallocate everything at once but I also saved a bit of RAM by not aligning (not needed for strings).
Only if badly designed. That is why it is contrived!
> While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon
A program dealing with a million elements of any kind should not be performing naive allocations to begin with.
> we do deallocation trickery in the real world
Skipping deallocations is an optimization, not a design pattern.
In other words, the code needs to keep the ability to perform the deallocation for debugging, testing, usage as a library, etc.
Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.
It's actually a lot to be said for rust that the ownership system lets you transfer freeing responsibility off-thread safely and cheaply in order to not have it block the critical path.
But overall, there's nothing really unexpected here, if you're familiar with memory management.
http://researcher.watson.ibm.com/researcher/files/us-bacon/B...
Takes time proportional the live set times the number of GC runs that happen while the objects are alive. In other words, the longer the objects live, the more GC runs have to scan that object (assuming there is enough activity to trigger the GC), and the worse GC looks.
This can also trivially be done in other languages. Atomically append your pointer to a queue of "large things that need to be freed" and move on as though you had actually called free.
Within a particularly time sensitive loop you can even opt to place pointers into a preallocated array locally. Then once per loop iteration swap that array with the thread handling the deallocations for you. It eats up a bit of CPU time but can significantly reduce latency.
Consider this code
{
Window a;
ClickHandler* b = new ClickHandler(&a);
delete b;
}
Let's say b tries to deregister itself when it's deleted. This code will work as written. But if you defer the deletion of b, then stack allocated Window a may already be gone.That doesn't seem to make intuitive sense. A GC has the same problem.
A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not, and to invoke finalizers. You're beginning your comparison after the mark step, which isn't a fair assessment since what Rust is doing is akin both both the mark and sweep phases.
The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.
The difference between a GC and this kind of memory management is that the GC does the traversal later, at some point, non-deterministically. Rust allows you to decide between deallocating it in place, immediately, or deferring it to a different thread.
A generational/compacting collector traverses pointers from the live roots, and copies everything it finds to the start of its memory space, and then declares the rest unused. If there is 1GB of unused memory, it's irrelevant. Only the things that can be reached are even examined.
As I said, this has the opposite problem. When the live set becomes huge, this can drag performance. When the live set is small, it doesn't matter how much garbage it produces, performance is fast.
Yes, but in practice tracing in a tracing GC is done concurrently and with the help of GC barriers that don't require synchronization and so are generally cheaper than the common mechanisms for reference-counting GC.
> and to invoke finalizers
As others have said, finalizers are very uncommon and, in fact, have been deprecated in Java.
Isn't that incompatible with RAII though?
> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not, and to invoke finalizers.
All garbage collectors start from live objects and only scan those. Then, whatever objects they have not scanned get collected. In copying collectors (like most generational ones), this means that garbage is never touched.
In the mark-and-sweep algorithms, the mark phase still never touches the unreachable objects. However, the sweep phase does need to return those objects to the free list, so it will have to walk them. It will still not do it the same way as malloc/free, as it can walk the heap in order and free unmarked objects as it encounters them, no need to follow pointers, so it may still have better cache performance.
Finalizers introduce extra difficulty, but still the behavior is fundamentally different. What usually happens is that objects which have finalizers are usually remembered in a special list which acts as a GC root itself. When they are only reachable from that list, they get marked so that the finalize will run (usually on a special Finalizer thread). When the Finalizer is finished, and assuming the object was not resurrected, they get removed from the Finalizer list, and now they are not reachable from anywhere at all, so the next GC will finally clean them up. Usually, there is also some API for user code to mark a Finalizable object as 'finalized', which essentially removed it from the Finalizer list early and allows it to be collected as normal, without going through the above process.
And yes, having a large number of finalizable objects in your memory is usually considered a very bad idea. Generally, they are only recommended as a fail-safe measure: you are supposed to do explicit cleaning, but as a fail-safe, to avoid your program crashing in production if a connection or file leak was missed, you also have the Finalizer to throw buckets of water out of your boat (but you should really notice that it is happening and plug that leak, rather then relying on the bucketeer).
As you note that will certainly add some overhead, although that could be minimized by not spawning a fresh thread each time. It could easily reduce latency for a thread the UI is waiting on in many cases.
And this isn't just a help in these contrived examples. I believe process cleanup (an extreme case of cleaning up objects) is one of cases where garbage collection performs better because it doesn't have to unwind the stack, call cleanup functions that are not in the cache, and make a lot of `free` calls to the allocator.
I vaguely remember reading about Google killing processes rather than having them clean up correctly, relying on the OS to properly clean up any resources of significance.
Now this doesn't mean you should do this in all cases. Profile first, see if you can avoid the large objects, and then look into deferred de-allocations ... if the timing of resource cleanup meets your application's guarantees.
I recall Firefox preventing cleanup code from running when you quit a few years ago. Prior to that, quitting with a lot of pages open (ie hundreds) could cause it to lock up for quite some time.
It is also the approach taken by C++/WinRT, COM and UWP components get moved into a background cleaning thread, to avoid application pauses on complex data structures reaching zero count.
Rust is based on ownership and statically checked move semantics (by default though can be opted out). So each item has a single owner (which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear) and the compiler will prevent you from using a moved object (unlike C++).
Separately it has a smart pointer which is the dual of unique_ptr (Box), with the guarantee noted above:
let b = Box::new(1);
drop(b);
println!("{}", b);
will not compile because the second line moves the box, after which it can't be used because it's been removed entirely from this scope.To be fair, so do 90+% of programmers. Much of rust's benefit in safe code is training programmers to avoid code like that where possible, and spreading design patterns that avoid it.
If you know that an object will live for the rest of the program and not need any finalization logic, Rust allows you to "leak" it and save that overhead on shutdown.
And it can even be worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that memory can also be a limited resource).
Granted the way the type system work you usually know the type of a variable quite well, but could this happen with opaque types?
I'm very much out of my depth, but it felt like one of those things that could really bite you if you are unaware, as happened with finalizers in Java decades ago.
If you're the one creating the structure, you could opt it out of Send, that'd make it… not sendable. So it wouldn't be able to cross thread-boundaries. For instance Rc is !Send, you simply can not send it across a thread-boundary (because it's a non-threadsafe reference-counting handle).
If you don't control the type, then you'd have to wrap it (newtype pattern) or remember to manually mem::drop it. The latter would obviously have no safety whatsoever, the former you might be able to lint for I guess, though even that is limited or complicated (because of type inference the problematic type might never get explicitly mentioned).
For the more general problem you have can also dedicate more threads to the task or apply backpressure.
https://doc.rust-lang.org/book/ch04-01-what-is-ownership.htm...
Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.
Regardless of memset and optimizations, consider a particularly complicated object which lives on the heap and contains hundreds of other nested objects (which themselves contain nested objects, etc). Now imagine that a significant fraction of them make use of RAII. That cleanup code can't be elided.
That being said, it's a pretty bad example if they were actually profiling the debug build ...
I'm not seeing that on my local machine? Were you comparing on the Playground which would be quite variable in its results?
> cargo build
Compiling foo v0.1.0 (/private/tmp/foo)
Finished dev [unoptimized + debuginfo] target(s) in 0.42s
> ./target/debug/foo
drop in another thread 52.121µs
drop in this thread 514.687233ms
>
>
> cargo build --release
Compiling foo v0.1.0 (/private/tmp/foo)
Finished release [optimized] target(s) in 0.47s
> ./target/release/foo
drop in another thread 48.418µs
drop in this thread 548.005373msThis is the most important point in the thread, since it invalidates the results for most purposes.
Interestingly, the drop function is actually user-creatable. It’s actually an empty function with a very permissive Non-reference argument. The semantics of ownership in Rust makes that sufficient to trigger memory cleanup.
Is there any profiler that does this today?
What are the drawbacks with asynchronous drops?
In particular, types may not be sendable to other threads, or may have side effects on dropping, and in those cases you would need to rearchitect the code before you can apply this technique.
Also this technique adds overhead, so it should never be used (including not doing it conditionally) if you don't care about latency or if the objects are always small, and the compiler cannot know whether that is the case.
Generally you'd only pass ownership when that's needed for some reason. So this toy example might not be realistic but it does demonstrate the performance impact.
I'm hoping this is down to developer naivety rather than being a feature of rust.
2) but somewhere somehow this object will deallocate, so his trick of putting it to another thread would work if the deal location takes awhile. Same for cpp if you have a massive object in a unique ptr. So it’s not a rust issue
there is no copy happening here
I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to reduce these with partial updates but can't always do it). Moving that cleanup to a thread would make "watch" more responsive.
The issue from the article would be solved by just passing a reference to the variable.
In your case, cleanup is an action that needs to be done before writing new files. So you have to wait for cleanup anyway, don't you ?
Typically any server with a watch functionality will have a mutable reference to the data that's being watched. When you change that data out you're both changing the mutable reference, and also deallocating any memory that was previously used. One could separate these two steps, moving the watched data to another variable that's dropped in another thread, if you wanted.
fn get_len1(things: HeavyThings) -> usize {
things.len()
}
fn get_len2(things: HeavyThings) -> usize {
let len = things.len();
thread::spawn(move || drop(things));
len
}
The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.See also this comment by chowells: https://news.ycombinator.com/item?id=23362925
Also this isn't rust specific. Most (all?) RAII languages are affected and many GC approaches have this effect, too. Some do add additional abstraction to magically always or sometimes put the de-allocation into another thread.
But de-allocating in another thread is not generally good or bad. There are a lot of use-cases where doing so is rather bad or can't be done (in case TLS is involved). Rust and other similar RAII languages at least let you decide what you want to do.
Now it's (I think) generally known that certain kinds (not all) of GC do make some thinks simpler for GUI-like usage. Through they also tend to have less control.
Note that it's a common pattern for small user CLI facing tools (which are not GC'ed) to leak resources instead of cleaning them up properly. You can do so in rust too if you want but it's a potential problem for longer running applications.
Also here is a faster get `get_len` then both which is also more idiomatic rust then both:
``` fn get_len1(things: &HeavyThings) -> usize { things.len() } ```
If you have a certain thread (e.g. UI thread) in which you never want to do any cleanup work you can consider using a container like:
``` struct DropElsewhere<T: Send>(pub Option<T>); impl<T: Send> Drop for DropElsewhere<T> { fn drop(&mut self) { if let Some(value) = self.take() { thread::spawn(move || drop(value)); } } } ```
You can optimize this with `ManualDrop` to have close to zero-runtime overhead (removes the `take` and `if let` part).
Yeah, you're right. In hindsight my comment was poorly thought-out and poorly written.
It's also definitely a zero-cost abstraction as I can see because the manual solution that's equivalent to get_len1() would be to essentially call free() on things. That would ultimately suffer from the same problem.
If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.
Here's a link to a Rust Playground with just those two characters changed: https://play.rust-lang.org/?version=stable&mode=debug&editio...
Note that the code that drops the data in a separate thread is not timing the amount of time your CPU is spinning, dropping the data. So while this does decrease the latency of the original thread, the best solution is to avoid copying and then freeing large, complex objects as much as possible. While it is of course necessary to do this sometimes, this particular example is just not one of them. :)
As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and eliminating all the copying and dropping; this would seem to be a classic case where compiler analysis should be able to determine that `a.size()` should be computable without copying `a`, and it should be able to eliminate the function call cost as well. Manually doing this gives the exact same timing as my gist above, so I assume that this is happening when passing a reference, but not happening when passing the object itself.
All you did was move the drop from the `fn_that_drops_heavy_things` to the end of `main`, where it is outside the timing function.
Untrue. Rust uses move semantics (or shallow copys for types that implement the Copy trait via e.g. memcpy - no you can't customize this!) Deep copies require explicitly calling methods like ".clone()". So HashMap's pointers and sizes do get memcpyed... 56 bytes on the 64-bit playground currently.
This is similar to how std::move(...)ing a std::unordered_map in C++ nulls out the old object and just copies the pointers of the container - not a deep copy of the subobjects - which in similar C++ code would turn the main thread's destructor into a noop.
The main difference from C++ is: Rust handles this at the language level instead, and doesn't call the dtor at all on the main thread at all if the value was moved. No need for manual movement logic - it is the default, for everything. Also unlike C++, it also prohibits you from using the old moved-from object at compile time, preventing bugs.
It maybe a net win if this is the UI thread of a desktop app, but overall, it will come at a performance cost: because modern allocators have thread-local memory pools, and now you're moving away from it. And if you're running you code on a NUMA system (most server nowadays), when moving from one thread to another, you can end up freeing non-local memory instead of local one. Also, you won't have any backpressure on your allocations, and you are susceptible to run out of memory (especially because your deallocations now occur more slowly than they should)
Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.
I think if you wanted to do deferred destruction right, ideally you’d mod an allocator to have functions like (alloc_local, alloc_global, free_now, free_deferred) to avoid exhausting memory. Traits could make this ergonomic.
Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.
I think the point is that, if the same thread is doing both allocation and de-allocation, the thread is naturally prevented from allocating too much by the work it must do to de-allocate. If you move the de-allocation to another thread, your first thread may now be allocating like crazy, and the de-allocation thread may not be able to keep up.
In a real GC system, this is not that much of a problem, as the allocator and de-allocator can work with each other (if the allocator can't allocate any more memory, it will generally pause until the de-allocator can provide more memory before failing). But in this naive implementation, the allocator thread can exhaust all available memory and fail, even though there are a lot of objects waiting in the de-allocation queue.
most gc based environments use dedicated threads for gc and finalizers, this is one reason to do so
edit: to be more specific:
your normal flow is to alloc at the top of your function, and at the bottom you dealloc. so in basically every case you are paying the cost of deallocs, but if the alloc is conditional the dealloc is now also conditional which is more branches to predict. the dealloc is probably also handled by functions so you have jumps/calls eating up branch prediction table space
in the gc/offloaded dealloc scenario, your deallocs on the work thread are no longer conditional because you're just handing addresses off to the gc. if your gc is STW you've added 'if (stop_requested) stop()' branches throughout your workload, but those are effectively 0-cost because stop_requested is always false (when it's true, the cost of the mispredict has no significance because your thread is about to suspend). the gc thread is always doing the same thing or waiting, and again when it's about to wait a branch mispredict cost has no significance.
This situation you describe sounds a lot like dealing with garbage-collection cycles, so you give a good recommendation on something to watch out for, as rust performing at the level of a GC'd language removes a big reason for choosing rust.
They all have one thing in common: pampering over a bad design.
In the particular example given, the sub-vector probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For example of such a design to hold a large array of text strings, see for example this blog entry and its associated github repo:
https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
https://github.com/pierrebai/FastTextContainer
Roughly it is this: struct TextHolder
{
const char* common_buffer;
std::vector<const char*> internal_pointers;
};
This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.If important work must be done in the destructors, it is still better to farm the work out to a thread pool, rather than starting another thread. Again, C++ supports this in its Standard Library, as I think Rust does too.
One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.
I don't know what the situation is today, but in the past, the GCC standard library containers had non-trivial destructors when running in debug mode. Ensuring their proper invocation was required to avoid dangling pointers in their book keeping. Non-obvious and painful to debug.
Something like this: https://play.rust-lang.org/?version=stable&mode=debug&editio...
You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.
https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...
And yes, spawning a thread for every drop is horrible. It's just to prove the concept. The defer_drop crate uses a global worker thread.
This means that if you do a "drop in other thread" and then main exists, the drop might never run. Which is often fine as the exit of main causes process termination and as such will free the memory normally anyway.
But it would be a problem one some systems where memory cleanup on process exit is less reliable. Through such systems are more rare by now I think.
I'm pretty sure Linux will always free process-private memory, and threads, and file descriptors when a process exits.
The only things that can leak in typical cases are some kinds of shared memory and maybe child processes?
As for the rust case, if you squint then it's similar to a concurrent collector.
The points raised in this article are really different:
* don't do slow stuff in your latency-critical path
* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)
* dropping can be slow
The first and second points are good, but not really related to rust, deallocations, or the number 10000.
The last point is worth discussing, but still not really related to the number 10000 and barely related to rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages would use a more deferred strategy (like many GCs).
It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations because they are all dead (unreachable by the root), and it just discards them. But that's not the full story either.
It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.
The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.
In rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there was a little better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
https://news.ycombinator.com/item?id=22336284
I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.
The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.
ESBuild is a really impressive performance-oriented project:
https://github.com/evanw/esbuild
The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch: https://github.com/rust-lang/rust/issues/34283
(copy of lobste.rs comment)
Rust tries to be zero cost while providing abstractions that make it seemingly a high level language but ultimately things like this show that it's not exactly zero cost because abstractions can incur hidden penalties. There needs to be some internal syntax that allows a rust user to explicitly control deallocation when needed.
If I started reading code where people would randomly move a value into another thread and essentially do nothing I would be extremely confused. Any language that begins to rely on "trick" or "hacks" as standard patterns exposes a design flaw.
Maybe if rust provided special syntax that a function can be decorated with so that it does deallocation in another thread automatically? Or maybe an internal function called drop_async...? This would make this pattern an explicit part of the language rather than a strange hack/trick.