Rust: Dropping heavy things in another thread can make your code 10000x faster

(abramov.io)

481 points · timooo · 6y ago · 273 comments

145 comments · 32 top-level

fpgaminer6y ago· 26 in thread

Some important things I think people should note before blindly commenting:

* The example code is obviously contrived. The real gist is that massive deallocations in the UI thread cause lag, which the example code proves. That very thing can easily happen in the real world.

* I didn't see any difference on my machine between a debug build and a release build.

* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.

* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.

* As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.
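
A minimal sketch of the technique being discussed, assuming a `Vec<Vec<u8>>` stands in for the heavy value. The names `HeavyThing`, `make_heavy`, `len_drop_here` and `len_drop_elsewhere` are made up for illustration, and this is not the article's exact code. Timings will vary by machine and allocator.

```rust
use std::thread;
use std::time::Instant;

// Hypothetical stand-in for the heavy value: many small heap allocations.
type HeavyThing = Vec<Vec<u8>>;

fn make_heavy(n: usize) -> HeavyThing {
    (0..n).map(|_| vec![0u8; 8]).collect()
}

/// Takes ownership, so `heavy` is dropped on the calling thread before returning.
fn len_drop_here(heavy: HeavyThing) -> usize {
    heavy.len()
}

/// Moves `heavy` into a new thread so the deallocations happen there.
fn len_drop_elsewhere(heavy: HeavyThing) -> usize {
    let len = heavy.len();
    // The move copies only the outer Vec's pointer, capacity and length.
    thread::spawn(move || drop(heavy));
    len
}

fn main() {
    let n = 1_000_000;
    let a = make_heavy(n);
    let b = make_heavy(n);

    let t = Instant::now();
    len_drop_here(a);
    println!("drop in this thread    {:?}", t.elapsed());

    let t = Instant::now();
    len_drop_elsewhere(b);
    println!("drop in another thread {:?}", t.elapsed());
}
```

The second call still pays for spawning a thread, so this only helps when the drop is expensive relative to that cost.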

ckcheng6y ago

>As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.

This reminds me of the exploding ultimate GC technique [1]:

> on-board software for a missile...chief software engineer said "Of course it leaks". ... They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.

[1]: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...

all-fakes6y ago

Yes, Java also has a garbage collector that does nothing, called the Epsilon GC, intended for short-lived programs and as a reference for garbage collector benchmarks.[0]

[0]: https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...

inopinatus6y ago

The bumper sticker is “Ada programmers prove the algorithm terminates.”

blub6y ago

I guess the bigger question is why they use dynamic memory allocation in the first place.

rubber_duck6y ago

And then someone tries to use your compiler as a service (code analysis, change triggered compiler) and it's a dead end

qzw6y ago

Well, then that’s not the original use case anymore, and it’ll have to be re-engineered. In the meantime it may have been used for years and the perf difference may have saved many developer-years collectively across its user base. Surely you’re not suggesting that the compiler developers should be prematurely optimizing for future use cases that they may not even have envisioned.

ycombobreaker6y ago

In a world where processes can fork-and-exec, nothing about "as a service" changes that. The compiler would just be reinvoked as needed. Converting it into a persistent process breaks a lot more than just allocation optimizations.

yrro6y ago

As distasteful as leaky code is, is it that bad to run it in a separate process? You get a bit more robustness against crashes as well.

hinkley6y ago

One of my “favorite” snags in perf analysis is that periodicity in allocations can misattribute the cost of allocations to the wrong function.

If I allocate just enough memory, but not too much, then pauses for defragmentation of free space may be costed to the code that calls me.

A solution to this that I’ve seen in soft real time systems is to amortize cleanups across all allocations. Every allocation performs n steps of a cleanup process prior to receiving a block of memory. In which case most of the bad actors have to pay part of the cost of memory overhead.

It might be good for Rust to try something in that general realm, or something on the cleanup side, which may be easier to tack on: on free, set a ceiling for operations and queue what is left. That would at least shave the peaks.

rini176y ago

Doing unrelated cleanup sounds like flushing CPU cache per every allocation.

papaf6y ago

This deallocation trick is neat but in C and C++ you could use a memory pool to do this.

In theory, you could also use a memory pool in Rust but I think the standard library uses malloc without some way of overriding this behaviour.

vvanders6y ago

You can also use the typed-arena crate[0] or roll your own if you're feeling like cracking open unsafe.

[0] https://crates.io/crates/typed-arena

orf6y ago

You can change the global allocator in any Rust project. You can write your own easily enough, or use one like jemalloc.

wongarsu6y ago

In a toy raytracer I once wrote in C++, switching from malloc to custom memory pools for small fixed-size objects was a big performance boost. Making free() a noop was another big performance boost, both for deallocation and allocation. Turns out sequentially handing out memory from a big chunk of memory is much easier than keeping track of and reusing empty slots, and it keeps sequentially allocated objects in sequential memory locations.

estebank6y ago

In Rust you could just call `mem::forget` on whatever heavy thing you're no longer using before it would get dropped, but then the programmer is effectively responsible for that memory leak not becoming a problematic leak during refactors.

Edit: this will also break any code that relies on Drop being called for clean up, but that is already a "suspect"/incorrect pattern because there are no assurances that it will ever run.
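
A small sketch of what `mem::forget` does to destructors, assuming a made-up `Tracked` type whose `Drop` impl only records that it ran. Nothing here comes from the article. It just shows that the forgotten value's cleanup never happens.

```rust
use std::cell::Cell;
use std::mem;

// Hypothetical type whose Drop impl records that it ran.
struct Tracked<'a>(&'a Cell<bool>);

impl Drop for Tracked<'_> {
    fn drop(&mut self) {
        self.0.set(true);
    }
}

fn main() {
    let ran = Cell::new(false);

    // The destructor is skipped; any heap memory the value owned would leak.
    mem::forget(Tracked(&ran));
    assert!(!ran.get());

    // The normal path runs the destructor.
    drop(Tracked(&ran));
    assert!(ran.get());
}
```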

projektfu6y ago

Yeah, I like Apple’s (NeXT’s) approach of pool allocation for each run through the event loop. Defer dealloc, drop pool at the end.

CoolGuySteve6y ago

Even just keeping a free list and deallocating its elements at an idle time is probably cheaper and faster than spawning a thread.

jomohke6y ago

You can easily use a custom global allocator in Rust:

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;
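
To make the two lines above concrete, here is a sketch of one possible allocator, assuming it simply forwards to the system allocator and counts `dealloc` calls. `CountingAllocator` and `FREES` are hypothetical names. A real pool or arena allocator would replace the forwarding calls with its own logic.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical allocator: forwards to the system allocator and counts frees.
struct CountingAllocator;

static FREES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        FREES.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

fn main() {
    // 1000 inner buffers plus the outer one.
    let heavy: Vec<Vec<u8>> = (0..1000).map(|_| vec![0u8; 8]).collect();
    let before = FREES.load(Ordering::Relaxed);
    drop(heavy);
    let after = FREES.load(Ordering::Relaxed);
    println!("dealloc calls while dropping: {}", after - before);
}
```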

pjmlp6y ago

In C++/WinRT the same approach is taken, because you cannot just use a memory pool for COM.

indemnity6y ago

Isn’t a Rust “move” implemented as a bitwise copy (e.g. memcpy call)? I see people claiming move has no cost but I’m not sure that is true.

justinpombrio6y ago

EDIT: See kevincox's reply. Rust will bitwise copy the containing type, which is typically very cheap. For example, if you move a String, it will copy the String struct, which contains a couple of pointers and a length (or something along those lines). Importantly, it will not copy the underlying char array.

I was thinking of the following code, where I believe the assignment to y is actually free. Though apparently this isn't called a "move".

    let x = ['a'; 1000]; // a large owned value, here of type [char; 1000]
    let y = x;

More info: https://doc.rust-lang.org/rust-by-example/scope/move.html

dathinab6y ago

What is bit-wise copied is the pointer to the memory.

I.e. a `HashMap` struct, or `Vec` struct don't directly contain the data.

For example the `Vec` is defined internally as something similar to:

`struct Vec<T> { data: *mut [T], capacity: usize, len: usize, marker: PhantomData<T> }`

(Slightly simplified, not actual Vec type).

So a move of a Vec copies at most 3 usize (24 bytes on 64-bit systems); similar things apply for a HashMap.

Additionally the copy can often be elided through compiler optimizations.

As an interesting side note, a new empty Vec/HashMap will not actually allocate any memory; it only starts doing so once elements get added. This is why the example creates a vec of vecs of length 1. Otherwise it wouldn't need to do "number of elements" free calls.
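
The points above can be checked with a short sketch. It assumes a current rustc, where a `Vec` handle happens to be three machine words. That layout is an implementation detail rather than a language guarantee.

```rust
use std::mem::size_of;

fn main() {
    // On current compilers the Vec handle is three machine words.
    println!("size_of::<Vec<u8>>()   = {}", size_of::<Vec<u8>>());
    println!("3 * size_of::<usize>() = {}", 3 * size_of::<usize>());

    // An empty Vec has not allocated yet.
    let empty: Vec<u64> = Vec::new();
    assert_eq!(empty.capacity(), 0);

    // Moving copies the handle, not the heap buffer: the data pointer is unchanged.
    let v = vec![1u64; 1_000_000];
    let buf = v.as_ptr();
    let moved = v;
    assert_eq!(moved.as_ptr(), buf);
}
```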

steveklabnik6y ago

Semantically, it is a bitwise copy, yes.

However, these copies can often be elided by optimizations.

ndesaulniers6y ago

> As an aside, compilers have used the trick of not free-ing data structures before

In Clang, this flag is `-Xclang -disable-free`. Not from a Jedi...

GuB-426y ago

I don't know if Rust can do it (unsafe?) but in C and C++, I sometimes end up writing a custom allocator. It is often one of the most significant optimizations.

For example, I had the "million strings" problem once, literally millions. The solution was to put every string into a single large buffer and the pointers in another buffer. Not only could I deallocate everything at once, but I also saved a bit of RAM by not aligning (not needed for strings).
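
The same idea can be written in safe Rust. This is a sketch under the assumption that strings are only appended and never removed. `StringPool` and its methods are hypothetical names, and it stores end offsets instead of raw pointers.

```rust
/// Hypothetical pool: all string bytes live in one buffer, indexed by end offsets.
struct StringPool {
    bytes: String,
    ends: Vec<usize>,
}

impl StringPool {
    fn new() -> Self {
        StringPool { bytes: String::new(), ends: Vec::new() }
    }

    /// Appends `s` and returns its index in the pool.
    fn push(&mut self, s: &str) -> usize {
        self.bytes.push_str(s);
        self.ends.push(self.bytes.len());
        self.ends.len() - 1
    }

    /// Returns the string stored at index `i` (panics if out of range).
    fn get(&self, i: usize) -> &str {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.bytes[start..self.ends[i]]
    }

    fn len(&self) -> usize {
        self.ends.len()
    }
}

fn main() {
    let mut pool = StringPool::new();
    for i in 0..1_000_000 {
        pool.push(&i.to_string());
    }
    println!("{} strings, last = {}", pool.len(), pool.get(pool.len() - 1));
    // Dropping the pool frees two buffers, however many strings it holds.
    drop(pool);
}
```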

jfkebwjsbx6y ago

> That very thing can easily happen in the real world.

Only if badly designed. That is why it is contrived!

> While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon

A program dealing with a million elements of any kind should not be performing naive allocations to begin with.

> we do deallocation trickery in the real world

Skipping deallocations is an optimization, not a design pattern.

In other words, the code needs to keep the ability to perform the deallocation for debugging, testing, usage as a library, etc.

chowells6y ago· 16 in thread

This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.

There's actually a lot to be said for Rust here: the ownership system lets you transfer freeing responsibility off-thread safely and cheaply in order to not have it block the critical path.

But overall, there's nothing really unexpected here, if you're familiar with memory management.

Jasper_6y ago

One of my favorite papers by Bacon et al expands on this intuition that garbage collection and reference counting are opposite tradeoffs in many respects, and gives a formal theory for it. My views on gc/rc haven't been the same since.

http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

pron6y ago

That's a great paper, but one important thing to point out is that some production-grade tracing GCs are on the sophisticated end of that paper, while almost all reference counting GCs are on the rather primitive end. Given the same amount of effort, it's easier to get a reasonable result with reference-counting, but there are industrial-strength tracing GCs out there that have had a lot of effort put into them.

jeffdavis6y ago

"Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant."

Takes time proportional to the live set times the number of GC runs that happen while the objects are alive. In other words, the longer the objects live, the more GC runs have to scan that object (assuming there is enough activity to trigger the GC), and the worse GC looks.

titzer6y ago

This is most decidedly not true for generational GCs, and for concurrent GCs, the tracing work happens asynchronously and in parallel, on other cores, not taking time on the main thread.

Reelin6y ago

> the ownership system lets you transfer freeing responsibility off-thread safely and cheaply in order to not have it block the critical path

This can also trivially be done in other languages. Atomically append your pointer to a queue of "large things that need to be freed" and move on as though you had actually called free.

Within a particularly time sensitive loop you can even opt to place pointers into a preallocated array locally. Then once per loop iteration swap that array with the thread handling the deallocations for you. It eats up a bit of CPU time but can significantly reduce latency.

jacobparker6y ago

OP said safely; what you're describing isn't safe in, say, C++ in the same sense that it is in Rust.

im3w1l6y ago

A lot of C++ code depends on deallocation order for correctness. Like a destructor may want to say bye-bye to a pointed-to-object, and if you reverse order of deallocation, that pointer may be dangling.

Consider this code

    {
        Window a;
        ClickHandler* b = new ClickHandler(&a);
        delete b;
    }

Let's say b tries to deregister itself when it's deleted. This code will work as written. But if you defer the deletion of b, then stack allocated Window a may already be gone.

loufe6y ago

I've not worked with any language thus far without automatic garbage collecting, so this was definitely a neat read for me. It sounds rather elegant.

burpsnard6y ago

It's worth popping the hood and getting your fingers dirty. C was written in an era where memory was a scarce and precious resource to be grudgingly used if absolutely necessary

arcticbull6y ago

> This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

That doesn't seem to make intuitive sense. A GC has the same problem.

A garbage collector has to traverse the data structure in a similar way to determine whether it (and its embedded keys and values) are part of the live set or not, and to invoke finalizers. You're beginning your comparison after the mark step, which isn't a fair assessment since what Rust is doing is akin to both the mark and sweep phases.

The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

The difference between a GC and this kind of memory management is that the GC does the traversal later, at some point, non-deterministically. Rust allows you to decide between deallocating it in place, immediately, or deferring it to a different thread.

chowells6y ago

I said generational/compacting collector. You're talking about a mark and sweep collector.

A generational/compacting collector traverses pointers from the live roots, and copies everything it finds to the start of its memory space, and then declares the rest unused. If there is 1GB of unused memory, it's irrelevant. Only the things that can be reached are even examined.

As I said, this has the opposite problem. When the live set becomes huge, this can drag performance. When the live set is small, it doesn't matter how much garbage it produces, performance is fast.

pron6y ago

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and its embedded keys and values) are part of the live set or not

Yes, but in practice tracing in a tracing GC is done concurrently and with the help of GC barriers that don't require synchronization and so are generally cheaper than the common mechanisms for reference-counting GC.

> and to invoke finalizers

As others have said, finalizers are very uncommon and, in fact, have been deprecated in Java.

Reelin6y ago

> The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

Isn't that incompatible with RAII though?

tsimionescu6y ago

> That doesn't seem to make intuitive sense. A GC has the same problem.

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and its embedded keys and values) are part of the live set or not, and to invoke finalizers.

All garbage collectors start from live objects and only scan those. Then, whatever objects they have not scanned get collected. In copying collectors (like most generational ones), this means that garbage is never touched.

In the mark-and-sweep algorithms, the mark phase still never touches the unreachable objects. However, the sweep phase does need to return those objects to the free list, so it will have to walk them. It will still not do it the same way as malloc/free, as it can walk the heap in order and free unmarked objects as it encounters them, no need to follow pointers, so it may still have better cache performance.

Finalizers introduce extra difficulty, but the behavior is still fundamentally different. Objects which have finalizers are usually remembered in a special list which acts as a GC root itself. When they are only reachable from that list, they get marked so that the finalizer will run (usually on a special Finalizer thread). When the finalizer is finished, and assuming the object was not resurrected, they get removed from the Finalizer list, and now they are not reachable from anywhere at all, so the next GC will finally clean them up. Usually, there is also some API for user code to mark a finalizable object as 'finalized', which essentially removes it from the Finalizer list early and allows it to be collected as normal, without going through the above process.

And yes, having a large number of finalizable objects in your memory is usually considered a very bad idea. Generally, they are only recommended as a fail-safe measure: you are supposed to do explicit cleaning, but as a fail-safe, to avoid your program crashing in production if a connection or file leak was missed, you also have the Finalizer to throw buckets of water out of your boat (but you should really notice that it is happening and plug that leak, rather than relying on the bucketeer).

mcguire6y ago

Finalizers/destructors do not work well in garbage collected languages, for that very reason.

saagarjha6y ago

Usually in a background thread ;)

saagarjha6y ago· 9 in thread

Why would you ever write a get_size function that drops the object you call it on? Surely in an actual, non-contrived usecase spawning another thread and letting the drop occur there would just be plain worse?

Reelin6y ago

I think the contrived use case is just for illustrative purposes? If I'm understanding correctly, the combination of cleanup code and deallocation can sometimes consume enough time that it's worth dispatching it on another thread. That's hardly specific to Rust though.

As you note that will certainly add some overhead, although that could be minimized by not spawning a fresh thread each time. It could easily reduce latency for a thread the UI is waiting on in many cases.

tedunangst6y ago

It would be helpful to see an example from a real application, too.

epage6y ago

I believe this is contrived to prove a point.

And this isn't just a help in these contrived examples. I believe process cleanup (an extreme case of cleaning up objects) is one of the cases where garbage collection performs better because it doesn't have to unwind the stack, call cleanup functions that are not in the cache, and make a lot of `free` calls to the allocator.

I vaguely remember reading about Google killing processes rather than having them clean up correctly, relying on the OS to properly clean up any resources of significance.

Now this doesn't mean you should do this in all cases. Profile first, see if you can avoid the large objects, and then look into deferred de-allocations ... if the timing of resource cleanup meets your application's guarantees.

seventh-chord6y ago

Killing a process without freeing all allocations is, as far as I can tell, routine in C. Especially for memory, it makes no sense to "free" allocations when the whole memory space is getting scrapped anyway. Of course, once you add RAII the compiler can't reason about which destructors it can skip on program exit, and if programmers are negligent about this you get programs that are slow to close.

Reelin6y ago

> killing processes rather than having them clean up correctly, relying on the OS

I recall Firefox preventing cleanup code from running when you quit a few years ago. Prior to that, quitting with a lot of pages open (ie hundreds) could cause it to lock up for quite some time.

pjmlp6y ago

Not at all, Herb Sutter has a CppCon talk about this kind of optimisations.

It is also the approach taken by C++/WinRT, COM and UWP components get moved into a background cleaning thread, to avoid application pauses on complex data structures reaching zero count.

nickm126y ago

I took this to be a contrived example to illustrate the point. I could imagine a process that creates a big data structure (e.g. parse an xml file), pulls some data out, and then drops the data structure. If you want to use that data sooner, you can push the cleanup off your thread.

ashtonkem6y ago

It’s a contrived example to demonstrate the technique.

Areading3146y ago

Right there is no reason to pass ownership to a function like this.

Ididntdothis6y ago· 8 in thread

I used to do this sometimes with C++ when I realized that clearing out a vector with lots of objects was slow. Is Rust basically based on unique_ptr? One problem with this approach was that you still had to wait for these threads when the application would shut down.

masklinn6y ago

> Is Rust basically based on unique_ptr?

Rust is based on ownership and statically checked move semantics (moves are the default, though a type can opt out by implementing Copy). So each item has a single owner (which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear) and the compiler will prevent you from using a moved object (unlike C++).

Separately it has a smart pointer which is the dual of unique_ptr (Box), with the guarantee noted above:

    let b = Box::new(1);
    drop(b);            // `b` is moved into drop() here
    println!("{}", b);  // error[E0382]: borrow of moved value: `b`

will not compile because the second line moves the box, after which it can't be used because it's been removed entirely from this scope.

wnoise6y ago

> which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear

To be fair, so do 90+% of programmers. Much of rust's benefit in safe code is training programmers to avoid code like that where possible, and spreading design patterns that avoid it.

saagarjha6y ago

Rust basically gives the compiler understanding of unique_ptr and prevents you from using it after you’ve moved it.

Ididntdothis6y ago

Would you have to keep track of these threads in Rust? I have done a lot of desktop development where you have to be aware of what happens during shutdown. Seems a lot of server guys write their code under the assumption that it will never shut down.

zozbot2346y ago

> One problem with this approach was that you still had to wait for these threads when the application would shut down.

If you know that an object will live for the rest of the program and not need any finalization logic, Rust allows you to "leak" it and save that overhead on shutdown.

ordu6y ago

You could have just one thread and kill it at exit. Do not start a new thread for each closure that drops an object; send the closures to one special thread instead.
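
One way to sketch the single-thread idea is with a channel. This assumes values are boxed and sent directly instead of being wrapped in closures. `Dropper`, `defer_drop` and `shutdown` are hypothetical names, not an existing API.

```rust
use std::any::Any;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical helper: one long-lived thread that drops whatever it is sent.
struct Dropper {
    tx: Sender<Box<dyn Any + Send>>,
    handle: JoinHandle<usize>,
}

impl Dropper {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<Box<dyn Any + Send>>();
        let handle = thread::spawn(move || {
            let mut dropped = 0usize;
            // The loop ends once every Sender has been dropped.
            for garbage in rx {
                drop(garbage); // the expensive deallocation happens on this thread
                dropped += 1;
            }
            dropped
        });
        Dropper { tx, handle }
    }

    /// Hands `value` to the dropper thread. Boxing costs one extra allocation.
    fn defer_drop<T: Send + 'static>(&self, value: T) {
        // If the thread is gone, the value comes back in the error and is dropped here.
        let _ = self.tx.send(Box::new(value));
    }

    /// Closes the channel, waits for pending drops, and returns how many values were released.
    fn shutdown(self) -> usize {
        drop(self.tx);
        self.handle.join().expect("dropper thread panicked")
    }
}

fn main() {
    let dropper = Dropper::new();
    dropper.defer_drop(vec![vec![0u8; 8]; 100_000]);
    dropper.defer_drop(String::from("also garbage"));
    let released = dropper.shutdown();
    println!("dropper thread released {} values", released);
}
```

Calling `shutdown` waits for pending drops. Skipping it and letting the process exit instead matches the "kill it at exit" suggestion.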

qcoh6y ago

Out of curiosity, how did you do that in C++?

Ididntdothis6y ago

It depends. Either iterate over the vector and delete the objects or just call clear(). Obviously you have to be sure that nobody else is accessing it at the same time.

cesarb6y ago· 6 in thread

Just be careful, because moving heavy things to be dropped to another thread can change the semantics of the program. For instance, consider what happens if within that heavy thing you had a BufWriter: unless its buffer is empty, dropping it writes the buffer, so now your file is being written and closed in a random moment in the future, instead of being guaranteed to have been sent to the kernel and closed when the function returns.

And it can even be worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that memory can also be a limited resource).
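
A sketch of one way to keep the write on the calling thread while still deferring the memory cleanup, assuming the heavy value owns a `BufWriter`. `HeavyThing`, its fields and `finish` are hypothetical names. The explicit `flush` is the point of the example.

```rust
use std::io::{self, BufWriter, Write};
use std::thread;

// Hypothetical heavy value that owns both memory and a buffered writer.
struct HeavyThing<W: Write> {
    data: Vec<Vec<u8>>,
    log: BufWriter<W>,
}

/// Flushes on the calling thread, then defers only the deallocation.
fn finish<W: Write + Send + 'static>(mut heavy: HeavyThing<W>) -> io::Result<usize> {
    let len = heavy.data.len();
    // The side effect happens now, and a write error reaches the caller.
    heavy.log.flush()?;
    // Only memory is left to release; the BufWriter's buffer is already empty.
    thread::spawn(move || drop(heavy));
    Ok(len)
}

fn main() -> io::Result<()> {
    let mut heavy = HeavyThing {
        data: vec![vec![0u8; 8]; 100_000],
        log: BufWriter::new(io::sink()), // stand-in for a real file
    };
    heavy.log.write_all(b"pending bytes")?;
    println!("len = {}", finish(heavy)?);
    Ok(())
}
```

This does not help with the limited-resource case: the underlying writer is still closed at some later point on the other thread.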

lostmyoldone6y ago

I only know very little Rust, but since it's generally a good practice to never defer writing (or other side effects) to an ambiguous future point in time - with memory allocations as the only plausible exception - is there any way in Rust to make sure one doesn't accidentally move complex objects with drop side-effects into other threads?

Granted, the way the type system works you usually know the type of a variable quite well, but could this happen with opaque types?

I'm very much out of my depth, but it felt like one of those things that could really bite you if you are unaware, as happened with finalizers in Java decades ago.

masklinn6y ago

> I only know very little Rust, but since it's generally a good practice to never defer writing (or other side effects) to an ambiguous future point in time - with memory allocations as the only plausible exception - is there any way in Rust to make sure one doesn't accidentally move complex objects with drop side-effects into other threads?

If you're the one creating the structure, you could opt it out of Send, that'd make it… not sendable. So it wouldn't be able to cross thread-boundaries. For instance Rc is !Send, you simply can not send it across a thread-boundary (because it's a non-threadsafe reference-counting handle).

If you don't control the type, then you'd have to wrap it (newtype pattern) or remember to manually mem::drop it. The latter would obviously have no safety whatsoever, the former you might be able to lint for I guess, though even that is limited or complicated (because of type inference the problematic type might never get explicitly mentioned).
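
On stable Rust, one common way to opt a type out of `Send` is a zero-sized `PhantomData` marker holding a raw pointer. The sketch below assumes that approach. `PinnedToThread` is a hypothetical name, and the rejected `spawn` is left commented out so the example still compiles.

```rust
use std::marker::PhantomData;

// Hypothetical type with drop side effects that should stay on its creating thread.
struct PinnedToThread {
    name: String,
    // Raw pointers are neither Send nor Sync, so this zero-sized marker
    // opts the whole struct out of both.
    _not_send: PhantomData<*const ()>,
}

impl PinnedToThread {
    fn new(name: &str) -> Self {
        PinnedToThread { name: name.to_string(), _not_send: PhantomData }
    }
}

impl Drop for PinnedToThread {
    fn drop(&mut self) {
        println!("cleaning up {} on the thread that owns it", self.name);
    }
}

fn main() {
    let p = PinnedToThread::new("log");
    // Uncommenting the next line is a compile error, because
    // `*const ()` cannot be sent between threads safely:
    // std::thread::spawn(move || drop(p));
    drop(p);
}
```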

the84726y ago

Considering that writing files can also block the process you probably don't want to have that in your latency-sensitive parts either, so you'll have to optimize that one way or another anyway.

For the more general problem you can also dedicate more threads to the task or apply backpressure.

ablu6y ago

A while ago I stumbled over a proposal to move a shared pointer (this was C++ code) to a thread in order to trigger the freeing of a legacy data structure there (the multi-thousand delete calls caused the watchdog of the main thread to fail). However, keeping the shared pointer reference in the main thread for too long resulted in the possibility that the "clean-up" thread ran while the main thread still had a hold on the shared pointer... Resulting in a low chance of the "clean-up" thread doing nothing and the main thread still locking up. People here got taught to use shared pointers to prevent memory management errors, but it can really cause a lot of unexpected non-determinism when used blindly.

dirtydroog6y ago

shared_ptr all the things? If so, they may as well write in Java.

usefulcat6y ago

It seems like the caller should ensure that the buffer is written before giving away ownership. Also, what happens if there is an error writing during finalization/destruction/etc? Seems like you'd want to find out about such errors earlier if at all possible.

cperciva6y ago· 6 in thread

If freeing the data structure in question takes this long, how much time are you wasting duplicating the data structure?

bszupnick6y ago

This code doesn't duplicate it. In Rust, when a variable is passed as an argument to a function, its "ownership" moves into the scope of that function.

https://doc.rust-lang.org/book/ch04-01-what-is-ownership.htm...

cperciva6y ago

You're missing my point. Unless the only thing you want to do with your giant data structure is measure its size, you're not going to be passing ownership of your only copy of it into the get_size function. You're going to be passing in a copy -- hence the cost of duplicating everything.

saagarjha6y ago

I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.

Reelin6y ago

> I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Regardless of memset and optimizations, consider a particularly complicated object which lives on the heap and contains hundreds of other nested objects (which themselves contain nested objects, etc). Now imagine that a significant fraction of them make use of RAII. That cleanup code can't be elided.

That being said, it's a pretty bad example if they were actually profiling the debug build ...

fpgaminer6y ago

> it seems like turning on optimizations seems to improve the situation quite a bit.

I'm not seeing that on my local machine? Were you comparing on the Playground which would be quite variable in its results?

    > cargo build
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished dev [unoptimized + debuginfo] target(s) in 0.42s
    > ./target/debug/foo
    drop in another thread 52.121µs
    drop in this thread 514.687233ms
    >
    >
    > cargo build --release
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished release [optimized] target(s) in 0.47s
    > ./target/release/foo
    drop in another thread 48.418µs
    drop in this thread 548.005373ms

firethief6y ago

> Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.

This is the most important point in the thread, since it invalidates the results for most purposes.

andrewfromx6y ago· 6 in thread

Hmm, my first thought is that having to do that is a lot like C and cleaning up my own allocations. This feels like something Rust should automatically do for me?

ashtonkem6y ago

Rust will automatically clean up data that has gone out of scope, but you can also trigger this manually with the “drop” function, which is only necessary if you want to clean up explicitly, such as in a different thread.

Interestingly, the drop function is actually user-creatable. It’s just an empty function with a very permissive non-reference argument. The semantics of ownership in Rust make that sufficient to trigger memory cleanup.
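
A sketch of that observation: `my_drop` below is a hypothetical user-written function with the same shape as the standard `drop`, and an `Rc` strong count is used to observe that the destructor ran.

```rust
use std::rc::Rc;

// Same shape as std::mem::drop: take ownership by value, do nothing.
fn my_drop<T>(_value: T) {}

fn main() {
    let a = Rc::new(vec![1, 2, 3]);
    let b = Rc::clone(&a);
    assert_eq!(Rc::strong_count(&a), 2);

    // `b` is moved in, and its destructor runs as the function returns.
    my_drop(b);
    assert_eq!(Rc::strong_count(&a), 1);
}
```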

ReaLNero6y ago

In C, if you forget to clean up, you have a memory leak which is hard to track down. In Rust, if you don't do this, you're not risking memory leaks, only losing performance. A profiler can tell you when you should drop asynchronously.

madmax966y ago

>A profiler can tell you when you should drop asynchronously

Is there any profiler that does this today?

What are the drawbacks with asynchronous drops?

klyrs6y ago

As I understand it, rust is automatically cleaning up, and that can cause glitchy timing. The clever hack is that rust lets you shunt that cleanup process off to another thread when you're the sole owner of that object. You can do the same thing in C, but unlike rust, the cost of cleanup is not hidden by the syntax.

klyrs6y ago

On further reflection... I'm curious about how allocators would handle this -- if you return from this context only to make another heavyweight object, it seems like you'd be trading glitchy timing with allocator contention.

devit6y ago

Because it's impossible to do this automatically in the general case.

In particular, types may not be sendable to other threads, or may have side effects on dropping, and in those cases you would need to rearchitect the code before you can apply this technique.

Also this technique adds overhead, so it should never be used (including not doing it conditionally) if you don't care about latency or if the objects are always small, and the compiler cannot know whether that is the case.

maxton6y ago· 5 in thread

I'm not very familiar with Rust, but I don't understand why you wouldn't just use a reference-to-HeavyThing as the function argument, so that the object isn't moved and then dropped in the `get_size` function?

ehsanu16y ago

If you never drop it, you have a memory leak. If the caller drops it, it's still the same as the `get_size` dropping it in terms of performance impact.

Generally you'd only pass ownership when that's needed for some reason. So this toy example might not be realistic but it does demonstrate the performance impact.

epage6y ago

For these contrived cases, yes, you would just pass a reference to the function but I think the point is to simplify the case down to demonstrate a point.

burpsnard6y ago

In the olden days, it was just out.flush(); out.close();

heavenlyblue6y ago

So the caller of the function still needs to free HeavyThing in the same thread.

Cyph0n6y ago

You’re spot on: this is simply a bad example that you would never see in a real application.

dirtydroog6y ago· 5 in thread

Oh my good god.

I'm hoping this is down to developer naivety rather than being a feature of rust.

wizzwizz46y ago

It's not a feature of Rust; it's a "feature" of the way we design operating systems and processors. This is the same in C.

ReactiveJelly6y ago

The same could happen in C++, I think. Destructors are supposed to be called recursively.

sockgrant6y ago

1) He should pass by reference to avoid the extra copy. So in his example, yes, it’s dev naivety.

2) But somewhere, somehow this object will be deallocated, so his trick of moving it to another thread would work if the deallocation takes a while. Same for C++ if you have a massive object in a unique_ptr. So it’s not a Rust issue.

renewiltord6y ago

Where's the extra copy? I don't see one. He's moving the struct into the function, getting size and then dropping it.

VWWHFSfQ6y ago

> avoid the extra copy

there is no copy happening here

epage6y ago· 4 in thread

For those wanting a real world example where this can be useful:

I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to reduce these with partial updates but can't always do it). Moving that cleanup to a thread would make "watch" more responsive.

elcomet6y ago

That's not really the same issue that is mentioned in the article though, is it?

The issue from the article would be solved by just passing a reference to the variable.

In your case, cleanup is an action that needs to be done before writing new files. So you have to wait for cleanup anyway, don't you?

ashtonkem6y ago

That's not true.

Typically any server with a watch functionality will have a mutable reference to the data that's being watched. When you change that data out you're both changing the mutable reference, and also deallocating any memory that was previously used. One could separate these two steps, moving the watched data to another variable that's dropped in another thread, if you wanted.

firethief6y ago

Why can't it cleanup right after the work?

pmontra6y ago

Or no cleanup at all. A CLI command that runs for a very short time can allocate memory to perform its job, print the result and exit. Then the OS releases all the memory of the process. No idea if Rust can work like this.
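
For what it's worth, Rust can work like this. A minimal sketch using `std::mem::forget`, which takes ownership of a value without running its destructor; the `Noisy` type and the counter are made up here only to show that no destructors run.

```rust
use std::mem;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Illustration-only type that counts how many times a destructor runs.
struct Noisy(Arc<AtomicUsize>);

impl Drop for Noisy {
    fn drop(&mut self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let drops = Arc::new(AtomicUsize::new(0));
    let heavy: Vec<Noisy> = (0..1_000).map(|_| Noisy(Arc::clone(&drops))).collect();
    println!("{} items", heavy.len());

    // Leak on purpose: no destructors run and nothing is freed here.
    // The OS reclaims the process memory when the program exits.
    mem::forget(heavy);
    assert_eq!(drops.load(Ordering::SeqCst), 0);
}
```

This is only reasonable for short-lived processes; in a long-running program the leaked memory never comes back.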

cs7026y ago· 4 in thread

In other words, Rust's automagical memory deallocation is NOT a zero-cost abstraction:

  fn get_len1(things: HeavyThings) -> usize {
      things.len()
  }

  fn get_len2(things: HeavyThings) -> usize {
      let len = things.len();
      thread::spawn(move || drop(things));
      len
  }

The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.

See also this comment by chowells: https://news.ycombinator.com/item?id=23362925

dathinab6y ago

No the zero-cost refers to the abstraction (and runtime cost), which still is zero-cost. Deallocating is part of the normal work load not the abstraction.

Also this isn't rust specific. Most (all?) RAII languages are affected and many GC approaches have this effect, too. Some do add additional abstraction to magically always or sometimes put the de-allocation into another thread.

But de-allocating in another thread is not generally good or bad. There are a lot of use-cases where doing so is rather bad or can't be done (in case TLS is involved). Rust and other similar RAII languages at least let you decide what you want to do.

Now it's (I think) generally known that certain kinds (not all) of GC do make some things simpler for GUI-like usage. Though they also tend to offer less control.

Note that it's a common pattern for small user CLI facing tools (which are not GC'ed) to leak resources instead of cleaning them up properly. You can do so in rust too if you want but it's a potential problem for longer running applications.

Also, here is a `get_len` that is faster than both and also more idiomatic Rust:

```
fn get_len1(things: &HeavyThings) -> usize {
    things.len()
}
```

If you have a certain thread (e.g. UI thread) in which you never want to do any cleanup work you can consider using a container like:

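```
use std::thread;

struct DropElsewhere<T: Send + 'static>(pub Option<T>);

impl<T: Send + 'static> Drop for DropElsewhere<T> {
    fn drop(&mut self) {
        // Take the value out so it can be moved to (and dropped on) another thread.
        if let Some(value) = self.0.take() {
            thread::spawn(move || drop(value));
        }
    }
}
```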

You can optimize this with `ManuallyDrop` to have close to zero runtime overhead (it removes the `take` and `if let` part).
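
A rough sketch of what that variant might look like, assuming `std::mem::ManuallyDrop` and its unsafe `take` function. The `new` constructor is an addition for this illustration, and spawning one thread per drop is kept only to mirror the version above.

```rust
use std::mem::ManuallyDrop;
use std::thread;

/// Wrapper whose contents are destroyed on a freshly spawned thread.
struct DropElsewhere<T: Send + 'static>(ManuallyDrop<T>);

impl<T: Send + 'static> DropElsewhere<T> {
    fn new(value: T) -> Self {
        DropElsewhere(ManuallyDrop::new(value))
    }
}

impl<T: Send + 'static> Drop for DropElsewhere<T> {
    fn drop(&mut self) {
        // SAFETY: `drop` runs at most once, and `self.0` is never used after `take`.
        let value = unsafe { ManuallyDrop::take(&mut self.0) };
        thread::spawn(move || drop(value));
    }
}

fn main() {
    let heavy: Vec<String> = (0..100_000).map(|i| i.to_string()).collect();
    let wrapped = DropElsewhere::new(heavy);
    println!("{} strings", wrapped.0.len());
    // `wrapped` goes out of scope here; the strings are freed on another thread,
    // unless the process exits first.
}
```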

cs7026y ago

> No the zero-cost refers to the abstraction (and runtime cost), which still is zero-cost. Deallocating is part of the normal work load not the abstraction.

Yeah, you're right. In hindsight my comment was poorly thought-out and poorly written.

DasIch6y ago

Nothing about how Rust handles deallocation is magical in any way.

It's also definitely a zero-cost abstraction as far as I can see, because the manual solution that's equivalent to get_len1() would be to essentially call free() on things. That would ultimately suffer from the same problem.

cs7026y ago

Yeah, you're right. In hindsight this was a poorly thought-out and poorly written post on my part.

staticfloat6y ago· 4 in thread

It seems that this would be a great reason to not pass the entire heavy object through your function, and to instead pass it as a reference. When passing an object (rather than a reference to an object) there's a lot more work going on both in function setup, and in object dropping. I'm not a rust guru, so I don't know the precise wording, but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.

Here's a link to a Rust Playground with just those two characters changed: https://play.rust-lang.org/?version=stable&mode=debug&editio...

Note that the code that drops the data in a separate thread is not timing the amount of time your CPU is spinning, dropping the data. So while this does decrease the latency of the original thread, the best solution is to avoid copying and then freeing large, complex objects as much as possible. While it is of course necessary to do this sometimes, this particular example is just not one of them. :)

As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and eliminating all the copying and dropping; this would seem to be a classic case where compiler analysis should be able to determine that `a.size()` should be computable without copying `a`, and it should be able to eliminate the function call cost as well. Manually doing this gives the exact same timing as my gist above, so I assume that this is happening when passing a reference, but not happening when passing the object itself.

heftig6y ago

As already mentioned, Rust wasn't copying anything; the `HashMap` is not a `Copy`-able type, so it was just moved around (it's also not very large: all its items are behind a pointer to the heap).

All you did was move the drop from the `fn_that_drops_heavy_things` to the end of `main`, where it is outside the timing function.

MaulingMonkey6y ago

> but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

Untrue. Rust uses move semantics (or shallow copies for types that implement the Copy trait via e.g. memcpy - no, you can't customize this!) Deep copies require explicitly calling methods like ".clone()". So HashMap's pointers and sizes do get memcpyed... 56 bytes on the 64-bit playground currently.

This is similar to how std::move(...)ing a std::unordered_map in C++ nulls out the old object and just copies the pointers of the container - not a deep copy of the subobjects - which in similar C++ code would turn the main thread's destructor into a noop.

The main difference from C++ is: Rust handles this at the language level instead, and doesn't call the dtor on the main thread at all if the value was moved. No need for manual movement logic - it is the default, for everything. Also unlike C++, it prohibits you from using the old moved-from object at compile time, preventing bugs.

fpgaminer6y ago

Rust isn't copying anything; everything in the original code would be a move.

heavenlyblue6y ago

If your function takes a reference to the object, something still needs to free it.

littlestymaar6y ago· 3 in thread

The title is slightly wrong: it's not going to make your code faster, it's going to reduce latency on the given thread.

It may be a net win if this is the UI thread of a desktop app, but overall it will come at a performance cost, because modern allocators have thread-local memory pools and now you're moving away from them. And if you're running your code on a NUMA system (most servers nowadays), when moving from one thread to another you can end up freeing non-local memory instead of local memory. Also, you won't have any backpressure on your allocations, and you are susceptible to running out of memory (especially because your deallocations now occur more slowly than they should).

Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.

pshc6y ago

Yes it’ll reduce latency, but doesn’t it also increase parallelism? A single-threaded program ought to improve overall, unless the extra overhead you mentioned dominates. A parallel program might improve or not.

I think if you wanted to do deferred destruction right, ideally you’d mod an allocator to have functions like (alloc_local, alloc_global, free_now, free_deferred) to avoid exhausting memory. Traits could make this ergonomic.

Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.

tsimionescu6y ago

> Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.

I think the point is that, if the same thread is doing both allocation and de-allocation, the thread is naturally prevented from allocating too much by the work it must do to de-allocate. If you move the de-allocation to another thread, your first thread may now be allocating like crazy, and the de-allocation thread may not be able to keep up.

In a real GC system, this is not that much of a problem, as the allocator and de-allocator can work with each other (if the allocator can't allocate any more memory, it will generally pause until the de-allocator can provide more memory before failing). But in this naive implementation, the allocator thread can exhaust all available memory and fail, even though there are a lot of objects waiting in the de-allocation queue.
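
One way to get some of that backpressure back without a GC is a bounded queue between the two threads. A minimal sketch, assuming `std::sync::mpsc::sync_channel`; the `bounded_dropper` name and the capacity of 4 are arbitrary choices for illustration.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Starts a dropper thread fed by a queue that holds at most `capacity` values.
/// The thread returns how many values it freed.
fn bounded_dropper<T: Send + 'static>(capacity: usize) -> (SyncSender<T>, JoinHandle<usize>) {
    let (tx, rx) = sync_channel::<T>(capacity);
    let handle = thread::spawn(move || {
        let mut freed = 0;
        for value in rx {
            drop(value);
            freed += 1;
        }
        freed
    });
    (tx, handle)
}

fn main() {
    let (tx, handle) = bounded_dropper::<Vec<u8>>(4);
    for _ in 0..100 {
        // `send` blocks while the queue is full, so the producer cannot run
        // arbitrarily far ahead of the thread doing the freeing.
        tx.send(vec![0u8; 1 << 20]).expect("dropper thread is gone");
    }
    drop(tx); // closing the channel ends the worker's loop
    let freed = handle.join().expect("dropper thread panicked");
    assert_eq!(freed, 100);
}
```

The trade-off is that the latency-sensitive thread can now stall on `send`, which is the point of backpressure but also what the original trick was trying to avoid.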

kevingadd6y ago

this could make your code faster by providing more consistent control flow: the main thread is always doing Work and your gc threads are always cleaning up dead objects. this provides fewer, better-predicted branches and code that's more likely to stay in the icache.

most gc based environments use dedicated threads for gc and finalizers, this is one reason to do so

edit: to be more specific:

your normal flow is to alloc at the top of your function, and at the bottom you dealloc. so in basically every case you are paying the cost of deallocs, but if the alloc is conditional the dealloc is now also conditional which is more branches to predict. the dealloc is probably also handled by functions so you have jumps/calls eating up branch prediction table space

in the gc/offloaded dealloc scenario, your deallocs on the work thread are no longer conditional because you're just handing addresses off to the gc. if your gc is STW you've added 'if (stop_requested) stop()' branches throughout your workload, but those are effectively 0-cost because stop_requested is always false (when it's true, the cost of the mispredict has no significance because your thread is about to suspend). the gc thread is always doing the same thing or waiting, and again when it's about to wait a branch mispredict cost has no significance.

jkoudys6y ago· 2 in thread

It'd be interesting to implement this on a type that would defer all of these drop threads (or one big drop threads built off a bunch of futures) until the end of some major action, like sending the http response on an actix-web thread. Could be a great way to get the fastest possible response time, since then the client has their response before any delay on cleanup.

AaronFriel6y ago

There is no such thing as a free lunch here, so it would reduce the unloaded response time but should have no effect (or a negative impact) on a highly loaded server's response time. I'm finding this out when benchmarking a message passing/queue management system. Anything I do to defer work onto a separate threadpool improves latency up to a point, then reduces throughput.

jkoudys6y ago

If you're bottlenecked, then certainly. There's no free lunch, but for us, problems that can be solved by simply scaling up the resources on the host are relatively cheap, practically free, compared with expensive developer time. When we're purely focused on sales and not anywhere close to hitting a full mem/CPU bottleneck, this would be good.

This situation you describe sounds a lot like dealing with garbage-collection cycles, so you give a good recommendation on something to watch out for, as rust performing at the level of a GC'd language removes a big reason for choosing rust.

andreygrehov6y ago· 2 in thread

Does anyone know how would this work in Go?

echlebek6y ago

Lots to be learned at https://blog.golang.org/ismmkeynote

arendtio6y ago

I have no idea, but my guess is that it doesn't matter, as the deallocation is being done by the garbage collection.

pierrebai6y ago· 1 in thread

I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...

They all have one thing in common: papering over a bad design.

In the particular example given, the sub-vectors probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For an example of such a design holding a large array of text strings, see this blog entry and its associated GitHub repo:

    https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
    https://github.com/pierrebai/FastTextContainer

Roughly it is this:

    struct TextHolder
    {
        const char* common_buffer;
        std::vector<const char*> internal_pointers;
    };

This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.

viraptor6y ago

Yes. There are also a number of pool/arena allocators in Rust which could be used here instead to drop all entries at once.

ncmncm6y ago· 1 in thread

There is nothing unique to Rust about this; it is a very old technique. It is usually much inferior to the "arena allocator" method, where all the discarded allocations are coalesced and released in a single, cheap operation that could as well be done without another thread. That method is practical in many languages, Rust possibly included. C++ supports it in the Standard Library, for all the standard containers.

If important work must be done in the destructors, it is still better to farm the work out to a thread pool, rather than starting another thread. Again, C++ supports this in its Standard Library, as I think Rust does too.

One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.

ShroudedNight6y ago

> C++ supports it in the Standard Library, for all the standard containers.

I don't know what the situation is today, but in the past, the GCC standard library containers had non-trivial destructors when running in debug mode. Ensuring their proper invocation was required to avoid dangling pointers in their book keeping. Non-obvious and painful to debug.

heftig6y ago· 1 in thread

If I seriously wanted to move object destruction off-thread, I would use at least a dedicated thread with a channel, so I could make sure the dropper is done at some point (before the program terminates, at the latest). It also avoids starting and stopping threads constantly.

Something like this: https://play.rust-lang.org/?version=stable&mode=debug&editio...

You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.
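
For readers who can't open the playground link, a sketch of the general idea (not the linked code): one long-lived worker thread fed by a channel, joined when the queue itself is dropped. `DropQueue` and `defer` are names invented for this example.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// A single background thread that drops whatever is sent to it.
struct DropQueue<T: Send + 'static> {
    tx: Option<Sender<T>>,
    worker: Option<JoinHandle<()>>,
}

impl<T: Send + 'static> DropQueue<T> {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<T>();
        let worker = thread::spawn(move || {
            // Runs until every sender is gone; each value is dropped here.
            for value in rx {
                drop(value);
            }
        });
        DropQueue { tx: Some(tx), worker: Some(worker) }
    }

    /// Hands `value` to the worker. If the worker is gone, the value comes
    /// back inside the error and is dropped on the calling thread instead.
    fn defer(&self, value: T) {
        if let Some(tx) = &self.tx {
            let _ = tx.send(value);
        }
    }
}

impl<T: Send + 'static> Drop for DropQueue<T> {
    fn drop(&mut self) {
        drop(self.tx.take()); // close the channel so the worker's loop ends
        if let Some(worker) = self.worker.take() {
            let _ = worker.join(); // wait until all pending drops are done
        }
    }
}

fn main() {
    let queue = DropQueue::new();
    let heavy: Vec<Vec<u8>> = (0..10_000).map(|_| vec![0u8; 64]).collect();
    queue.defer(heavy); // returns immediately
    // `queue` is dropped at the end of main, which waits for the worker.
}
```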

ReactiveJelly6y ago

Someone is working on this as a direct response to this blog:

https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...

And yes, spawning a thread for every drop is horrible. It's just to prove the concept. The defer_drop crate uses a global worker thread.

dathinab6y ago· 1 in thread

One thing I just noticed is that the example doesn't make sure to actually run the new thread to completion before the main thread exits.

This means that if you do a "drop in other thread" and then main exits, the drop might never run. Which is often fine, as the exit of main causes process termination and as such will free the memory normally anyway.

But it would be a problem on some systems where memory cleanup on process exit is less reliable. Though such systems are rarer by now, I think.
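
If the destructor has to run (say it flushes something), keeping the `JoinHandle` and joining it before exit is enough. A small sketch; `drop_in_background` is a made-up helper name.

```rust
use std::thread::{self, JoinHandle};

/// Moves `value` to a new thread and drops it there.
/// The `Send + 'static` bounds are what `thread::spawn` requires.
fn drop_in_background<T: Send + 'static>(value: T) -> JoinHandle<()> {
    thread::spawn(move || drop(value))
}

fn main() {
    let heavy: Vec<Vec<u8>> = (0..100_000).map(|_| vec![0u8; 16]).collect();
    let handle = drop_in_background(heavy);

    // ... latency-sensitive work continues here ...

    // Before exiting, wait for the destructor to finish.
    handle.join().expect("dropper thread panicked");
}
```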

ReactiveJelly6y ago

It would have to be a non-desktop system.

I'm pretty sure Linux will always free process-private memory, and threads, and file descriptors when a process exits.

The only things that can leak in typical cases are some kinds of shared memory and maybe child processes?

SilasX6y ago· 1 in thread

Completely different dynamic (because no Rust GC), but this reminds me of how Twitch made their server, written in Go, a lot faster by allocating a bunch of dummy memory at the beginning so the garbage collector doesn't trigger nearly as often:

https://news.ycombinator.com/item?id=21670110

the84726y ago

The Java equivalent to the Go case would simply be adjusting the -Xms flag. The Go approach is needlessly convoluted because the runtime doesn't offer any tuning knobs.

As for the rust case, if you squint then it's similar to a concurrent collector.

grogers6y ago· 1 in thread

Contrived examples like this are ridiculous. Creating such a heavy thing is likely even more expensive than tearing it down. So unless you create it on a separate thread, you probably shouldn't be freeing it on a separate one. It's not going to solve your interactivity problem. If you are creating the object on a separate thread then it's already going to be natural to free it on a separate one too.

ReactiveJelly6y ago

Something is better than nothing.

rhacker6y ago· 1 in thread

Pass by reference?

bszupnick6y ago

If you pass by reference the heavy object won't be dropped. If your goal is to drop a heavy object, this is a cool way to do it.

jeffdavis6y ago

Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".

The points raised in this article are really different:

* don't do slow stuff in your latency-critical path

* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)

* dropping can be slow

The first and second points are good, but not really related to rust, deallocations, or the number 10000.

The last point is worth discussing, but still not really related to the number 10000 and barely related to rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages would use a more deferred strategy (like many GCs).

It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations because they are all dead (unreachable by the root), and it just discards them. But that's not the full story either.

It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.

The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.

In rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there was a little better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
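
A sketch of the "slices into a larger buffer" approach for the million-strings case: all text lives in one `String` and each entry is just a byte range into it. `PackedStrings` is an invented name, not a standard type.

```rust
use std::ops::Range;

/// Many strings stored back to back in a single buffer.
/// Dropping this frees two allocations, no matter how many strings it holds.
struct PackedStrings {
    buffer: String,
    spans: Vec<Range<usize>>,
}

impl PackedStrings {
    fn new() -> Self {
        PackedStrings { buffer: String::new(), spans: Vec::new() }
    }

    /// Appends `s` to the shared buffer and records where it lives.
    fn push(&mut self, s: &str) {
        let start = self.buffer.len();
        self.buffer.push_str(s);
        self.spans.push(start..self.buffer.len());
    }

    /// Returns the string at `index` as a slice of the shared buffer.
    fn get(&self, index: usize) -> Option<&str> {
        let span = self.spans.get(index)?;
        Some(&self.buffer[span.clone()])
    }

    fn len(&self) -> usize {
        self.spans.len()
    }
}

fn main() {
    let mut names = PackedStrings::new();
    for i in 0..1_000_000 {
        names.push(&i.to_string());
    }
    assert_eq!(names.len(), 1_000_000);
    assert_eq!(names.get(42), Some("42"));
    // Dropping `names` releases the buffer and the span table, with no
    // per-string traversal.
}
```

The cost is that individual entries can't be removed or resized cheaply, so this fits data that is built once and discarded together.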

chubot6y ago

Looks like Evan Wallace ran into the same issue in practice in esbuild

https://news.ycombinator.com/item?id=22336284

I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.

The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.

ESBuild is a really impressive performance-oriented project:

https://github.com/evanw/esbuild

The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch: https://github.com/rust-lang/rust/issues/34283

(copy of lobste.rs comment)

Animats6y ago

There's a worse case in deallocation. Tracing through a data structure being released for a long-running program can cause page faults, unused data having been swapped out. This is part of why some programs take far too long to exit.

earthboundkid6y ago

Maybe some sort of “collector” could come by and clean up “garbage” memory periodically to improve performance…

snicker76y ago

I wonder if it might be possible for OS's to provide a fast, asynchronous way of deallocating memory.

wmichelin6y ago

Minor typo, `froget` instead of `forget`

thickice6y ago

Is this applicable for Go as well ?

thePunisher6y ago

The obvious solution would be to borrow the HeavyThing instead of having it dropped inside the function.

crimsonalucard16y ago

I guess choosing when or how a program deallocates is important in a language that's close to the metal.

Rust tries to be zero cost while providing abstractions that make it seemingly a high level language but ultimately things like this show that it's not exactly zero cost because abstractions can incur hidden penalties. There needs to be some internal syntax that allows a rust user to explicitly control deallocation when needed.

If I started reading code where people would randomly move a value into another thread and essentially do nothing I would be extremely confused. Any language that begins to rely on "trick" or "hacks" as standard patterns exposes a design flaw.

Maybe if rust provided special syntax that a function can be decorated with so that it does deallocation in another thread automatically? Or maybe an internal function called drop_async...? This would make this pattern an explicit part of the language rather than a strange hack/trick.

floppy1236y ago

Why should I ever need to drop a heavy object just to get a size? Not in C++ and also not in Rust. The different-thread idea is just creative stupidity, sorry.

fpgaminer6y ago· 26 in thread

Some important things I think people should note before blindly commenting:

* The example code is obviously contrived. The real gist is that massive deallocations in the UI thread cause lag, which the example code proves. That very thing can easily happen in the real world.

* I didn't see any difference on my machine between a debug build and a release build.

* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.

ckcheng6y ago

This reminds me of the exploding ultimate GC technique [1]:

[1]: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...

all-fakes6y ago

Yes, Java also has a garbage collector that does nothing, called the Epsilon GC, intended for short-lived programs and references for garbage collector benchmarks.[0]

[0]: https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...

inopinatus6y ago

The bumper sticker is “Ada programmers prove the algorithm terminates.”

blub6y ago

I guess the bigger question is why they use dynamic memory allocation in the first place.

rubber_duck6y ago

And then someone tries to use your compiler as a service (code analysis, change triggered compiler) and it's a dead end

yrro6y ago

As distasteful as leaky code is, is it that bad to run it in a separate process? You get a bit more robustness against crashes as well.

hinkley6y ago

One of my “favorite” snags in perf analysis is that periodicity in allocations can misattribute the cost of allocations to the wrong function.

If I allocate just enough memory, but not too much, then pauses for defragmentation of free space may be costed to the code that calls me.

rini176y ago

Doing unrelated cleanup sounds like flushing CPU cache per every allocation.

papaf6y ago

This deallocation trick is neat but in C and C++ you could use a memory pool to do this.

In theory, you could also use a memory pool in Rust but I think the standard library uses malloc without some way of overriding this behaviour.

vvanders6y ago

You can also use the typed-arena crate[0] or roll your own if you're feeling like cracking open unsafe.

[0] https://crates.io/crates/typed-arena

orf6y ago

You can change the global allocator in any rust project. You can write your own easy enough, or use one like jemalloc

estebank6y ago

Edit: this will also break any code that relies on Drop being called for clean up, but that is already a "suspect"/incorrect pattern because there are no assurances that it will ever run.

projektfu6y ago

Yeah, I like Apple’s (Next’s) approach of pool allocation for each run through the event loop. Defer dealloc, drop pool at the end.

CoolGuySteve6y ago

Even just keeping a free list and deallocating its elements at an idle time is probably cheaper and faster than spawning a thread.
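
A single-threaded sketch of that idea: park values in a list and free a bounded number of them whenever the event loop is idle. `Graveyard`, `defer`, and `reclaim` are names invented here.

```rust
/// Holds values whose destruction has been postponed.
struct Graveyard<T> {
    pending: Vec<T>,
}

impl<T> Graveyard<T> {
    fn new() -> Self {
        Graveyard { pending: Vec::new() }
    }

    /// Postpones dropping `value`; this is just a push, so it is cheap.
    fn defer(&mut self, value: T) {
        self.pending.push(value);
    }

    /// Drops up to `budget` pending values and returns how many were dropped.
    /// Meant to be called when the caller has idle time.
    fn reclaim(&mut self, budget: usize) -> usize {
        let count = budget.min(self.pending.len());
        let keep = self.pending.len() - count;
        self.pending.drain(keep..).for_each(drop);
        count
    }

    fn pending_len(&self) -> usize {
        self.pending.len()
    }
}

fn main() {
    let mut graveyard = Graveyard::new();
    for _ in 0..5 {
        graveyard.defer(vec![0u8; 1024]);
    }
    // Later, during idle time, free a couple of entries per tick.
    assert_eq!(graveyard.reclaim(2), 2);
    assert_eq!(graveyard.pending_len(), 3);
}
```

Note that the budget counts top-level values; one huge value is still freed in a single step.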

jomohke6y ago

You can easily use a custom global allocator in Rust:

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;
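
A fuller sketch of what such an allocator might look like, assuming the `std::alloc::GlobalAlloc` trait. This one (`CountingAllocator`, a made-up name) just forwards to the system allocator and counts `dealloc` calls, which also shows how many frees a nested vector costs.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of `dealloc` calls seen so far (illustration only).
static DEALLOCS: AtomicUsize = AtomicUsize::new(0);

/// Forwards every request to the system allocator and counts frees.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

fn main() {
    let nested: Vec<Vec<u8>> = (0..1_000).map(|_| vec![0u8; 8]).collect();
    black_box(&nested); // keep the optimizer from removing the allocations
    let before = DEALLOCS.load(Ordering::Relaxed);
    drop(nested);
    let after = DEALLOCS.load(Ordering::Relaxed);
    // One free per inner vector, plus one for the outer buffer.
    assert!(after - before >= 1_000);
}
```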

pjmlp6y ago

In C++/WinRT the same approach is taken, because you cannot just use a memory pool for COM.

indemnity6y ago

Isn’t a Rust “move” implemented as a bit wise copy (e.g. memcpy call)? I see people claiming move has no cost but I’m not sure that is true.

justinpombrio6y ago

I was thinking of the following code, where I believe the assignment to y is actually free. Though apparently this isn't called a "move".

    let x = <<large owned type like [char; 1000]>>;
    let y = x;

More info: https://doc.rust-lang.org/rust-by-example/scope/move.html

dathinab6y ago

What is bit-wise copied is the pointer to the memory.

I.e. a `HashMap` struct, or `Vec` struct don't directly contain the data.

For example the `Vec` is defined internally as something similar to:

`struct Vec<T> { data: *mut [T], capacity: usize, len: usize, marker: PhantomData<T> }`

(Slightly simplified, not actual Vec type).

So a move of a Vec copies at most 3 usize (24 bytes on 64-bit systems); similar things apply to a HashMap.

Additionally the copy can often be elided through compiler optimizations.
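
A quick way to check this, as a sketch: the `Vec` handle is three words with the current standard library, and the heap pointer is the same before and after a move.

```rust
use std::mem::size_of;

fn main() {
    // The handle itself: pointer, capacity, length.
    assert_eq!(size_of::<Vec<u64>>(), 3 * size_of::<usize>());

    let v: Vec<u64> = (0..1_000_000).collect();
    let heap_ptr = v.as_ptr();

    // Moving copies only the handle; the million elements stay where they are.
    let moved = v;
    assert_eq!(heap_ptr, moved.as_ptr());
    assert_eq!(moved.len(), 1_000_000);
}
```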

steveklabnik6y ago

Semantically, it is a bit wise copy, yes.

However, these copies can often be elided by optimizations.

ndesaulniers6y ago

> As an aside, compilers have used the trick of not free-ing data structures before

In Clang, this flag is `-Xclang -disable-free`. Not from a Jedi...

GuB-426y ago

I don't know if Rust can do it (unsafe?) but in C and C++, I sometimes end up writing a custom allocator. It is often one of the most significant optimizations.

jfkebwjsbx6y ago

> That very thing can easily happen in the real world.

Only if badly designed. That is why it is contrived!

> While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon

A program dealing with a million elements of any kind should not be performing naive allocations to begin with.

> we do deallocation trickery in the real world

Skipping deallocations is an optimization, not a design pattern.

In other words, the code needs to keep the ability to perform the deallocation for debugging, testing, usage as a library, etc.

chowells6y ago· 16 in thread

Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.

There's actually a lot to be said for Rust here: the ownership system lets you transfer freeing responsibility off-thread safely and cheaply so that it doesn't block the critical path.

But overall, there's nothing really unexpected here, if you're familiar with memory management.

Jasper_6y ago

http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

pron6y ago

jeffdavis6y ago

"Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant."

titzer6y ago

This is most decidedly not true for generational GCs, and for concurrent GCs, the tracing work happens asynchronously and in parallel, on other cores, not taking time on the main thread.

Reelin6y ago

> the ownership system lets you transfer freeing responsibility off-thread safely and cheaply in order to not have it block the critical path

This can also trivially be done in other languages. Atomically append your pointer to a queue of "large things that need to be freed" and move on as though you had actually called free.

jacobparker6y ago

OP said safely; what you're describing isn't safe in, say, C++ in the same sense that it is in Rust.

im3w1l6y ago

Consider this code

    {
        Window a;
        ClickHandler* b = new ClickHandler(&a);
        delete b;
    }

Let's say b tries to deregister itself when it's deleted. This code will work as written. But if you defer the deletion of b, then stack allocated Window a may already be gone.
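
The same hazard can be sketched in Rust terms. `thread::spawn` requires a `'static` closure, so a handler that borrows a stack-allocated window cannot be handed to a detached thread at all; `std::thread::scope` (Rust 1.63+) accepts it but joins the thread before the window can go away. The `Window` and `ClickHandler` types below are hypothetical stand-ins for the C++ snippet, not a real GUI API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical window that counts its registered handlers.
pub struct Window {
    pub handlers: AtomicUsize,
}

/// Hypothetical handler that borrows the window and deregisters itself on drop.
pub struct ClickHandler<'w> {
    window: &'w Window,
}

impl<'w> ClickHandler<'w> {
    pub fn new(window: &'w Window) -> Self {
        window.handlers.fetch_add(1, Ordering::SeqCst);
        ClickHandler { window }
    }
}

impl Drop for ClickHandler<'_> {
    fn drop(&mut self) {
        // Touches the borrowed window, so the window must still be alive here.
        self.window.handlers.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Drops a handler on another thread while the window is guaranteed to be alive.
pub fn drop_handler_off_thread(window: &Window) {
    let handler = ClickHandler::new(window);
    thread::scope(|s| {
        s.spawn(move || drop(handler));
    }); // every scoped thread is joined before `scope` returns
}
```

Because the scope blocks until the drop has finished, this illustrates the safety rule rather than a latency win.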

loufe6y ago

I've not worked with any language thus far without automatic garbage collecting, so this was definitely a neat read for me. It sounds rather elegant.

burpsnard6y ago

It's worth popping the hood and getting your fingers dirty. C was written in an era where memory was a scarce and precious resource to be grudgingly used if absolutely necessary

arcticbull6y ago

That doesn't seem to make intuitive sense. A GC has the same problem.

The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

chowells6y ago

I said generational/compacting collector. You're talking about a mark and sweep collector.

As I said, this has the opposite problem. When the live set becomes huge, this can drag performance. When the live set is small, it doesn't matter how much garbage it produces, performance is fast.

pron6y ago

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not

> and to invoke finalizers

As others have said, finalizers are very uncommon and, in fact, have been deprecated in Java.

Reelin6y ago

> The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

Isn't that incompatible with RAII though?

tsimionescu6y ago

> That doesn't seem to make intuitive sense. A GC has the same problem.

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not, and to invoke finalizers.

mcguire6y ago

Finalizers/destructors do not work well in garbage collected languages, for that very reason.

saagarjha6y ago

Usually in a background thread ;)

saagarjha6y ago· 9 in thread

Reelin6y ago

tedunangst6y ago

It would be helpful to see an example from a real application, too.

epage6y ago

I believe this is contrived to prove a point.

I vaguely remember reading about Google killing processes rather than having them clean up correctly, relying on the OS to properly clean up any resources of significance.

seventh-chord6y ago

Reelin6y ago

> killing processes rather than having them clean up correctly, relying on the OS

I recall Firefox preventing cleanup code from running when you quit a few years ago. Prior to that, quitting with a lot of pages open (ie hundreds) could cause it to lock up for quite some time.

pjmlp6y ago

Not at all, Herb Sutter has a CppCon talk about this kind of optimisations.

It is also the approach taken by C++/WinRT, COM and UWP components get moved into a background cleaning thread, to avoid application pauses on complex data structures reaching zero count.

nickm126y ago

ashtonkem6y ago

It’s a contrived example to demonstrate the technique.

Areading3146y ago

Right there is no reason to pass ownership to a function like this.

Ididntdothis6y ago· 8 in thread

masklinn6y ago

> Is Rust basically based on unique_ptr?

Separately it has a smart pointer which is the dual of unique_ptr (Box), with the guarantee noted above:

    let b = Box::new(1);
    drop(b);
    println!("{}", b);

will not compile because the second line moves the box, after which it can't be used because it's been removed entirely from this scope.

wnoise6y ago

> which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear

To be fair, so do 90+% of programmers. Much of rust's benefit in safe code is training programmers to avoid code like that where possible, and spreading design patterns that avoid it.

saagarjha6y ago

Rust basically gives the compiler understanding of unique_ptr and prevents you from using it after you’ve moved it.

Ididntdothis6y ago

zozbot2346y ago

> One problem with this approach was that you still had to wait for these threads when the application would shut down.

If you know that an object will live for the rest of the program and not need any finalization logic, Rust allows you to "leak" it and save that overhead on shutdown.
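
A short sketch of the two standard ways to do that, `Box::leak` and `std::mem::forget`. The function names are made up for illustration; in both cases the memory is only reclaimed by the OS when the process exits:

```rust
/// Leaks the vector on purpose and hands back a reference that is valid
/// for the rest of the program. No destructor will ever run for it.
pub fn leak_for_program_lifetime(items: Vec<String>) -> &'static [String] {
    Box::leak(items.into_boxed_slice())
}

/// Reads the length, then skips the destructor entirely.
/// The heap memory stays allocated until the process exits.
pub fn len_then_forget(items: Vec<String>) -> usize {
    let len = items.len();
    std::mem::forget(items);
    len
}
```

`Box::leak` is the better fit when the data is still needed afterwards; `mem::forget` only avoids the cost of tearing it down.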

ordu6y ago

You could have just one thread and kill it at exit. Do not start a new thread for each closure that drops an object; send the closures to one special thread instead.
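
One way this could look, as a rough sketch: a single worker thread that receives values over a channel and drops them. It sends boxed values rather than closures, which is a simplification; `Garbage`, `spawn_dropper` and `defer_drop` are hypothetical names, and the extra `Box` allocation per value is a cost this version accepts for simplicity:

```rust
use std::any::Any;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Anything that may be sent to the worker to be dropped there.
pub type Garbage = Box<dyn Any + Send>;

/// Starts the single dropper thread. The thread ends once every `Sender`
/// is gone and returns how many values it dropped.
pub fn spawn_dropper() -> (Sender<Garbage>, JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Garbage>();
    let handle = thread::spawn(move || {
        let mut dropped = 0;
        for garbage in rx {
            drop(garbage); // the destructor runs here, off the caller's thread
            dropped += 1;
        }
        dropped
    });
    (tx, handle)
}

/// Hands `value` to the dropper thread instead of dropping it in place.
pub fn defer_drop<T: Send + 'static>(tx: &Sender<Garbage>, value: T) {
    // If the worker is already gone, the value is simply dropped here.
    let _ = tx.send(Box::new(value));
}
```

At shutdown the caller can either drop the sender and join the handle, or just exit and let the OS reclaim whatever is still queued.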

qcoh6y ago

Out of curiosity, how did you do that in C++?

Ididntdothis6y ago

It depends. Either iterate over the vector and delete the objects or just call clear(). Obviously you have to be sure that nobody else is accessing it at the same time.

cesarb6y ago· 6 in thread

lostmyoldone6y ago

Granted, the way the type system works you usually know the type of a variable quite well, but could this happen with opaque types?

I'm very much out of my depth, but it felt like one of those things that could really bite you if you are unaware, as happened with finalizers in Java decades ago.

masklinn6y ago

the84726y ago

Considering that writing files can also block the process you probably don't want to have that in your latency-sensitive parts either, so you'll have to optimize that one way or another anyway.

For the more general problem you can also dedicate more threads to the task or apply backpressure.

ablu6y ago

dirtydroog6y ago

shared_ptr all the things? If so, they may as well write in Java.

usefulcat6y ago

cperciva6y ago· 6 in thread

If freeing the data structure in question takes this long, how much time are you wasting duplicating the data structure?

bszupnick6y ago

This code doesn't duplicate it. In Rust, when a variable is passed as an argument to a function, its "ownership" moves into the scope of that function.

https://doc.rust-lang.org/book/ch04-01-what-is-ownership.htm...

cperciva6y ago

saagarjha6y ago

I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.

Reelin6y ago

> I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

That being said, it's a pretty bad example if they were actually profiling the debug build ...

fpgaminer6y ago

> it seems like turning on optimizations seems to improve the situation quite a bit.

I'm not seeing that on my local machine? Were you comparing on the Playground which would be quite variable in its results?

    > cargo build
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished dev [unoptimized + debuginfo] target(s) in 0.42s
    > ./target/debug/foo
    drop in another thread 52.121µs
    drop in this thread 514.687233ms
    >
    >
    > cargo build --release
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished release [optimized] target(s) in 0.47s
    > ./target/release/foo
    drop in another thread 48.418µs
    drop in this thread 548.005373ms

firethief6y ago

> Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.

This is the most important point in the thread, since it invalidates the results for most purposes.

andrewfromx6y ago· 6 in thread

Hmm, my first thought is that having to do that is a lot like C and cleaning up my own allocations. This feels like something Rust should automatically do for me?

ashtonkem6y ago

ReaLNero6y ago

madmax966y ago

>A profiler can tell you when you should drop asynchronously

Is there any profiler that does this today?

What are the drawbacks with asynchronous drops?

klyrs6y ago

devit6y ago

Because it's impossible to do this automatically in the general case.

In particular, types may not be sendable to other threads, or may have side effects on dropping, and in those cases you would need to rearchitect the code before you can apply this technique.

maxton6y ago· 5 in thread

ehsanu16y ago

If you never drop it, you have a memory leak. If the caller drops it, it's still the same as the `get_size` dropping it in terms of performance impact.

Generally you'd only pass ownership when that's needed for some reason. So this toy example might not be realistic but it does demonstrate the performance impact.

epage6y ago

For these contrived cases, yes, you would just pass a reference to the function but I think the point is to simplify the case down to demonstrate a point.

burpsnard6y ago

In the olden days, it was just out.flush(); out.close();

heavenlyblue6y ago

So the caller of the function still needs to free HeavyThing in the same thread.

Cyph0n6y ago

You’re spot on: this is simply a bad example that you would never see in a real application.

dirtydroog6y ago· 5 in thread

Oh my good god.

I'm hoping this is down to developer naivety rather than being a feature of rust.

wizzwizz46y ago

It's not a feature of Rust; it's a "feature" of the way we design operating systems and processors. This is the same in C.

ReactiveJelly6y ago

The same could happen in C++, I think. Destructors are supposed to be called recursively.

sockgrant6y ago

1) he should pass by reference to avoid the extra copy. So in his example yes it’s dev naivety

renewiltord6y ago

Where's the extra copy? I don't see one. He's moving the struct into the function, getting size and then dropping it.

VWWHFSfQ6y ago

> avoid the extra copy

there is no copy happening here

epage6y ago· 4 in thread

For those wanting a real world example where this can be useful:

elcomet6y ago

That's not really the same issue that is mentioned in the article though, is it?

The issue from the article would be solved by just passing a reference to the variable.

In your case, cleanup is an action that needs to be done before writing new files. So you have to wait for cleanup anyway, don't you ?

ashtonkem6y ago

That's not true.

firethief6y ago

Why can't it cleanup right after the work?

pmontra6y ago

cs7026y ago· 4 in thread

In other words, Rust's automagical memory deallocation is NOT a zero-cost abstraction:

  fn get_len1(things: HeavyThings) -> usize {
      things.len()
  }

  fn get_len2(things: HeavyThings) -> usize {
      let len = things.len();
      thread::spawn(move || drop(things));
      len
  }

The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.

See also this comment by chowells: https://news.ycombinator.com/item?id=23362925

dathinab6y ago

No the zero-cost refers to the abstraction (and runtime cost), which still is zero-cost. Deallocating is part of the normal work load not the abstraction.

Now it's (I think) generally known that certain kinds (not all) of GC do make some things simpler for GUI-like usage, though they also tend to give you less control.

Also, here is a `get_len` that is faster than both and also more idiomatic Rust:

    fn get_len1(things: &HeavyThings) -> usize {
        things.len()
    }

If you have a certain thread (e.g. UI thread) in which you never want to do any cleanup work you can consider using a container like:

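    use std::thread;

    // `'static` is required because `thread::spawn` takes a `'static` closure.
    struct DropElsewhere<T: Send + 'static>(pub Option<T>);

    impl<T: Send + 'static> Drop for DropElsewhere<T> {
        fn drop(&mut self) {
            // Take the value out of the wrapper and run its destructor on a new thread.
            if let Some(value) = self.0.take() {
                thread::spawn(move || drop(value));
            }
        }
    }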

You can optimize this with `ManuallyDrop` to have close to zero runtime overhead (it removes the `take` and `if let` part).
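
A sketch of what that optimization might look like, using `std::mem::ManuallyDrop` from the standard library. The `new` constructor and the `Deref` impl are additions for convenience, not part of the comment above, and it still spawns one thread per drop:

```rust
use std::mem::ManuallyDrop;
use std::ops::Deref;
use std::thread;

/// Wrapper whose contents are dropped on a freshly spawned thread.
pub struct DropElsewhere<T: Send + 'static>(ManuallyDrop<T>);

impl<T: Send + 'static> DropElsewhere<T> {
    pub fn new(value: T) -> Self {
        DropElsewhere(ManuallyDrop::new(value))
    }
}

impl<T: Send + 'static> Deref for DropElsewhere<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

impl<T: Send + 'static> Drop for DropElsewhere<T> {
    fn drop(&mut self) {
        // SAFETY: we are inside `drop`, so `self.0` is never touched again
        // after the value has been moved out of the slot.
        let value = unsafe { ManuallyDrop::take(&mut self.0) };
        thread::spawn(move || drop(value));
    }
}
```

Compared with the `Option` version there is no discriminant to store or check, at the price of one `unsafe` line.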

cs7026y ago

> No the zero-cost refers to the abstraction (and runtime cost), which still is zero-cost. Deallocating is part of the normal work load not the abstraction.

Yeah, you're right. In hindsight my comment was poorly thought-out and poorly written.

DasIch6y ago

Nothing about how Rust handles deallocation is magical in any way.

cs7026y ago

Yeah, you're right. In hindsight this was a poorly thought-out and poorly written post on my part.

staticfloat6y ago· 4 in thread

If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.

Here's a link to a Rust Playground with just those two characters changed: https://play.rust-lang.org/?version=stable&mode=debug&editio...

heftig6y ago

As already mentioned, Rust wasn't copying anything; the `HashMap` is not a `Copy`-able type, so it was just moved around (it's also not very large: all its items are behind a pointer to the heap).

All you did was move the drop from the `fn_that_drops_heavy_things` to the end of `main`, where it is outside the timing function.
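
To make that concrete, here is a small sketch with a destructor that records when it runs. `Heavy` and the two functions are stand-ins invented for this example, not the code from the article:

```rust
use std::cell::Cell;

/// Stand-in for the heavy object; sets a flag when its destructor runs.
pub struct Heavy<'a> {
    pub items: Vec<String>,
    pub dropped: &'a Cell<bool>,
}

impl Drop for Heavy<'_> {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

/// Takes ownership: `heavy` is dropped before this function returns.
pub fn len_by_value(heavy: Heavy<'_>) -> usize {
    heavy.items.len()
}

/// Borrows: nothing is dropped here, the caller pays for the drop later.
pub fn len_by_ref(heavy: &Heavy<'_>) -> usize {
    heavy.items.len()
}
```

After `len_by_ref` the flag is still unset; the same deallocation work happens whenever the owner finally goes out of scope.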

MaulingMonkey6y ago

> but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

fpgaminer6y ago

Rust isn't copying anything; everything in the original code would be a move.

heavenlyblue6y ago

If your function takes a reference to the object, something still needs to free it.

littlestymaar6y ago· 3 in thread

The title is slightly wrong: it's not going to make your code faster, it's going to reduce latency on the given thread.

Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.

pshc6y ago

Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.

tsimionescu6y ago

> Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.

kevingadd6y ago

most gc based environments use dedicated threads for gc and finalizers, this is one reason to do so

edit: to be more specific:

jkoudys6y ago· 2 in thread

AaronFriel6y ago

jkoudys6y ago

andreygrehov6y ago· 2 in thread

Does anyone know how would this work in Go?

echlebek6y ago

Lots to be learned at https://blog.golang.org/ismmkeynote

arendtio6y ago

I have no idea, but my guess is that it doesn't matter, as the deallocation is being done by the garbage collection.

pierrebai6y ago· 1 in thread

I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...

They all have one thing in common: papering over a bad design.

    https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
    https://github.com/pierrebai/FastTextContainer

Roughly it is this:

    struct TextHolder
    {
        const char* common_buffer;
        std::vector<const char*> internal_pointers;
    };

This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.
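
A rough Rust rendering of the same layout, as a sketch only: it stores byte offsets into one shared buffer instead of raw pointers, and the method names are invented here. Dropping it frees two allocations no matter how many strings it holds:

```rust
/// All strings live back to back in one buffer; `spans` records where each one is.
pub struct TextHolder {
    buffer: String,
    spans: Vec<(usize, usize)>, // (start, end) byte offsets into `buffer`
}

impl TextHolder {
    pub fn new() -> Self {
        TextHolder { buffer: String::new(), spans: Vec::new() }
    }

    /// Appends `text` to the shared buffer and remembers its position.
    pub fn push(&mut self, text: &str) {
        let start = self.buffer.len();
        self.buffer.push_str(text);
        self.spans.push((start, self.buffer.len()));
    }

    /// Returns the string stored at `index`, if any.
    pub fn get(&self, index: usize) -> Option<&str> {
        self.spans.get(index).map(|&(start, end)| &self.buffer[start..end])
    }

    /// Number of strings stored.
    pub fn len(&self) -> usize {
        self.spans.len()
    }
}
```

The trade-off is that individual strings can no longer be removed or grown in place.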

viraptor6y ago

Yes. There are also a number of pool/arena allocators in Rust which could be used here instead to drop all entries at once.

ncmncm6y ago· 1 in thread

One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.

ShroudedNight6y ago

> C++ supports it in the Standard Library, for all the standard containers.

heftig6y ago· 1 in thread

Something like this: https://play.rust-lang.org/?version=stable&mode=debug&editio...

You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.

ReactiveJelly6y ago

Someone is working on this as a direct response to this blog:

https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...

And yes, spawning a thread for every drop is horrible. It's just to prove the concept. The defer_drop crate uses a global worker thread.

dathinab6y ago· 1 in thread

One thing I just noticed is that the example doesn't make sure to actually run the new thread to completion before the main thread exits.

But it would be a problem on some systems where memory cleanup on process exit is less reliable, though I think such systems are rarer by now.
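
If that matters, one option is to keep the `JoinHandle` and join it before exiting. A minimal sketch, where the `HeavyThings` alias is an assumption about the article's type and `get_len_deferred` is a made-up name:

```rust
use std::collections::HashMap;
use std::thread::{self, JoinHandle};

/// Assumed shape of the heavy value; the article's actual type may differ.
pub type HeavyThings = HashMap<usize, Vec<usize>>;

/// Returns the length immediately, plus a handle to the thread doing the drop.
pub fn get_len_deferred(things: HeavyThings) -> (usize, JoinHandle<()>) {
    let len = things.len();
    let handle = thread::spawn(move || drop(things));
    (len, handle)
}
```

Calling `handle.join()` at the end of `main` guarantees the destructor has run; ignoring the handle gives the behaviour described above.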

ReactiveJelly6y ago

It would have to be a non-desktop system.

I'm pretty sure Linux will always free process-private memory, and threads, and file descriptors when a process exits.

The only things that can leak in typical cases are some kinds of shared memory and maybe child processes?

SilasX6y ago· 1 in thread

https://news.ycombinator.com/item?id=21670110

the84726y ago

The Java equivalent to the Go case would simply be adjusting the -Xms flag. The Go approach is needlessly convoluted because the runtime doesn't offer any tuning knobs.

As for the rust case, if you squint then it's similar to a concurrent collector.

grogers6y ago· 1 in thread

ReactiveJelly6y ago

Something is better than nothing.

rhacker6y ago· 1 in thread

Pass by reference?

bszupnick6y ago

If you pass by reference the heavy object won't be dropped. If your goal is to drop a heavy object, this is a cool way to do it.

jeffdavis6y ago

Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".

The points raised in this article are really different:

* don't do slow stuff in your latency-critical path

* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)

* dropping can be slow

The first and second points are good, but not really related to rust, deallocations, or the number 10000.

chubot6y ago

Looks like Evan Wallace ran into the same issue in practice in esbuild

https://news.ycombinator.com/item?id=22336284

I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.

ESBuild is a really impressive performance-oriented project:

https://github.com/evanw/esbuild

(copy of lobste.rs comment)

Animats6y ago

earthboundkid6y ago

Maybe some sort of “collector” could come by and clean up “garbage” memory periodically to improve performance…

snicker76y ago

I wonder if it might be possible for OS's to provide a fast, asynchronous way of deallocating memory.

wmichelin6y ago

Minor typo, `froget` instead of `forget`

thickice6y ago

Is this applicable for Go as well ?

thePunisher6y ago

The obvious solution would be to borrow the HeavyThing instead of having it dropped inside the function.

crimsonalucard16y ago

I guess choosing when or how a program deallocates is important in a language that's close to the metal.

floppy1236y ago

Why should I ever need to drop a heavy object just to get its size? Not in C++ and also not in Rust. The different-thread idea is just creative stupidity, sorry.
