Thread stacks are effectively manually allocated blocks of memory. You create a thread, which allocates the stack, and as long as the thread lives, the stack is kept alive - it's self-sustaining. The thread must die by explicit programmatic action, which in turn will free its allocated block of stack memory.
Using finalizers at all is usually an anti-pattern in a GC world. The presence of finalizers is a very strong hint that the GC is being used to manage resources other than memory, something that GC is a poor fit for, because other resources almost certainly have no necessary correlation with memory pressure; and GCs usually only monitor GC heap memory pressure.
That's not to say that there aren't plenty of edge cases where you can end up with lots of false roots that artificially lengthen object lifetimes with a conservative GC. Putting a thread stack in your cycle of object references and relying on GC pressure to break the cycle isn't a strongly motivating one to my mind, though.
Of course, there are many other examples of false roots, but this particular class caused unexpected OOMs in several of our clients' applications, so we made this small sample and used it for sanity checking while implementing the precise GC (and then mentioned it in the post).
You also have to use them as a safeguard against programmers who are not used to managing scope manually and therefore fail to do so. In some cases it's also just not a practical expectation, as the language is structured entirely around the idea that you aren't supposed to be managing scope manually. An open file that weaves its way through an application as the constant read/write backing store for something? It can often be non-trivial to figure out exactly when to call close() on it, and by the time you've written such a system you've re-invented manual reference counting (error-prone) or a garbage collector of some sort anyway, and should have just used a finalizer.
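As a concrete illustration of the "safety net" role: in modern Java, java.lang.ref.Cleaner is the sanctioned replacement for finalize() for exactly this purpose. Below is a minimal sketch with a hypothetical LogFile class (the name and resource are made up); close() is the explicit path, and the registered Cleaner only steps in if the caller forgets.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class LogFile implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup state must NOT reference the LogFile itself,
    // or the object would never become unreachable.
    static final class State implements Runnable {
        final AtomicBoolean open = new AtomicBoolean(true);
        @Override public void run() {               // runs at most once
            if (open.compareAndSet(true, false)) {
                // release the underlying resource here
            }
        }
    }

    private final State state = new State();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

    public boolean isOpen() { return state.open.get(); }

    @Override public void close() {
        cleanable.clean();   // explicit path; also disarms the GC-driven backstop
    }

    public static void main(String[] args) {
        LogFile f = new LogFile();
        f.close();                       // the well-behaved caller's path
        System.out.println(f.isOpen()); // false
        // A caller who forgets close() is still covered: once the LogFile is
        // unreachable, the Cleaner eventually runs State.run() - but
        // "eventually" is not "promptly", which is the whole caveat.
    }
}
```

Note that Cleanable.clean() runs the action at most once, whether invoked explicitly or by the GC, so the explicit and backstop paths cannot double-free.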
Calling finalizers an anti-pattern only works to the extent that you can ban everything not controlled by GC'd world. Which would be great, except that you can't even do "hello world" with that constraint.
Nope. Memory that is indirectly allocated via FFI is not normally[0] accounted for by GC memory pressure and so it should be managed explicitly, not using finalizers. That memory is invisible to the GC. It won't know to collect it. It won't know when the foreign heap has allocated too much, and it won't know to run a more expensive GC collection to try harder when foreign space gets tight.
(If you have a relatively infinite amount of RAM, or your FFI object sizes are a constant factor of the size of their GC world counterparts and you account for this in your GC max heap size, you may get away with it. But these constraints aren't typical.)
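To make the "invisible to the GC" point concrete, here is a small simulation (the NativeBlock class and its byte tally are hypothetical stand-ins for an FFI wrapper, not a real API): the Java object itself is a few dozen bytes, while the foreign allocation it guards can be megabytes the collector never accounts for.

```java
import java.util.concurrent.atomic.AtomicLong;

public class NativeBlock implements AutoCloseable {
    // Stands in for the foreign heap's total usage - bookkeeping the
    // JVM's GC heuristics know nothing about.
    static final AtomicLong FOREIGN_BYTES = new AtomicLong();

    private final long size;
    private boolean freed;

    NativeBlock(long size) {
        this.size = size;
        FOREIGN_BYTES.addAndGet(size);   // invisible to GC memory pressure
    }

    @Override public void close() {
        if (!freed) { freed = true; FOREIGN_BYTES.addAndGet(-size); }
    }

    public static void main(String[] args) {
        // GC pressure tracks only the tiny Java wrapper, so relying on
        // finalization to free the 64 MB foreign block means the collector
        // has no reason to run in time. Explicit scoping does.
        try (NativeBlock b = new NativeBlock(64 << 20)) {
            System.out.println(FOREIGN_BYTES.get()); // 67108864
        }
        System.out.println(FOREIGN_BYTES.get());     // 0
    }
}
```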
> It can often be non-trivial to figure out exactly when to call close() on it, and by the time you've written such a system you've re-invented manual reference counting (error-prone) or a garbage collector of some sort anyway, and should have just used a finalizer.
You're right that it can be hard to figure these things out. But figure them out you must, for any long-lived program using scarce resources, or you're just creating a problem for the future.
Working these things out is not actually rocket science. It's what we did in the days before GC, and it was more tedious than difficult, because the best way to do it meant dogmatically following certain idioms when programming: being rigorous in your error handling and consciously aware of ownership semantics as data flowed around the program.
We even developed a bunch of strategies to make it simpler, e.g. arena allocation and stack marker allocation. You can adapt these approaches for deterministic resource disposal in GC environments too (e.g. keep a list of things to dispose, or markers on a stack of things to dispose).
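One of those adapted strategies - the "list of things to dispose" - can be sketched in a few lines of Java. The DisposalScope name is made up for illustration; the idea is an arena-like scope that closes everything registered with it, in reverse registration order like stack unwinding:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DisposalScope implements AutoCloseable {
    private final Deque<AutoCloseable> items = new ArrayDeque<>();

    // Register a resource; returns it so registration can be inlined
    // at the allocation site.
    public <T extends AutoCloseable> T track(T item) {
        items.push(item);
        return item;
    }

    // Dispose everything in reverse registration order, continuing
    // past individual failures so one bad close() can't leak the rest.
    @Override public void close() {
        while (!items.isEmpty()) {
            try { items.pop().close(); } catch (Exception ignored) { }
        }
    }

    public static void main(String[] args) {
        java.util.List<String> order = new java.util.ArrayList<>();
        try (DisposalScope scope = new DisposalScope()) {
            scope.track(() -> order.add("first"));
            scope.track(() -> order.add("second"));
        }
        System.out.println(order); // [second, first]
    }
}
```

The point is that deterministic disposal doesn't require per-object ceremony: resources created anywhere in the scope get registered once and released together.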
The biggest wins from GC are from two effects: memory safety and expression-oriented programming. Memory safety means you never get dangling pointers or dynamic type errors from reused memory, a major increase in reliability, as well as making certain types of lock-free programming much easier. Expression-oriented programming means you can safely and easily write functions that take complex values and return complex values without thinking too hard about the ownership of these complex values. This in turn lets you program in a functional style that is much harder without GC.
What GC doesn't give you is a world free of non-functional requirements. You still need to know about memory allocation of your big algorithms and program overall; you need to know where big object graphs get rooted for longer periods of time before dying (the middle age problem[1]), and you need to track the ownership of your resources rigorously, or you will run into resource exhaustion and non-deterministic failure modes - some of the worst kinds of failures.
[0] https://docs.microsoft.com/en-us/dotnet/api/system.gc.addmem...
[1] https://blogs.msdn.microsoft.com/ricom/2003/12/04/mid-life-c...
A 2002 technical report by Hans Boehm discusses these difficulties. http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html
> The thread must die by explicit programmatic action, which in turn will free its allocated block of stack memory.
Simply, no. A thread can recurse through some function activations and then hit an exit statement which terminates it without destroying those activations. Other threads can have pointers into that thread's stack, so the stack must not be reclaimed until they let go.
For instance, a thread allocates a reply box on the stack and calls some message-passing API to send a message, whereby it registers the reply box. That API now has a pointer into the thread's stack. Suppose the thread dies before unregistering the reply box. If that can happen, the stack has to stick around. When the API tries to reply to the thread and finds that the thread is dead, it can dequeue the reply box at that time; then the stack becomes unreachable. If the stack is reclaimed before then, this messaging API will be traversing bad pointers through the registered reply box.
> because other resources almost certainly have no necessary correlation with memory pressure
This is true and there is a way to regard finalization as decoupled from GC.
I have experience implementing an object system in which finalization is treated similarly to C++ destructors.
Objects can expect to have their finalizers explicitly invoked even when they are still live, long before they have become garbage.
One situation in which that happens is if an exception is thrown during the construction of an object.
I also have a scoped reclamation construct, with-objects, which implements RAII. For instance:
    (with-objects ((foo (new foo ...))
                   (bar (new bar ...)))
      ...)
The finalizers of foo and bar are invoked when with-objects terminates. Additionally, if foo throws during its construction, its finalizer is called, and if bar throws during its construction, both are called. Now if those finalizers have not been invoked by the time those objects become garbage, the GC will invoke them. Basically, a finalizer is invoked and then removed, so if it is called early, the GC doesn't call it again.
Situations in which it's undesirable or infeasible to use with-objects are covered by the call-finalizers API: a one-argument function which takes an object, and calls and removes its finalizer, allowing for completely manual finalization.
Of course, this is basically a logical extension of how with-open-file works in Common Lisp, formalized into the object system, integrated with finalizers.
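For readers coming from the Java side: try-with-resources gives essentially the same scoped semantics as the with-objects sketch above, including the partial-construction case - if the second initializer throws, the already-constructed first resource is still closed. The names below are hypothetical; this is an illustration of the language rule, not of any particular API.

```java
import java.util.ArrayList;
import java.util.List;

public class WithObjectsDemo {
    static final List<String> LOG = new ArrayList<>();

    static AutoCloseable make(String name, boolean failCtor) {
        if (failCtor) throw new IllegalStateException(name + " ctor failed");
        return () -> LOG.add("closed " + name);   // the "finalizer"
    }

    public static void main(String[] args) {
        try (AutoCloseable foo = make("foo", false);
             AutoCloseable bar = make("bar", true)) {   // throws here
            // never reached
        } catch (Exception e) {
            // foo was closed during unwinding; bar never came into existence
        }
        System.out.println(LOG); // [closed foo]
    }
}
```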
There is a tradeoff: timely reclamation versus the risk created by the possibility of premature reclamation. It's easy to make objects which can be safely used after their finalizer has been called; the risk isn't that of a dangling pointer that will blow up the show. Still, things can malfunction when objects "hollowed out" by finalization are relied upon to still be functional.
> For instance, a thread allocates a reply box on the stack and calls some message-passing API to send a message, whereby it registers the reply box. That API now has a pointer into the thread's stack. Suppose the thread dies before unregistering the reply box. If that can happen, the stack has to stick around.
OMG, can such a thing really happen and be correct in some language? The thing is, the stack is most often used as method-local storage for that method's variables and for objects that do not escape its scope and thus can be allocated on the stack instead of the heap.
Basically, what happens with the stack in a language like C or Java when, in thread T, some method A calls another method B is the following:

    T.stack.push(return address)
    T.stack.allocateFrame(B)

Then, when B has done everything it wanted to, it clears its frame from the stack and returns via the saved address into A. Similarly, when an exception is thrown, it crawls up the stack frame by frame, clearing them out until it finds a handler.

With that general scheme in mind, I see some contradictions in your example:
1. How exactly can a thread correctly die in such a way that the frame of the method that allocated the reply box on the stack is not cleaned up first?
2. Why would the method allocate a shared data structure on its own stack (which it must clear on returning to its caller) rather than on the heap?
3. If it did so because it blocks while waiting for the response to appear in the stack-allocated placeholder, then why would anyone consider a thread that is actively waiting for something dead and try to reclaim its stack?
Its earlier implementations were on RISC chips that had 24 or more GPRs, so the implementation was simple: two stacks, and divide the local registers in half between boxed and unboxed values. This obviously didn't work when porting to x86, which has far fewer registers.
The ARM port, I believe, uses the non-conservative approach, despite having one register fewer than x64 (the x64 port was derived from the x86 port, so it uses the same register assignments).
I fondly remember the separate address and data registers on the 68000. Why didn't they go back to this approach for x86_64 (16 registers), now that no one really cares about 32-bit x86?
1: Two stacks means two stack pointers and two frame pointers, leaving only 12 registers for values; it's also possible that the SBCL ABI uses a global register for something else as well, which would leave only 11. PowerPC is a really luxurious platform: you have 32 GPRs, so even if you use 8 of them for various bookkeeping purposes, that leaves 24 remaining, which is enough for pretty much everyone.
I hadn't spent a lot of time thinking about how much faster this is than malloc/free until a question came up the other day here on HN along the lines of "why would anyone dynamically allocate an object that is smaller than a cache line?" In Lisp, a commonly allocated structure is a cons cell, which is two pointers in size and thus smaller than a cache line. It would be very wasteful to malloc/free 8 (or 16) bytes in C, but with SBCL's allocator the throughput is approximately the same as stack-allocating them.
    while (1) { Integer n = new Integer(...); }   // vs.  while (1) { int n = ...; }
A C compiler would allocate the n variable on the stack, making the "allocation" completely free. But in a GC'd language, the n variable would be bump-allocated once per loop iteration. That wouldn't in itself be so costly, but every so often a GC cycle is needed as the garbage accumulates. Furthermore, in C the address of the n variable stays in place, while in a GC'd language it moves a bit on each iteration.
That is why escape analysis is a fruitful optimization. It takes heap-allocated objects and attempts to stack-allocate them, similar to what a C compiler would do.
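The pattern escape analysis targets looks like this (a minimal sketch; whether HotSpot actually elides the allocation depends on JIT heuristics and inlining, so treat the comments as the intent, not a guarantee):

```java
public class EscapeDemo {
    static long boxedSum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Integer box = i;   // heap allocation per iteration, in principle
                               // (small values hit the Integer cache)
            total += box;      // the box never escapes this frame, so escape
                               // analysis may replace it with a plain int
        }
        return total;
    }

    static long primitiveSum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i;        // pure register/stack arithmetic, no GC work
        }
        return total;
    }

    public static void main(String[] args) {
        // Same result either way; the difference is allocation behavior.
        System.out.println(boxedSum(1000) == primitiveSum(1000)); // true
    }
}
```

When the analysis succeeds, the boxed loop compiles down to the primitive one: no bump allocation, no extra GC cycles, and an address that stays put, just as in the C case.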
Consider, for example, a copying GC that copies using a depth-first traversal of the object graph and then running it over a tree that your program always processes in breadth-first order.
However, please note that our conservative GC was also a copying (though not fully copying) collector, and it too improved locality of reference. So the performance impact is not as large as it would be with a GC that provides no compaction at all.
Stack maps can be made a bit smaller by pushing all registers onto the stack at the start of GC and popping them afterwards. That way, you only need to record stack locations in the maps, not individual registers.
Btw, the article is really good. This is the kind of stuff I keep coming back to HN for!
https://bugs.openjdk.java.net/browse/JDK-5014723
Good point about inspecting a thread's call stack. Indeed, with the conservative GC we had a problem: many popular profilers were incompatible with JET because they inspect threads at safepoints, and we had none. When we implemented the precise GC, this problem disappeared, which was an additional benefit for us. However, there are alternative ways to gather a thread's call stack outside of safepoints. We use them to avoid safepoint bias in our profiler. You can read more about it here:
https://www.excelsiorjet.com/blog/articles/portable-profiler...
----
Thanks for kind words! I'm glad that you liked the post!
By comparison, the CLR is a much worse fit, because value types and stack/inline/fixed arrays mean false positives would be much higher for some applications.
E.g., for each frame the compiler lays out roots first, then other primitives. Then, on entering the frame, it writes the number of roots to the stack. When the GC walks the stack, it can see precisely which slots are roots.
Delphi has a variety of reference counted types. Originally just strings, but later dynamic arrays and COM-style interfaces, all use automatic reference counting. Assignment and copying of these types are done via runtime calls, not just for variables of these types, but also structures and static arrays, recursively, that contain these types.
When allocating locals of these types, and of types that contain these types, the compiler also writes out an RTTI structure describing the stack layout, a bit like the activation record was an actual Pascal record. This RTTI is used to correctly decrement references when the stack is popped in normal case, or during exception unwinding.
The RTTI scheme is smart enough to encode several variables of the same type as being like a static array of that type, etc.
All this doesn't help with liveness, of course, which will still be a problem for the code presented in the article. In fact the efficiency of the encoding is in direct opposition to liveness; encoding things as contiguous blocks of pointers will most likely artificially extend their lifetime to the whole call frame.
It's a really really neat paper that I've been itching to implement for a while.
In the 32-bit age, you ran into problems more and more as your heap approached the GB range. At some point, the probability that you end up with a false root keeping a lot of garbage alive goes to 1.
In the 64-bit age we get a respite, although many systems don't really use 64-bit pointers.
https://blog.plan99.net/modern-garbage-collection-911ef4f8bd... https://news.ycombinator.com/item?id=13218550