Real-Time Garbage Collection Is Real (opens in new tab)

(michaelrbernste.in)

119 pointsMongoose13y ago55 comments

55 comments

37 comments · 10 top-level

mcartyem13y ago· 11 in thread

There have been a lot of posts related to garbage collection lately but none of them touched upon what I see as a crucial issue: why is garbage collection needed to begin with?

Could you do without it? What is the key point that made it necessary?

I'm aware of it being introduced by McCarthy in the original Lisp paper in 1960 of course. But I suspect what McCarthy originally meant is not what garbage collection turned out to be. What I suspect he meant was that there needs to be a way for memory to be managed. malloc/free offer a way for memory to be managed, and presumably they weren't invented until C was nine years later. What McCarthy might have meant is what became malloc/free in C, which doesn't need garbage collection.

C isn't the only flag here. Was there any OS on the IBM 704 used to implement the original Lisp? Did the OS support multiprocessing? Because if it didn't (UNIX wasn't invented until 1969 either) it would make sense for memory to be available for a single process. And it would mean when people said garbage collection they were envisioning malloc/free.

(Also, databases and operating systems can live without whatever makes garbage collection necessary, since they don't use it, and those are pretty complex and fast pieces of software.)

So, what makes garbage collection different than malloc/free, and why is it necessary? I'd love to learn more about that.

jerf13y ago

I think one of the keys to understanding garbage collection is to understand that it is on a continuum of memory management techniques, and the line is a great deal less bright and shining than people often realize. malloc/free is "manual memory management", right?

Well, not really. Truly manual memory management is getting an enormous block of memory from the OS, and fully manually choosing what goes where within that block of memory. This is indeed a real technique used when the situation is dire enough. If you're not doing that, you're deferring something to your automated solution, the only question is, how much?

malloc/free is a combination that still gives you lots of power, but it's not perfect when you start pushing the abstraction hard enough. All malloc/free combinations have some sort of pathological case, where whatever allocation strategy they are choosing is wrong for some reason. That's why there isn't just one implementation, there's many, and some applications can see big wins switching, while others may see losses.

Garbage collection isn't really some sort of binary dichotomous decision vs. malloc/free; both are forms of automated memory management. Garbage collection techniques just step it up, and try to infer when you are done with memory. The disadvantage is, they may not be as smart as a human, the advantage is that they're a heck of a lot more consistent and pay much better attention. Then, even within "garbage collection", you've got a gradient; you may see a language like Rust with mostly deterministic memory management, but with an easy call-out to GC if you need it. You may see a language like Perl, with relatively consistent finalization as soon as an unshared reference leaves a scope, or you may see something that only collects during sweeps. At the far end you get imprecise garbage collectors, such as those used in Go right now (though they are working on it, and have already made a lot of progress), so even within the realm of GC there's a range of precision vs. speed vs. nuances the programmer needs to know to use them.

GC is necessary because one particular point on this large spectrum isn't the right choice for every program. It isn't even the maximally manual solution. In fact, there's even some middle grounds between malloc/free and fully manual memory management, such as arena allocation. There's a lot of fine gradation on this scale.

"(Also, databases and operating systems can live without whatever makes garbage collection necessary, since they don't use it, and those are pretty complex and fast pieces of software.)"

A bold statement. Have you ever heard the term "compaction" used within the context of a database? That's a form of garbage collection. Contrary to what you said, almost all databases have some form of garbage collection. (Probably all the ones you're thinking about.)

As for whether operating systems have garbage collection, it depends on your precise definition of "operating system". Kernels may lack it, but as critical as they are, and as much sheer staggering code they may have for drivers and stuff, conceptually they actually aren't that complicated, compared to a lot of userland things. Broaden your definition and you'll find some form of garbage collection appear again. And if you include "reference counting" approaches as a form of garbage collection, the Linux kernel uses that all over the place. Is reference counting a form of garbage collection? Well, it can actually be a tough call... because it's all on a continuum.

mcartyem13y ago

(thank you for the willingness to write such a detailed response)

There exists a point on the memory management continuum where management starts being more automatic (gc) than manual (malloc/free). I would like to understand the forces surrounding this specific point, right before the scale tips towards automatic.

If you tried to build a dynamic language without automatic management, what would break and why?

2 more replies

dllthomas13y ago

"[T]he advantage is that they're a heck of a lot more consistent and pay much better attention."

This implies that the only reason for garbage collection is programmer fallibility. However, I think there are some cases where there is literally no better way of knowing when everyone is done with a piece of memory than by checking (closures strike me as an area this can happen pretty quick), at which point the only solution is automated garbage collection - be it provided by the runtime, a library, or written by the programmer for the particular program.

malkia13y ago

> This is indeed a real technique used when the situation is dire enough.

Like console video games. Rarely there is malloc/free type of allocations, and most of them are for some third-party API, and often there are ways to avoid it... or the system itself (sometimes C CRT would do that beneath you, sometimes obvious - strdup, sometimes almost - fopen/fclose, and sometimes unexpected - for example certain *printf functions might do it).

betterunix13y ago

"why is garbage collection needed to begin with?"

Well, think about this: what would a declarative language like Prolog look like if you had to manually deallocate memory? At a high enough level of abstraction, it does not make any sense to manually deallocate memory; it may not even make sense to speak about "memory" at all for some abstractions.

"So, what makes garbage collection different than malloc/free, and why is it necessary? I'd love to learn more about that."

It is the same as the difference between having a compiler generate the instruction sequence for setting up stack frames, and writing those instructions yourself in assembly language. As a programmer, your time is probably better spent on higher-level things than how stack frames are being set up. Likewise with allocating and deallocating memory: you will probably be more productive if you spend your time on something else, like the design or logic of your program.

Remember, programming languages exist for programmers, to make us more productive. We are able to do more when we are not being distracted by things like figuring out when it is safe to deallocate memory or ensuring that we do not create memory leaks.

seanmcdirmid13y ago

I don't believe Lisp was even viable without garbage collections given that lists are created and shared left and right, not to mention closures. Sure, McCarthy could have used "malloc," but making the right "free" calls would have been impossible.

Garbage collection is the main reason why lambdas were slow to be added to C++, as the objects they close over have indefinite lifetimes.

> Because if it didn't (UNIX wasn't invented until 1969 either) it would make sense for memory to be available for a single process. And it would mean when people said garbage collection they were envisioning malloc/free.

This doesn't make any sense, sorry. Manual memory management works well when objects have definite lifetimes, and fails miserably when they don't.

Jach13y ago

> why is garbage collection needed to begin with?

Computers don't have infinite memory. If they did, there would be no need for garbage collection. So if your program needs more memory than the system has, you either need to buy more memory, or figure out areas of the existing memory space that can be overwritten, or rewrite your program to use less memory. (Some embedded systems developers never use free() because they know their code is the only running code and know it probably won't use all the device's memory.) You can manually figure out and mark memory that can be overwritten by using free(), which can be error-prone if you accidentally mark a block of memory as available that shouldn't be. Or you can let the programming language runtime figure it out itself according to whatever algorithm (such as, simplistically, "this block of memory won't ever be accessed again in the running code, we can reuse its memory"). Garbage collection algorithms can be a lot more sophisticated and use memory-sharing and other things, but the point is to make memory usage efficient and not something the programmer necessarily needs to worry about.

malkia13y ago

> What is the key point that made it necessary?

Returning complex objects, without requiring them to be manually freed by whoever picked them (caller).

Also in certain languages - garbage collection actually allows saving of memory. If all strings are immutable, then "one", "one" and "one" might be stored in one place, and whoever asked for them would get the same place (you'll also get the benefit of using only pointer comparison, instead of strcmp). For this gc is needed - e.g. you return things, but you don't specify who owns them. It's just that much later an "agency" starts looking from the certain roots to track down which things are still owned, and which not.

For example - cherry tree - start from the root - all cherries on the tree are owned, all on the ground are no longer - they can be collected. Poor analogy, but it might work for some people.

One more point - if you have 1,000,000 mallocs and then frees then this might be much slower than having 1,000,000 allocations through a garbage collector and freeing them.

In the former case there is this form of "micro-management" - e.g. you have manually micro managed every allocation out there, and then you've had manually micro managed to free it.

Instead the garbage collector might free everything in few sweeps. Also the allocation with garbage collector can be sometimes as simple as moving a pointer ahead.

In the past CGI services were notorious for being slow when shutting down the process. A big portion of this was freeing the memory. In fact you don't need this, when a process is shutdown, no memory needs to be freed - e.g. this is a bit like ad-hoc garbage collection.

Another example was Pascal - there were mark/release regions. You can mark something, and then with release you would free all blocks at once allocated after the mark.

Such things are useful, when you don't want to suffer from too much malloc/free, but would like to keep sane non-gc collection scheme.

And lastly reference counting - it's another form of garbage collection/manual freeing, and it's used quite a lot. There is one problem - most of the implementations cannot handle cycles, and all of them suffer the penalty of dealing with the internal counter, which is even worse when more CPU's have to touch it.

fauigerzigerk13y ago

It's the ownership problem that makes garbage collection useful. Tracking who owns which piece of memory at a particular point in time is a lot of work. Every API has to document if and how an object returned as a pointer should be freed and if an incoming pointer continues to be owned by the caller or not. Every single library invents its very own memory management pattern and you have to keep them all in your head or bad things will happen.

Look at libpq for instance. There are several functions that free memory and destroy different types of objects: PQclear, PQfreemem, PQfinish, PQconninfoFree and maybe others that I forget. For some API functions the documentation explicitly states which free function to use, for others it says nothing and for yet others it says not to free anything because another object retains ownership.

C++ tries to solve the problem that libpq has by using RAII and destructors, but it opened a can of worms. Now you have to deal with tons of different types of smart pointers and fancy memory management patterns coming from different libraries and frameworks. And you must never think that smart pointers are really pointers because if you do, bad things will happen:

Can you return the this pointer from a member function? Not if someone else holds a shared_ptr to this. Can you return shared_ptr<ThisClass>(this)? No, because now you may have two different reference counts for the same object. Can you return shared_from_this() after inheriting enable_shared_from_this? Only if at least one shared_ptr already points at this, which depends on how the caller created this. Is SomeClassPtr a SomeClass*, a shared_ptr<SomeClass> or maybe a boost::intrusive_ptr<SomeClass>? Go find the typedef.

So garbage collection is desirable. Unfortunately it seems to be a very difficult problem to solve in the face of large amounts of memory and lots of parallelism, which is exactly where we're headed. I wish Oracle would buy Azul and make their garbage collector available in OpenJDK. The world would be a better place and I would forgive them all the nasty stuff they have done recently :)

bmm6o13y ago

free is just a way for you to tell the C runtime that you are done using a piece of memory and that it may be re-used to handle another malloc request. Of course, the programmer has to know that he is done with a particular piece of memory, and it can be very difficult to determine at what point that is the case. GC just removes the need to call free - some algorithm runs around and determines what memory you are done with, and returns it to the free store for you. In essence, GC allows you to operate under the illusion that you have unlimited memory.

[ignoring destructors/finalizers for now]

0x013y ago

Well, if you have data that can be referenced from more than one spot (for example, an object pointer that is a member of multiple lists), you need avoid calling free until the last reference is gone. One way to track that is to use an incrementing and decrementing reference counter, but then you end up leaking memory as soon as you get cyclical references. A garbage collector solves the problem of cyclical references.

In some cases a garbage collector can turn out to be more efficient too, since there is no need to spend time carefully managing reference counts! :)

rayiner13y ago· 5 in thread

Studying garbage collection is a wonderful education in algorithm engineering. Despite decades of work, there is no "best" GC algorithm. Instead, there are different points on the space of optimizing for space/throughput/latency/completeness/etc. Moreover, the various algorithms are linked by deep correspondences (e.g. Bacon's result that all collectors lie on a spectrum between pure tracing and pure reference counting, and that things like generational collection are hybrids.)

snprbob8613y ago

Saving somebody 10 seconds, here is Bacon's paper:

http://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf

apu13y ago

Adding the initial http links it automatically:

http://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf

2 more replies

profquail13y ago

The link is returning a 403 now (perhaps too much traffic from this article).

For anyone that still wants to read it, here's a citation -- you can easily find a copy by searching for the title of the paper.

  Bacon, David F., Perry Cheng, and V. T. Rajan.
  "A Unified Theory of Garbage Collection."
  Convergence 6, no. 17 (2004): 32.

Someone13y ago

I agree that studying garbage collectors is a nice pastime, but one can say the same of many other kinds of algorithms. Best memory allocator? Best thread scheduler? Best file system? Best instruction set? Best programming language? Best editor (just kidding)?

All have seen decades of work, and we have some heuristics that we use to gauge quality, but there is no single quality measure. Also, studying each of them is rewarding.

seanmcdirmid13y ago

> Best editor (just kidding)?

Writing your own editor can be as much fun as writing your own garbage collector.

The "garbage collection" book is the only physical one I still own (everything else given away and in e-form now). It is a fascinating book, and gives you insights into PL run-time systems that you wouldn't get as easily studying other algorithms.

2 more replies

a-priori13y ago· 4 in thread

While the "Metronome" has very predictable behaviour that makes it probably the best GC collector for real-time purposes, it still has a maximum GC load before it gets backed up. If it gets backed up too far... forget about timing guarantees because the system will fail. The "Metronome" collector can guarantee a known and tunable GC capacity over time (in terms of objects collected/sec), which is good. But the flip side is that you need to be able to guarantee that your application will never exceed that capacity, at least not for any sustained period of time.

In order to provide hard real-time guarantees in a garbage collected system, you need to know that there is no situation in which the system produce more garbage faster than the collector can collect. With manual deallocation, you can prove that with static analysis. With garbage collection you have to demonstrate it empirically using dynamic analysis. That requires exhaustive testing to make sure you've covered the worst-case scenario.

dllthomas13y ago

I don't see why you could not, in principle, prove it statically.

a-priori13y ago

Maybe there are tools that will do that. My knowledge of the industry is about 5 years out of date, and I was no expert even then. But we had no such solution and, in fact, didn't use dynamic memory allocation at all nevermind newfangled gizmos like garbage collection.

In principle, yes I think it's possible... at least, I don't think it reduces to the halting problem. But it would be tricky. It would be relatively simple to reason statically about the rate of memory allocation (iterate through all paths leading to a 'new' operator), but for this purpose you care about cases where an object becomes garbage and can be deallocated. That occurs when the last reference to the object is overwritten or goes out of scope, which is not so easy to determine in the presence of aliasing.

3 more replies

hga13y ago

Manual deallocation of a heap???

a-priori13y ago

Sorry, I don't follow.

1 more reply

JulianMorrison13y ago· 3 in thread

I wonder if this could be improved also by "time stealing" - if the mutator is idling, it waives its slice, if the GC doesn't expect to collect much, it waives its slice. The result would be more irregular but still able to give guarantees.

jjs13y ago

Should it be improved this way?

A hard real-time system can fail by running a piece of code too soon as well as too late.

OTOH, a soft real-time program like a video game or voice chat could profit from this.

a-priori13y ago

What jjs says is true: this 'time stealing' is non-deterministic, which makes it a no-go for hard real-time systems.

For soft real-time systems, it would be a good idea. It would improve the average-case performance and/or power consumption while still providing the same worst-case guarantees.

wtracy13y ago

Even in a hard real-time system, allowing the garbage collector to "catch up" whenever the mutator is idling sounds reasonable. It should make the system slightly more robust in the face of irregular heap usage.

1 more reply

FooBarWidget13y ago· 2 in thread

Finally a good use of the word "real-time". This is what real-time means, not web apps that stream data over WebSockets.

scotth13y ago

Can't it mean both?

lttlrck13y ago

Yes, providing it's appropriately prefixed with Soft or Hard...

2 more replies

tinco13y ago· 1 in thread

The author starts the piece enthusiastically marvelling at the fact that Real Time Garbage Collectors exist, but the article doesn't go very deep into how this particular one does it.

I myself was a bit disappointed when I read the limitations, which reveal that the simple laws still hold, you can't make these guarantees without exactly knowing the upper limit of the amount of memory you are going to allocate.

In the event that you design a real time system that dynamically allocates and deallocates objects, wouldn't it be almost or just as easy to implement manual memory management (through pools or whatnot) as it would be correctly identify the maximum amount of allocated memory?

eru13y ago

I guess that depends on your tools. Perahps you have an automatic tool to estimate your memory usage? Of course, programmes amenable to this kind of automatic analysis must be written in a particular style---because in general you can prove anything about arbitrary programmes---but that style might still be easier to bear than managing your own memory.

justncase8013y ago· 1 in thread

Would this also be deterministic? That seems like the fatal flaw with most current GC's.

a-priori13y ago

Depends what you mean by 'deterministic' I suppose. Basically the system they're describing assigns a fixed time slice to the collector. In that way, yes it is deterministic in that the application is guaranteed to have the remaining processor time.

What isn't deterministic is the load on the collector. That is determined both by the rate at which your application generates garbage as it performs its work, and how that garbage generation lines up with the collector's time slices.

Under normal circumstances that load should have no affect on your application's behaviour or response times. But unlike a regular collector, this one will not degrade gracefully (becoming steadily slower as GC load increases): it will work 100% normal until it reaches a breaking point, at which time it will fail catastrophically.

pron13y ago

There are actually quite a lot of hard real-time, mission critical systems (mostly defense) using RTSJ (real-time specification for Java) implementations in production, but I don't know how many make use of realtime GC (RTSJ allows for a semi-manual memory management, much simpler than malloc/free but not as easy as a full GC). Some RTSJ implementations have a realtime GC, like IBM's WebSphere Real Time (http://www-03.ibm.com/software/products/us/en/real-time/) -- that's the one using Metronome -- and Aicas Jamaica VM (http://www.aicas.com/). Sun/Oracle also had an RTSJ JVM with a realtime GC (said to be better than Metronome), but it seems to have been discontinued recently.

the847213y ago

Or you could go for pauseless garbage collection, then you only have to concern yourself with the collector throughput keeping up with the allocations.

http://paperhub.s3.amazonaws.com/d14661878f7811e5ee9c43de884...

jwatte13y ago

I'm not sure I agree work based allocators are out. Specifically, if you follow/mark N references for each M words allocated, you can guarantee to not fall behind and still have a strict upper bound on collection cost/run-time jitter. This adds a linear-in-size cost to memory allocation, which already typically has an amortized linear cost (because you touch all memory allocated) so it's analytically very well behaved.

j / k navigate · click thread line to collapse

55 comments

37 comments · 10 top-level

mcartyem13y ago· 11 in thread

There have been a lot of posts related to garbage collection lately but none of them touched upon what I see as a crucial issue: why is garbage collection needed to begin with?

Could you do without it? What is the key point that made it necessary?

(Also, databases and operating systems can live without whatever makes garbage collection necessary, since they don't use it, and those are pretty complex and fast pieces of software.)

So, what makes garbage collection different than malloc/free, and why is it necessary? I'd love to learn more about that.

jerf13y ago

"(Also, databases and operating systems can live without whatever makes garbage collection necessary, since they don't use it, and those are pretty complex and fast pieces of software.)"

mcartyem13y ago

(thank you for the willingness to write such a detailed response)

If you tried to build a dynamic language without automatic management, what would break and why?

2 more replies

dllthomas13y ago

"[T]he advantage is that they're a heck of a lot more consistent and pay much better attention."

malkia13y ago

> This is indeed a real technique used when the situation is dire enough.

betterunix13y ago

"why is garbage collection needed to begin with?"

"So, what makes garbage collection different than malloc/free, and why is it necessary? I'd love to learn more about that."

seanmcdirmid13y ago

Garbage collection is the main reason why lambdas were slow to be added to C++, as the objects they close over have indefinite lifetimes.

This doesn't make any sense, sorry. Manual memory management works well when objects have definite lifetimes, and fails miserably when they don't.

Jach13y ago

> why is garbage collection needed to begin with?

malkia13y ago

> What is the key point that made it necessary?

Returning complex objects, without requiring them to be manually freed by whoever picked them (caller).

For example - cherry tree - start from the root - all cherries on the tree are owned, all on the ground are no longer - they can be collected. Poor analogy, but it might work for some people.

One more point - if you have 1,000,000 mallocs and then frees then this might be much slower than having 1,000,000 allocations through a garbage collector and freeing them.

In the former case there is this form of "micro-management" - e.g. you have manually micro managed every allocation out there, and then you've had manually micro managed to free it.

Instead the garbage collector might free everything in few sweeps. Also the allocation with garbage collector can be sometimes as simple as moving a pointer ahead.

Another example was Pascal - there were mark/release regions. You can mark something, and then with release you would free all blocks at once allocated after the mark.

Such things are useful, when you don't want to suffer from too much malloc/free, but would like to keep sane non-gc collection scheme.

fauigerzigerk13y ago

bmm6o13y ago

[ignoring destructors/finalizers for now]

0x013y ago

In some cases a garbage collector can turn out to be more efficient too, since there is no need to spend time carefully managing reference counts! :)

rayiner13y ago· 5 in thread

snprbob8613y ago

Saving somebody 10 seconds, here is Bacon's paper:

http://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf

apu13y ago

Adding the initial http links it automatically:

http://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf

2 more replies

profquail13y ago

The link is returning a 403 now (perhaps too much traffic from this article).

For anyone that still wants to read it, here's a citation -- you can easily find a copy by searching for the title of the paper.

  Bacon, David F., Perry Cheng, and V. T. Rajan.
  "A Unified Theory of Garbage Collection."
  Convergence 6, no. 17 (2004): 32.

Someone13y ago

All have seen decades of work, and we have some heuristics that we use to gauge quality, but there is no single quality measure. Also, studying each of them is rewarding.

seanmcdirmid13y ago

> Best editor (just kidding)?

Writing your own editor can be as much fun as writing your own garbage collector.

2 more replies

a-priori13y ago· 4 in thread

dllthomas13y ago

I don't see why you could not, in principle, prove it statically.

a-priori13y ago

3 more replies

hga13y ago

Manual deallocation of a heap???

a-priori13y ago

Sorry, I don't follow.

1 more reply

JulianMorrison13y ago· 3 in thread

jjs13y ago

Should it be improved this way?

A hard real-time system can fail by running a piece of code too soon as well as too late.

OTOH, a soft real-time program like a video game or voice chat could profit from this.

a-priori13y ago

What jjs says is true: this 'time stealing' is non-deterministic, which makes it a no-go for hard real-time systems.

For soft real-time systems, it would be a good idea. It would improve the average-case performance and/or power consumption while still providing the same worst-case guarantees.

wtracy13y ago

1 more reply

FooBarWidget13y ago· 2 in thread

Finally a good use of the word "real-time". This is what real-time means, not web apps that stream data over WebSockets.

scotth13y ago

Can't it mean both?

lttlrck13y ago

Yes, providing it's appropriately prefixed with Soft or Hard...

2 more replies

tinco13y ago· 1 in thread

The author starts the piece enthusiastically marvelling at the fact that Real Time Garbage Collectors exist, but the article doesn't go very deep into how this particular one does it.

eru13y ago

justncase8013y ago· 1 in thread

Would this also be deterministic? That seems like the fatal flaw with most current GC's.

a-priori13y ago

pron13y ago

the847213y ago

Or you could go for pauseless garbage collection, then you only have to concern yourself with the collector throughput keeping up with the allocations.

http://paperhub.s3.amazonaws.com/d14661878f7811e5ee9c43de884...

jwatte13y ago

j / k navigate · click thread line to collapse