undefined | Better HN

story

0 pointsadgjlsfhk13y ago0 comments

The blog post is largely incoherent for several reasons.

1. It recommends 8 byte counters. This implies making every object in your programming language 8 bytes bigger which is pretty much a non-starter. Almost everyone that actually uses reference counting uses more like 2-4 bits.

2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).

3. The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)

4. Because no one uses 8 bytes for the count (since you don't want to double your memory consumption), you still need a GC anyway to deal with objects that get too many references.

0 comments

im3w1l3y ago

> 2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).

The saving grace for the reference counter is that this is deterministic. The pause is an exact function of the allocation / deallocation pattern which is under the programmers control. So the code can be written in such way that it avoids big delays at inopportune times.

kaba03y ago

I'm not sure it's true -- just think about your browser displaying this page. Which DOM elem allocations are alive and which are not? The object graph depends on runtime data, and runtime data only. Some pattern may exist for a given program, but that is just as actionable under a tracing GC.

im3w1l3y ago

Implementing a runtime is that runs other peoples code is a special case of extreme unpredictability.

OskarS3y ago

> Almost everyone that actually uses reference counting uses more like 2-4 bits.

Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:

1. Rust, in Rc<T> and Arc<T>

2. C++, in std::shared_ptr<T>

3. Objective-C, in NSObject's retainCount

4. Swift, because of Objective-C

5. Python, where reference count is Py_ssize_t

These were literally the first languages i thought of, they all use 64-bit types. I would argue since reference counting is much rarer than GC, these make up the bulk of reference counting use in the real world. "Almost everyone", huh? It's a bit rich accusing the author of being "almost incoherent" and then say this.

> Reference counting can have arbitrarily long pauses as well.

Deallocating can take arbitrarily long, but deallocation does not stop-the-world. It stops the current thread, which is a huge difference. Malloc can take arbitrarily long as well, it's not like it's wait-free.

In addition, the GC pauses in regular programming languages are frequently orders of magnitude longer than deallocation. You would have to deallocate an enormous tree of object at the root for this to be an issue. And GCs have to do that as well, in addition to their regular stop-the-world pauses. This argument is just irrelevant.

> The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)

First off, this is not why Python has a GIL, but lets leave that aside.

Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.

> you still need a GC anyway to deal with objects that get too many references.

This is just silly. In languages which has both reference counting and traditional garbage collection (e.g. Python), they do it to avoid reference cycles, not because objects get "too many references". That is a ridiculous statement.

In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:

https://instagram-engineering.com/dismissing-python-garbage-...

I think it's still an open question if reference counting is always faster than GC, but I do not believe you have the technical expertise to evaluate such a claim. Your comment is four paragraphs long and riddled with factual errors. If you want to be convincing, show data that is better than that Instagram case study.

Sinidir3y ago

>First off, this is not why Python has a GIL, but lets leave that aside. Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.

This is actually part of why Python still has the GIL. A GILECTOMY was attempted and multithreaded atomic refcounting made things a lot slower (going up with the number of threads) and even other methods were not sufficient for performance.

https://www.youtube.com/watch?v=pLqv11ScGsQ

kaba03y ago

> Deallocating can take arbitrarily long, but deallocation does not stop-the-world

And modern GCs only have to stop the current thread to check for which thread-local objects are alive. The alice ones are moved to another generation, making the space reusable for free.

And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.

Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.

OskarS3y ago

> The alice ones are moved to another generation, making the space reusable for free.

It's pretty hilarious to me that you first say "they have to move it to another generation" and then you say "it's free!" It's like saying "I paid for my dinner, and now I get to eat it for free!"

Also: what do you think `free()` does? All modern memory allocators do this trick, keeping thread-local caches. This is not an advantage of GCs when reference counting does it as well.

Almost all modern GCs are stop-the-world in at least phases, and for good reason: stop-the-world GCs are higher performance. You pay in other ways for skipping that stop.

> And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.

Because I've actually benchmarked it: https://quick-bench.com/q/ISEetAHOohv-GaEuYR-7MajJgTc

18.5 nanoseconds fits under no reasonable definition of "crazy expensive", not when a regular increment clocks in at 5.9 nanoseconds. And there is extremely few situations where you increment a reference count more than, like, 5 times. It's just not an issue.

This is like cargo cult programming: you've been told these things and never tested them in the real world, and you have all these preconceived notions that just don't stand up to two minutes of scrutiny.

> Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.

Yes, of course garbage collectors are easier to use than reference counting. Nobody has ever disputed this. That is the whole raison d'etre of garbage collectors. This is not what the discussion is about, it's about performance.

I'm done with this thread now, unless anybody can show me any actual data to back anything you say up. It's really tiring. I started this by saying "I don't actually know", and everyone replying to me has been so darn certain of everything they say while being outright incorrect about most things, and without any actual data to back up the rest.

Sinidir3y ago

>Because I've actually benchmarked it: https://quick-bench.com/q/ISEetAHOohv-GaEuYR-7MajJgTc 18.5 nanoseconds fits under no reasonable definition of "crazy expensive", not when a regular increment clocks in at 5.9 nanoseconds. And there is extremely few situations where you increment a reference count more than, like, 5 times. It's just not an issue.

Congratulations. You tested a construct meant for multicore/threading in a single threaded benchmark and then marvel at the low overhead.

Of course you will only start seeing the cost if there is actually contention to operate on the value between multiple threads running simultaniously. See.: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...

hayley-patton3y ago

You'd get very different results doing atomic operations on counters shared between multiple threads, and with other stuff to do that causes your CPU to serialise and run less out-of-order. That single-threaded benchmark is data, but it doesn’t appear awfully relevant to how naive RC barriers perform versus barriers for coalesced RC or tracing.

Indeed the issue with measuring barriers is that measuring the barrier doesn't suffice; one wants to measure how the barrier affects the rest of execution. This entails coming up with programs that are much less trivial than repeatedly incrementing a counter.

kaba03y ago

Moving to another generation can be done completely asyncronously on another thread that likely doesn't do any useful work on a modern, highly parallel hardware. `free` doesn't do much, but `malloc` does -- with the method I am talking about (TLAB in the JVM), you get as fast allocations as it gets, it's nothing more than a NON-ATOMIC pointer bump. Meanwhile malloc has to find an empty space that can fit the object at hand.

> > Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form. > Yes, of course garbage collectors are easier to use than reference counting. Nobody has ever disputed this. That is the whole raison d'etre of garbage collectors. This is not what the discussion is about, it's about performance.

I am talking about performance exactly. Java's GC will smoke the hell out of C++'s shared pointers and Rust's (A)RC. Noone said anything about productivity/ease of usage.

And as mentioned by another commenter - your benchmark didn't take into account anything related to parallel execution, which would be the point.

pclmulqdq3y ago

From experience, your benchmark is not correctly estimating how long a non-atomic increment takes. Run 10000 of them back to back inside each benchmark iteration (atomic and non-atomic), and then runcontested.

You will see that the counter increment is about 2-5 cycles, a few hundred ps, and the atomic is on the order of 10 ns uncontended.

If you then introduce contention and a multi-socket setup, the atomic will slow down significantly. Only one thread can touch it at a time, so they have to take turns.

adgjlsfhk1OP3y ago

even ignoring the fact that this isn't counting contention, 18ns is still really high overhead since it's an operation you have to do twice to read any object. a read from l1 cache is about 3 cycles (1-3 ns), so 36 ns to increment and decrement a counter is far from trivial.

barsonme3y ago

Is your benchmark running concurrently? Or single-threaded?

hayley-patton3y ago

> That is a ridiculous statement.

If you do use limited counts, you will need a backup tracing collector; and limited counts are appealing because they save space (and most objects tend to only have a few references [1]).

> In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:

...because generations in the GC are represented as linked lists, and modifying those linked lists reduces the number of page faults, in turn because pages are shared by copy-on-write between process. That representation isn't the best idea already, and shared structure should be in the oldest generation anyway, as you don't want to collect it (as SBCL does for images, which are loaded with copy-on-write).

[1] http://users.cecs.anu.edu.au/~steveb/pubs/papers/rc-ismm-201... page 4

saagarjha3y ago

Objective-C stores the retain count in the tops bits of an object pointer. The hardware it runs on has been specially tuned to make atomic reference counting faster.

kgeist3y ago

>C++, in std::shared_ptr<T>

In C++, every shared_ptr also allocates a separate "control block" in the heap, so performance is even worse than that.

int_19h3y ago

std::shared_ptr is not meant to be used as the default memory management technique, hence why it's not that important.

But it does mean that C++ is not a good example here. I would argue that Rust isn't, either, since ARC is also an occasional opt-in there rather than the default.

j / k navigate · click thread line to collapse

0 comments

im3w1l3y ago

kaba03y ago

im3w1l3y ago

Implementing a runtime is that runs other peoples code is a special case of extreme unpredictability.

OskarS3y ago

> Almost everyone that actually uses reference counting uses more like 2-4 bits.

Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:

1. Rust, in Rc<T> and Arc<T>

2. C++, in std::shared_ptr<T>

3. Objective-C, in NSObject's retainCount

4. Swift, because of Objective-C

5. Python, where reference count is Py_ssize_t

> Reference counting can have arbitrarily long pauses as well.

First off, this is not why Python has a GIL, but lets leave that aside.

> you still need a GC anyway to deal with objects that get too many references.

https://instagram-engineering.com/dismissing-python-garbage-...

Sinidir3y ago

https://www.youtube.com/watch?v=pLqv11ScGsQ

kaba03y ago

> Deallocating can take arbitrarily long, but deallocation does not stop-the-world

And modern GCs only have to stop the current thread to check for which thread-local objects are alive. The alice ones are moved to another generation, making the space reusable for free.

And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.

Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.

OskarS3y ago

> The alice ones are moved to another generation, making the space reusable for free.

It's pretty hilarious to me that you first say "they have to move it to another generation" and then you say "it's free!" It's like saying "I paid for my dinner, and now I get to eat it for free!"

Also: what do you think `free()` does? All modern memory allocators do this trick, keeping thread-local caches. This is not an advantage of GCs when reference counting does it as well.

Almost all modern GCs are stop-the-world in at least phases, and for good reason: stop-the-world GCs are higher performance. You pay in other ways for skipping that stop.

> And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.

Because I've actually benchmarked it: https://quick-bench.com/q/ISEetAHOohv-GaEuYR-7MajJgTc

> Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.

Sinidir3y ago

Congratulations. You tested a construct meant for multicore/threading in a single threaded benchmark and then marvel at the low overhead.

hayley-patton3y ago

kaba03y ago

I am talking about performance exactly. Java's GC will smoke the hell out of C++'s shared pointers and Rust's (A)RC. Noone said anything about productivity/ease of usage.

And as mentioned by another commenter - your benchmark didn't take into account anything related to parallel execution, which would be the point.

pclmulqdq3y ago

You will see that the counter increment is about 2-5 cycles, a few hundred ps, and the atomic is on the order of 10 ns uncontended.

If you then introduce contention and a multi-socket setup, the atomic will slow down significantly. Only one thread can touch it at a time, so they have to take turns.

adgjlsfhk1OP3y ago

barsonme3y ago

Is your benchmark running concurrently? Or single-threaded?

hayley-patton3y ago

> That is a ridiculous statement.

If you do use limited counts, you will need a backup tracing collector; and limited counts are appealing because they save space (and most objects tend to only have a few references [1]).

[1] http://users.cecs.anu.edu.au/~steveb/pubs/papers/rc-ismm-201... page 4

saagarjha3y ago

Objective-C stores the retain count in the tops bits of an object pointer. The hardware it runs on has been specially tuned to make atomic reference counting faster.

kgeist3y ago

>C++, in std::shared_ptr<T>

In C++, every shared_ptr also allocates a separate "control block" in the heap, so performance is even worse than that.

int_19h3y ago

std::shared_ptr is not meant to be used as the default memory management technique, hence why it's not that important.

But it does mean that C++ is not a good example here. I would argue that Rust isn't, either, since ARC is also an occasional opt-in there rather than the default.

j / k navigate · click thread line to collapse