undefined | Better HN

0 pointsOskarS3y ago0 comments

> Its actually usually slower than both manual memory management and GC

[citation needed]

You and the blog post are arguing opposite things, and neither of you have shown any evidence. I get that you're arguing that reference counted objects are bigger (to store the reference count) and/or might use double indirection (depending on implementation), which are both bad for caches. It's not a bad argument. But the counter-argument that the blog posts makes is persuasive as well: it's expensive running a GC that scans the heap looking for loose objects, and reference counting does not need to do that. GC is also "stop-the-world" as well unpredictable and jittery in a way reference counting is not.

My instinct is that reference counting is actually faster (which matches my personal experience), but really, this is not an argument you can solve by arguing in the abstract, you need actual data and benchmarks.

0 comments

19 comments · 4 top-level

adgjlsfhk13y ago· 8 in thread

The blog post is largely incoherent for several reasons.

1. It recommends 8 byte counters. This implies making every object in your programming language 8 bytes bigger which is pretty much a non-starter. Almost everyone that actually uses reference counting uses more like 2-4 bits.

2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).

3. The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)

4. Because no one uses 8 bytes for the count (since you don't want to double your memory consumption), you still need a GC anyway to deal with objects that get too many references.

im3w1l3y ago

> 2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).

The saving grace for the reference counter is that this is deterministic. The pause is an exact function of the allocation / deallocation pattern which is under the programmers control. So the code can be written in such way that it avoids big delays at inopportune times.

kaba03y ago

I'm not sure it's true -- just think about your browser displaying this page. Which DOM elem allocations are alive and which are not? The object graph depends on runtime data, and runtime data only. Some pattern may exist for a given program, but that is just as actionable under a tracing GC.

1 more reply

OskarSOP3y ago

> Almost everyone that actually uses reference counting uses more like 2-4 bits.

Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:

1. Rust, in Rc<T> and Arc<T>

2. C++, in std::shared_ptr<T>

3. Objective-C, in NSObject's retainCount

4. Swift, because of Objective-C

5. Python, where reference count is Py_ssize_t

These were literally the first languages i thought of, they all use 64-bit types. I would argue since reference counting is much rarer than GC, these make up the bulk of reference counting use in the real world. "Almost everyone", huh? It's a bit rich accusing the author of being "almost incoherent" and then say this.

> Reference counting can have arbitrarily long pauses as well.

Deallocating can take arbitrarily long, but deallocation does not stop-the-world. It stops the current thread, which is a huge difference. Malloc can take arbitrarily long as well, it's not like it's wait-free.

In addition, the GC pauses in regular programming languages are frequently orders of magnitude longer than deallocation. You would have to deallocate an enormous tree of object at the root for this to be an issue. And GCs have to do that as well, in addition to their regular stop-the-world pauses. This argument is just irrelevant.

> The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)

First off, this is not why Python has a GIL, but lets leave that aside.

Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.

> you still need a GC anyway to deal with objects that get too many references.

This is just silly. In languages which has both reference counting and traditional garbage collection (e.g. Python), they do it to avoid reference cycles, not because objects get "too many references". That is a ridiculous statement.

In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:

https://instagram-engineering.com/dismissing-python-garbage-...

I think it's still an open question if reference counting is always faster than GC, but I do not believe you have the technical expertise to evaluate such a claim. Your comment is four paragraphs long and riddled with factual errors. If you want to be convincing, show data that is better than that Instagram case study.

Sinidir3y ago

>First off, this is not why Python has a GIL, but lets leave that aside. Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.

This is actually part of why Python still has the GIL. A GILECTOMY was attempted and multithreaded atomic refcounting made things a lot slower (going up with the number of threads) and even other methods were not sufficient for performance.

https://www.youtube.com/watch?v=pLqv11ScGsQ

kaba03y ago

> Deallocating can take arbitrarily long, but deallocation does not stop-the-world

And modern GCs only have to stop the current thread to check for which thread-local objects are alive. The alice ones are moved to another generation, making the space reusable for free.

And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.

Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.

1 more reply

hayley-patton3y ago

> That is a ridiculous statement.

If you do use limited counts, you will need a backup tracing collector; and limited counts are appealing because they save space (and most objects tend to only have a few references [1]).

> In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:

...because generations in the GC are represented as linked lists, and modifying those linked lists reduces the number of page faults, in turn because pages are shared by copy-on-write between process. That representation isn't the best idea already, and shared structure should be in the oldest generation anyway, as you don't want to collect it (as SBCL does for images, which are loaded with copy-on-write).

[1] http://users.cecs.anu.edu.au/~steveb/pubs/papers/rc-ismm-201... page 4

saagarjha3y ago

Objective-C stores the retain count in the tops bits of an object pointer. The hardware it runs on has been specially tuned to make atomic reference counting faster.

kgeist3y ago

>C++, in std::shared_ptr<T>

In C++, every shared_ptr also allocates a separate "control block" in the heap, so performance is even worse than that.

1 more reply

pclmulqdq3y ago· 5 in thread

The blog post about this is pretty much incoherent, and comes from a bad understanding of performance, atomic, GC algorithms, and reference counting.

The ONLY reason to reference count is when you need GC-like behavior on a few objects, but do not want to impose a GC on all objects. It is a very valuable tool in performance code not because it is fast, but because it allows you to make other things fast. Suggesting that you should reference count the world is patently ridiculous. This is the reason for Rust's Arc and C++ shared_ptr, not that they are faster than a GC.

The blog post completely brushes aside the costs of _atomic_ counter increments and decrements, calling them "just increments." The "atomic" is the key performance problem, not the increment.

Reference counting also makes objects larger so small objects cannot fit inside a machine register.

Modern GC algorithms are very efficient. They do not need to pause the world, and they do not need to be jittery. However, someone who doesn't understand how expensive atomic ops are probably wouldn't understand this either.

elcritch3y ago

You're assuming RC requires atomics. Just like GCs have been improving so have RCs. There are several designs for modern non-atomic RCs with various solutions for passing data between threads. It generally improves performance significantly better than atomic RCs, for the cache reasons you point out.

Though yes reference counting does increase the size of small objects. I find the performance edge depends on the use case. In some scenarios tracing GCs can be faster than non-atomic RCs, but sometimes they're slower. Sometimes I swap between the two just to see. Generally though RCs tend to use less RAM which can be very valuable.

pjmlp3y ago

Without atomics, lock free algorithms are required, which without adequate CPU support are very hard to get right.

1 more reply

eru3y ago

> Suggesting that you should reference count the world is patently ridiculous.

To be fair, Python reference counts the world. And old Python only had reference counting, they added tracing garbage collection later.)

Python is not a fast language, of course. But neither is it a ridiculous language.

nyanpasu643y ago

I think one reason Python has a GIL (and can't execute in parallel on multiple threads in a process or interpreter) is to ensure faster non-atomic reference counting doesn't cause data races and memory errors.

pclmulqdq3y ago

I disagree on the ridiculous part. Python (the language and the interpreter) is generally not considered to be an example of high-quality engineering.

Also, when you have a global interpreter lock, you don't have to do anything inside that interpreter with atomics. Reference counting would be blazing fast in most cases where it can be done without atomic ops.

marginalia_nu3y ago· 2 in thread

Eh, the GC you are describing sounds like something out of 1994. Modern concurrent algorithms do not need to stop execution. This type of code may paradoxically be slower in benchmarks when there isn't any memory churn, as it necessarily introduces barrier instructions at key points to ensure the GC has a consistent view of the memory (although memory consistency is also a concern for reference counting).

adgjlsfhk13y ago

Note that if you can afford stop the world pauses, GCs that use them are much more efficient. Parallel stop the world garbage collectors are by far the best for minimizing runtime, while concurrent GCs have higher overhead, but can guarantee no significant pauses.

marginalia_nu3y ago

Yeah, that's true, but it shouldn't be held against GC that it interrupts the execution, since it really doesn't have to.

kaba03y ago

With all due respect, why do you believe that your experience is meaningful compared to people writing actual industrial-scale runtimes, when RC is the hello world of GC algorithms? It’s not that they don’t know about it, it’s that they studied the topic countless times and found significant difference between the two approaches, to the point that no language considered fast uses RC (or if they do, they do so because it is a useful escape hatch to manual memory management that doesn’t need support from the runtime).

j / k navigate · click thread line to collapse

0 comments

19 comments · 4 top-level

adgjlsfhk13y ago· 8 in thread

The blog post is largely incoherent for several reasons.

4. Because no one uses 8 bytes for the count (since you don't want to double your memory consumption), you still need a GC anyway to deal with objects that get too many references.

im3w1l3y ago

kaba03y ago

1 more reply

OskarSOP3y ago

> Almost everyone that actually uses reference counting uses more like 2-4 bits.

Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:

1. Rust, in Rc<T> and Arc<T>

2. C++, in std::shared_ptr<T>

3. Objective-C, in NSObject's retainCount

4. Swift, because of Objective-C

5. Python, where reference count is Py_ssize_t

> Reference counting can have arbitrarily long pauses as well.

First off, this is not why Python has a GIL, but lets leave that aside.

> you still need a GC anyway to deal with objects that get too many references.

https://instagram-engineering.com/dismissing-python-garbage-...

Sinidir3y ago

https://www.youtube.com/watch?v=pLqv11ScGsQ

kaba03y ago

> Deallocating can take arbitrarily long, but deallocation does not stop-the-world

And modern GCs only have to stop the current thread to check for which thread-local objects are alive. The alice ones are moved to another generation, making the space reusable for free.

And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.

Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.

1 more reply

hayley-patton3y ago

> That is a ridiculous statement.

If you do use limited counts, you will need a backup tracing collector; and limited counts are appealing because they save space (and most objects tend to only have a few references [1]).

[1] http://users.cecs.anu.edu.au/~steveb/pubs/papers/rc-ismm-201... page 4

saagarjha3y ago

Objective-C stores the retain count in the tops bits of an object pointer. The hardware it runs on has been specially tuned to make atomic reference counting faster.

kgeist3y ago

>C++, in std::shared_ptr<T>

In C++, every shared_ptr also allocates a separate "control block" in the heap, so performance is even worse than that.

1 more reply

pclmulqdq3y ago· 5 in thread

The blog post about this is pretty much incoherent, and comes from a bad understanding of performance, atomic, GC algorithms, and reference counting.

The blog post completely brushes aside the costs of _atomic_ counter increments and decrements, calling them "just increments." The "atomic" is the key performance problem, not the increment.

Reference counting also makes objects larger so small objects cannot fit inside a machine register.

elcritch3y ago

pjmlp3y ago

Without atomics, lock free algorithms are required, which without adequate CPU support are very hard to get right.

1 more reply

eru3y ago

> Suggesting that you should reference count the world is patently ridiculous.

To be fair, Python reference counts the world. And old Python only had reference counting, they added tracing garbage collection later.)

Python is not a fast language, of course. But neither is it a ridiculous language.

nyanpasu643y ago

pclmulqdq3y ago

I disagree on the ridiculous part. Python (the language and the interpreter) is generally not considered to be an example of high-quality engineering.

marginalia_nu3y ago· 2 in thread

adgjlsfhk13y ago

marginalia_nu3y ago

Yeah, that's true, but it shouldn't be held against GC that it interrupts the execution, since it really doesn't have to.

kaba03y ago

j / k navigate · click thread line to collapse