pkolaczk 6y ago

You're missing the fact that the TLAB pointer only ever moves forward, so it always points to memory that has not been used recently. By the time a reset makes it point back to the same memory, the application has allocated several megabytes, sometimes hundreds of megabytes, and most of that new-gen memory does not fit even in L3 cache.

The stack pointer moves in both directions, and the total range of that back-and-forth movement is typically a few kilobytes, so it may fit entirely in L1.
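To make the contrast concrete, here is a toy model of the two pointer patterns. It is only a sketch: the class name, method names and sizes are made up for illustration, and it is not how HotSpot implements TLABs. It just counts how many distinct bytes each pattern walks over.

```java
// Hypothetical toy model, not HotSpot code: compares the address range
// touched by a forward-only bump pointer with that of a stack pointer.
final class PointerRangeDemo {

    // Forward-only bump pointer: every allocation gets fresh offsets until
    // the region is exhausted, and only then does the pointer wrap around.
    static long bumpRange(long regionBytes, int allocBytes, int allocations) {
        long top = 0;
        long maxTouched = 0;
        for (int i = 0; i < allocations; i++) {
            if (top + allocBytes > regionBytes) {
                top = 0; // models the reset after the region fills up
            }
            top += allocBytes;
            maxTouched = Math.max(maxTouched, top);
        }
        return maxTouched;
    }

    // Stack pointer: each call chain pushes `depth` frames and then pops
    // them all, so the same few offsets are reused over and over.
    static long stackRange(int frameBytes, int depth, int calls) {
        long sp = 0;
        long maxTouched = 0;
        for (int i = 0; i < calls; i++) {
            for (int d = 0; d < depth; d++) {
                sp += frameBytes;
                maxTouched = Math.max(maxTouched, sp);
            }
            sp -= (long) frameBytes * depth;
        }
        return maxTouched;
    }

    public static void main(String[] args) {
        // One million 32-byte allocations in an assumed 64 MB region.
        System.out.println("bump range:  " + bumpRange(64L << 20, 32, 1_000_000) + " bytes");
        // One million call chains, 8 frames deep, 64 bytes per frame.
        System.out.println("stack range: " + stackRange(64, 8, 1_000_000) + " bytes");
    }
}
```

With these assumed numbers the bump pointer walks over 32,000,000 bytes before it ever revisits an address, while the stack pointer stays within 512 bytes.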

Just check with perf what happens when you iterate several times over a 100 MB array and compare that to iterating several times over a 10 kB array. Both are contiguous, but the performance difference is pretty dramatic.
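A rough way to try this experiment in Java is sketched below. The class name, array sizes and pass counts are my own assumptions, and this is a crude timing loop rather than a proper benchmark harness, so treat the numbers as indicative only. Both runs perform the same total number of element reads; only the working set differs.

```java
// Hypothetical demo: same number of reads over a cache-sized array and
// over an array far larger than a typical L3 cache.
final class WorkingSetDemo {

    // Sums every element of `data`, repeating the full pass `passes` times.
    static long sumPasses(long[] data, int passes) {
        long sum = 0;
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < data.length; i++) {
                sum += data[i];
            }
        }
        return sum;
    }

    // Times one run; the checksum is printed so the JIT cannot drop the loop.
    static long timeNanos(String label, long[] data, int passes) {
        long start = System.nanoTime();
        long sum = sumPasses(data, passes);
        long elapsed = System.nanoTime() - start;
        System.out.println(label + ": " + elapsed / 1_000_000 + " ms (checksum " + sum + ")");
        return elapsed;
    }

    public static void main(String[] args) {
        long[] small = new long[10 * 1024 / 8];          // about 10 kB
        long[] large = new long[100 * 1024 * 1024 / 8];  // about 100 MB of heap
        java.util.Arrays.fill(small, 1L);
        java.util.Arrays.fill(large, 1L);

        int largePasses = 5;
        // Scale the small run so both do the same number of element reads.
        int smallPasses = largePasses * (large.length / small.length);

        // Warm-up so that both loops are JIT-compiled before timing.
        sumPasses(small, smallPasses);
        sumPasses(large, largePasses);

        timeNanos("10 kB ", small, smallPasses);
        timeNanos("100 MB", large, largePasses);
    }
}
```

Running it under `perf stat` and looking at the cache-miss counters should show where the difference comes from. The exact ratio depends on the hardware, since prefetching helps a sequential scan quite a lot.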

Besides that, the faster you allocate, the faster you run out of new-gen space and the more often you trigger minor collections. These are not free. The more frequent the minor collections, the more likely objects are to survive them, and the cost is proportional to the survival rate. That's why many Java apps use a pretty big new generation, hoping that most young objects die before a collection happens.
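A back-of-the-envelope model of this effect, under strong simplifying assumptions that are mine and not measured: the allocation rate is constant, every object lives for the same fixed time, and a minor collection runs exactly when the new generation fills up. Then only the objects allocated during the last `lifetimeSeconds` before the collection are still alive.

```java
// Hypothetical simplified model of minor-collection survival; a real
// collector and real object lifetimes are far less regular than this.
final class SurvivalModel {

    // Time between minor collections: how long it takes to fill new gen.
    static double gcIntervalSeconds(double youngGenBytes, double allocBytesPerSec) {
        return youngGenBytes / allocBytesPerSec;
    }

    // Fraction of allocated bytes still alive when the collection runs,
    // assuming every object lives exactly `lifetimeSeconds`.
    static double survivingFraction(double youngGenBytes,
                                    double allocBytesPerSec,
                                    double lifetimeSeconds) {
        double interval = gcIntervalSeconds(youngGenBytes, allocBytesPerSec);
        return Math.min(1.0, lifetimeSeconds / interval);
    }

    public static void main(String[] args) {
        double lifetime = 0.01; // assumed: every object lives 10 ms
        double rate = 1e9;      // assumed: 1 GB/s allocation rate
        System.out.println("1 GB new gen:   " + survivingFraction(1e9, rate, lifetime));
        System.out.println("100 MB new gen: " + survivingFraction(1e8, rate, lifetime));
    }
}
```

In this model, shrinking the new generation by 10x (or allocating 10x faster) makes collections 10x more frequent and raises the surviving fraction by 10x, which is why a larger new generation helps.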

This is not just theory. I have seen it too many times: reducing the allocation rate to nearly zero caused significant speedups, by an order of magnitude or more. Reducing memory traffic is also essential for good multicore scaling. It doesn't matter that each core has a separate TLAB when their total allocation rate is so high that they saturate the link between the LLC and main memory. This problem is easy to miss with classic method profiling, because a program that has it just shows everything being magically slow, with no obvious bottleneck.
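As a small illustration of what reducing the allocation rate can look like in code, here is the well-known boxed-accumulator pattern next to its primitive equivalent. The class and method names are hypothetical, and this is not the code from the cases I mentioned. Whether the boxed version really allocates on every iteration depends on the JIT, since escape analysis can sometimes remove the boxing.

```java
// Hypothetical example: same result, very different allocation behaviour.
final class AllocationRateDemo {

    // Accumulating into a boxed Long may create a new Long object on each
    // iteration once the value leaves the small-value cache.
    static long boxedSum(int n) {
        Long sum = 0L;
        for (long i = 0; i < n; i++) {
            sum += i; // unbox, add, box again
        }
        return sum;
    }

    // Accumulating into a primitive long allocates nothing on the heap.
    static long primitiveSum(int n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 100_000_000;
        System.out.println(boxedSum(n));
        System.out.println(primitiveSum(n));
    }
}
```

Running both with GC logging enabled (for example `-Xlog:gc` on recent JDKs) is one way to see how many minor collections each variant triggers.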

ernst_klim 6y ago

> You're missing the fact that the tlab pointer is only ever moved forward, so it always points to recently unused memory. Until the reset happens and it points back to the same memory again, the application managed to allocate several megabytes or sometimes hundreds of megabytes, and most of that new-gen memory does not fit even in L3 cache.

Yes, you are right about stack locality. It does move back and forth, which keeps the effectively used memory region quite small.

> These are not free. The faster you do minor collections, the more likely it is for the objects to survive. And the cost is proportional to survival rate.

Yes, that's true. Immutable languages do much better here, with small minor heaps (OCaml has 2MB on amd64) and very small survival rates (many objects are allocated directly on the older heap if they are known in advance to be long-lived).

Now I understand your point better and I agree.
