Performance of modern Java on data-heavy workloads

(jet-start.sh)

202 points · cangencer · 6y ago · 98 comments

33 comments · 6 top-level

willvarfar · 6y ago · 15 in thread

A very clear and interesting post.

I've been trying to fit big-enough long-running stuff into JVMs for a few years, and have found that minimizing the amount of garbage is paramount. It's a bit like games or C programming.

Recent JVM features like 8-bit strings and not having a size-limit on the interned pools etc have been really helpful.

But, for my workloads, the big wastes are still things like java.time.Instant and the overhead of temporary strings (which, these days, copy the underlying data. My code worked better when split strings used to just be views).

There are collections for much more memory-efficient (and faster) maps and things, and also efficient (and fast) JSON parsing etc. I have evaluated and benchmarked and adopted a few of these kinds of things.

Now, when I examine heap dumps and try to work out where else I can save bytes to keep GC at bay, I mostly see fragments of Instant and String, which are heavily used in my code.

If there was only a library that did date manipulation and arithmetic with longs instead of Instant :(
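
For illustration, here is a minimal sketch of what such long-based arithmetic could look like, assuming timestamps are UTC epoch milliseconds. `EpochMillis` and its method names are hypothetical and not taken from any existing library. Time zones and DST are deliberately out of scope, and every day is treated as exactly 86,400,000 ms.

```java
// Hypothetical helper: date arithmetic on raw epoch-millisecond longs (UTC only).
// Nothing here allocates, so it produces no garbage.
final class EpochMillis {
    static final long MILLIS_PER_DAY = 86_400_000L;

    private EpochMillis() {}

    /** Rounds down to midnight UTC; floorDiv keeps pre-1970 (negative) values correct. */
    static long truncateToDay(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_DAY) * MILLIS_PER_DAY;
    }

    /** Adds whole days, throwing ArithmeticException on overflow instead of wrapping. */
    static long plusDays(long epochMillis, long days) {
        return Math.addExact(epochMillis, Math.multiplyExact(days, MILLIS_PER_DAY));
    }

    /** Whole days from one timestamp to another, rounding toward negative infinity. */
    static long daysBetween(long fromMillis, long toMillis) {
        return Math.floorDiv(Math.subtractExact(toMillis, fromMillis), MILLIS_PER_DAY);
    }
}
```

Anything calendar-aware (months, local time, DST) is much harder to get right on bare longs, which is probably why people reach for java.time in the first place.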

nicktelford · 6y ago

> If there was only a library that did date manipulation and arithmetic with longs instead of Instant :(

You can always pass around long timestamps and just convert to Instant whenever you need to do any date/time processing. Provided the Instant doesn't escape the method it's allocated in, it should be optimized via inlining and Scalar Replacement so that it doesn't generate garbage. Of course, you'd be adding in the overhead of dividing up your long into seconds/nanos each time.

Note: if this doesn't work on OpenJDK, try GraalVM: its Partial Escape Analysis should do a better job at finding ways of eliding heap allocations.
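
A rough sketch of that pattern, with hypothetical class and method names: longs go in and out, and the `Instant` exists only inside the method. Whether the JIT actually removes the allocation depends on its inlining decisions, so treat this as something to verify with a profiler rather than a guarantee.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: callers only ever see long timestamps.
final class Timestamps {
    private Timestamps() {}

    /** Start of the containing hour, in epoch milliseconds. */
    static long startOfHour(long epochMillis) {
        // The Instant values are created and consumed here and are never
        // stored or returned, so they do not escape this method.
        Instant t = Instant.ofEpochMilli(epochMillis);
        return t.truncatedTo(ChronoUnit.HOURS).toEpochMilli();
    }

    /** Hour of day (0-23) in UTC for the given epoch milliseconds. */
    static int hourOfDayUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.UTC).getHour();
    }
}
```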

haxen · 6y ago

The worst things happen when value objects are stuck on the heap for a while and then turn into garbage when the value is updated. Escape Analysis doesn't help there, only a good GC can help.

Yeroc · 6y ago

There's a saying that "only the good die young" which applies to Java GC. If your Instants and Strings are really short lived then the GC for those is nearly free. For your workload are these objects living on the heap for long enough to be promoted beyond the young generation?

shellac · 6y ago

On G1 you used to be able to use `-XX:+UseStringDeduplication`, which gives you back something like the string sharing used pre-whatever it was (8?).

this_user · 6y ago

If you are looking to do low latency, then G1 isn't the best choice these days, though. Shenandoah or ZGC are both more advanced algorithms that can greatly reduce the pauses caused by GC activity.

nicktelford · 6y ago

That appears to still be an option, see the last line of: https://docs.oracle.com/en/java/javase/14/gctuning/garbage-f...

jcfrei · 6y ago

> There are collections for much more memory-efficient (and faster) maps and things, and also efficient (and fast) JSON parsing etc. I have evaluated and benchmarked and adopted a few of these kinds of things.

That sounds very interesting. Can you provide links to the benchmarks for fast JSON parsing (libraries)? And the fast maps?

willvarfar · 6y ago

I can quickly list our findings (from testing on our specific actual workloads; ymmv etc):

For collections, we used to use trove but migrated to fastutil a few years ago.

For JSON parsing, we are processing lots of very small messages, so use LazyJson. The biggest downside to LazyJson is it doesn't have cheap iteration of keys; the framework could easily provide it. For larger documents, say over a few MB, libraries like Jackson are faster.

Yeah, perhaps Java isn't the right tool for our job. And yeah, more recent benchmarking and testing might suggest newer, better libraries than those I have just listed.

It's horrific the lengths you have to go to to get good performance out of Java for the workloads we have; Python prototypes run much faster with PyPy, and I think that is really about heap management more than code generation.

For those of us who know C/C++, it's kinda uncomfortable when staring at code and thinking "that temporary string there? 40+ bytes just for the object header!" and things. But, of course, there are advantages to working in memory-safe languages.

C# and its structs, yummy.
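
To make the point about primitive collections concrete, here is a toy long-to-long map using open addressing. It is only a sketch of the general idea behind libraries like fastutil, not their implementation, and `LongLongMap` is a made-up name. Keys and values sit in two flat `long[]` arrays, whereas a `HashMap<Long, Long>` typically needs a node object plus boxed `Long`s per entry.

```java
import java.util.Arrays;

// Toy long -> long hash map with linear probing. No boxing, no per-entry objects.
final class LongLongMap {
    private static final long EMPTY = Long.MIN_VALUE; // reserved marker, not usable as a key

    private long[] keys;
    private long[] values;
    private int size;

    LongLongMap(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 28)) {
            throw new IllegalArgumentException("unsupported size: " + expectedSize);
        }
        int capacity = 16;
        while (capacity < expectedSize * 2L) {
            capacity <<= 1; // power of two, kept at most half full
        }
        keys = new long[capacity];
        values = new long[capacity];
        Arrays.fill(keys, EMPTY);
    }

    void put(long key, long value) {
        if (key == EMPTY) throw new IllegalArgumentException("reserved key");
        if ((size + 1) * 2 > keys.length) resize();
        int i = slotFor(key);
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    long getOrDefault(long key, long defaultValue) {
        if (key == EMPTY) return defaultValue;
        int i = slotFor(key);
        return keys[i] == key ? values[i] : defaultValue;
    }

    int size() {
        return size;
    }

    // Returns the slot holding the key, or the empty slot where it would go.
    // The table is never more than half full, so an empty slot always exists.
    private int slotFor(long key) {
        int mask = keys.length - 1;
        long h = key * 0x9E3779B97F4A7C15L; // cheap multiplicative scramble
        int i = (int) (h ^ (h >>> 32)) & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    private void resize() {
        long[] oldKeys = keys;
        long[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new long[oldValues.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) put(oldKeys[j], oldValues[j]);
        }
    }
}
```

It has no removal and reserves one key value, which is the kind of corner a real library handles for you.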

rb808 · 6y ago

You're absolutely right. It's one reason I struggle with the modern fashion for immutable classes and FP: they are always making copies of everything, which seems crazy.

mumblemumble · 6y ago

Ideally, a good compiler that understands FP will, behind the scenes, detect when it's safe to mutate the old data rather than creating a copy. That's a big part of why Haskell manages to be neck-and-neck with C despite being functionally pure.

Where it gets tricky is in an environment like the JVM where programming in that style was not anticipated, and introducing any optimizations along these lines for the benefit of the proverbial Scala fans needs to be balanced against the obligation not to adversely impact idiomatic Java code.

That said, even without that, it's not necessarily crazy. It's just a value call: Do you believe that more functional code is easier to maintain, and perhaps value that above raw performance? I'm old enough to remember similar debates about how object-oriented C++ code should be, and to have at least encountered Usenet posts from similar debates about how structured C code should be. I don't bring this up by way of trying to weasel in some "historical inevitability" argument - these are legitimate debates, and there are still problem domains where coding guidelines may discourage, or even prohibit, certain structured programming practices. For very good reasons.

stu2010 · 6y ago

Everything is a trade-off. Immutable objects make it easier to stay safe when multiple threads work concurrently. Shared mutable state is still very difficult to get right, and once you introduce locks you've crippled performance.

We have so many cores now that it tends to be a better trade-off to have many threads doing some wasteful work (copies, extra GC pressure, potentially multiple threads duplicating the same work) than to try to have a perfectly optimized single thread.
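
One way to picture that trade-off is a copy-on-write snapshot: readers get an immutable list with no locking, and writers pay for a full copy plus a compare-and-set retry loop. This is a sketch with a hypothetical `SnapshotList` class, not a recommendation for write-heavy data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Readers never block; every write copies the whole list (the "wasteful work").
final class SnapshotList {
    private final AtomicReference<List<String>> current =
            new AtomicReference<>(Collections.emptyList());

    /** Copies the current list, appends to the copy, and publishes it atomically. */
    void add(String item) {
        while (true) {
            List<String> old = current.get();
            List<String> copy = new ArrayList<>(old.size() + 1);
            copy.addAll(old);
            copy.add(item);
            // If another writer published first, retry against the newer list.
            if (current.compareAndSet(old, Collections.unmodifiableList(copy))) {
                return;
            }
        }
    }

    /** Returns an immutable view that later writes will never change. */
    List<String> snapshot() {
        return current.get();
    }
}
```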

logicchains · 6y ago

> It's one reason I struggle with the modern fashion for immutable classes and FP: they are always making copies of everything, which seems crazy.

It depends how it's implemented. It's possible to get very nice performance with immutability and copying through use of an arena allocator, as your stuff will essentially always be in cache (due to reusing the arena), and allocation/deallocation is just bumping a pointer. Of course, not everything easily fits into this approach, but a surprisingly large amount of code can, if designed with it in mind (and using a language that supports it without too much pain, like C/C++).

The language Zig is particularly interesting in this regard because everything that allocates takes the allocator as a param, and it has built-in arena allocators in the standard lib.
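
For readers who haven't seen one, a bump-pointer arena can be sketched in a few lines. This version is written in Java over a `ByteBuffer` purely for illustration, with a hypothetical `Arena` class. In C, C++ or Zig the same idea hands out raw pointers instead of offsets.

```java
import java.nio.ByteBuffer;

// Toy bump-pointer arena over one preallocated buffer.
final class Arena {
    private final ByteBuffer buffer;
    private int top; // offset of the next free byte

    Arena(int capacityBytes) {
        buffer = ByteBuffer.allocate(capacityBytes);
    }

    /** Reserves size bytes (rounded up to a multiple of 8) and returns their offset. */
    int allocate(int size) {
        if (size < 0) throw new IllegalArgumentException("negative size");
        int aligned = (size + 7) & ~7;
        // aligned < size catches integer overflow in the rounding above.
        if (aligned < size || aligned > buffer.capacity() - top) {
            throw new IllegalStateException("arena exhausted");
        }
        int offset = top;
        top += aligned; // allocation is just bumping the pointer
        return offset;
    }

    void putLong(int offset, long value) {
        buffer.putLong(offset, value);
    }

    long getLong(int offset) {
        return buffer.getLong(offset);
    }

    int used() {
        return top;
    }

    /** Frees every allocation at once by rewinding the pointer. */
    void reset() {
        top = 0;
    }
}
```

Because the same buffer is reused after every `reset()`, the working set tends to stay in cache, which is the effect described above.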

Cthulhu_ · 6y ago

Disclaimer: I'm not a languages expert, but, I think there's a case to make for performance vs clarity / readability. FP is great for parallelisation, multi-core work, things like web servers and other internet-facing services. But probably not the best for number crunching. Horizontal vs vertical scaling, I think.

I believe you can express your problem (and solution) better using FP, once you have it solved you can zoom in and replace the most demanding segments with iterative programming, or go down lower to the bare metal.

int_19h · 6y ago

If an object is truly immutable, its object identity shouldn't matter in the vast majority of cases, and so it should be okay to just copy the whole thing, instead of passing around references to it.

Unfortunately, the legacy Java semantics of == means that they can't do this proactively. But didn't Java get opt-in value types recently?

FridgeSeal · 6y ago

Persistent data structures give you a mixture of performance and immutability and are common in functional programming.
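
The simplest example is probably an immutable singly linked list, where a new version shares its entire tail with the old one instead of copying it. A minimal sketch with a hypothetical `PList` class follows. Production libraries use tries and trees to get the same sharing for maps and vectors.

```java
// Immutable singly linked list with structural sharing.
final class PList<T> {
    private static final PList<Object> EMPTY = new PList<>(null, null, 0);

    private final T head;
    private final PList<T> tail;
    private final int size;

    private PList(T head, PList<T> tail, int size) {
        this.head = head;
        this.tail = tail;
        this.size = size;
    }

    @SuppressWarnings("unchecked")
    static <T> PList<T> empty() {
        return (PList<T>) EMPTY; // safe: the empty list holds no elements
    }

    /** O(1): allocates one node and reuses this list, unchanged, as the tail. */
    PList<T> prepend(T value) {
        return new PList<>(value, this, size + 1);
    }

    T head() {
        return head;
    }

    PList<T> tail() {
        return tail;
    }

    int size() {
        return size;
    }
}
```

Two lists built from the same base each cost one extra node, and the base itself is never modified.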

blinkingled · 6y ago · 6 in thread

I wonder how things would have stacked up with OpenJ9. The AdoptOpenJDK project makes OpenJ9 builds available for Java 8/11/13/14, so it should be trivial to include it in the benchmarks.

We have been experimenting with it in light of the Oracle licensing situation, and it does provide an interesting set of options: AOT, various GCs (metronome, gencon, balanced), along with many other differentiators from OpenJDK, like JITServer, which offloads JIT compilation to remote nodes.

https://www.eclipse.org/openj9/docs/gc/

It doesn't get as much coverage as it should. It's production hardened (IBM has used it and still uses it for all their products) and it's fully open source.

pron · 6y ago

> in light of the Oracle licensing situation

You mean the licensing situation where Oracle completed open-sourcing the entire JDK and made Java free of field-of-use restrictions for the first time in its history?

If you're talking about the JDK builds you download from Oracle, then there are two (each linking to the other): one paid, for support customers, and one 100% free and open-source: http://jdk.java.net/

blinkingled · 6y ago

Many organizations need to have supported 1.7 and 1.8 releases, and it's a lot of money to spend on per-core licensing, which is a new thing since Oracle took over. The link you posted does not have free updated binaries for JDK 7 or 8. For those you have to pay. A lot.

So it makes sense to look for non Oracle JDK and along with OpenJDK, OpenJ9 is a great choice.

0x0 · 6y ago

That's nice, but they also completely changed the license for java 8 in a minor security update late in the game, making it super easy to accidentally click through and putting yourself or your organization at risk of massive license violations. A trojan horse if I ever saw one.

willvarfar · 6y ago

I remember wanting to connect to a jvm with a profiler and getting a license agreement as that is now an enterprise feature and costs. It’s a slippery slope.

pron · 6y ago

There are no more paid features since JDK 11. That low-overhead profiler is now free and open. For the first time in Java's history, the JDK is 100% free.

fgonzag · 6y ago

I think he means the part where you have to pay to use JDKs older than 6 months, which means basically everyone has to pay.

molodec · 6y ago · 4 in thread

The specific workload matters a lot. I had a good experience with the Shenandoah collector on an application that generates very few intermediate objects, but where an object, once created, stays in the heap for a while (a custom-made key/value store for a very specific use case). Shenandoah was the best in terms of throughput and memory utilization. Most collectors are generational, so surviving objects have to be moved from Eden to Survivor to Old. Shenandoah is not generational, and I suspect it has less work to do for objects that survive compared to other collectors. When most objects live long enough, generational collectors hinder performance.

haxen · 6y ago

In the case of Hazelcast Jet and similar products, loads of young garbage are unavoidable because it comes from the data streaming through the pipeline. A generational GC should in principle get a great head start in this kind of workload, and our benchmarks have confirmed it.

bestboy · 6y ago

Yep, workload matters. Generational garbage collectors are fundamentally at odds with caching/pooling of objects. They are based on the assumption that objects die young. Typically that is not the case for internal caches, though. Caches usually consist of long-living/tenured objects.

NovaX · 6y ago

It is a stretch to claim caching is fundamentally at odds with GC. It is more correct to say that LRU breaks the generational hypothesis, because it prioritizes new entries which take a long time to be evicted. However many workloads are frequency biased and these one-hit wonders degrade the hit rate. That is why you'll see more aggressive eviction in a modern policy, so you'll have better GC behavior and higher hit rates using something like Java's Caffeine library.

haxen · 6y ago

Keep in mind that it's not fundamental. Generational GCs just make a bet that you can save a lot of effort by segregating the objects by age. In almost all Java workloads there are plenty of short-lived objects, and a generational GC takes care of them at an especially low cost. The price to pay for that is pretty low: basically it's the overhead of card marking (a write barrier is needed) and subsequent partial scanning of the Old Generation if there are many references from old to new objects.

Only very specialized workloads won't create many short-lived objects, and for those cases there are alternative non-generational GCs on the JVM (Z, Shenandoah).
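
As a rough mental model of card marking, here is a toy simulation. Real collectors emit the barrier as a few machine instructions rather than a method call. The 512-byte card size is an assumption made for this sketch, and `CardTable` with its method names is hypothetical.

```java
// Toy model of a card table covering a simulated old generation.
final class CardTable {
    static final int CARD_SHIFT = 9; // 512-byte cards, an assumption for this sketch

    private final byte[] cards;     // one byte per card: 0 = clean, 1 = dirty
    private final long oldGenStart; // simulated base address of the old generation

    CardTable(long oldGenStart, long oldGenBytes) {
        this.oldGenStart = oldGenStart;
        long cardSize = 1L << CARD_SHIFT;
        this.cards = new byte[(int) ((oldGenBytes + cardSize - 1) >>> CARD_SHIFT)];
    }

    /** Write barrier: called after a reference is stored into an old-generation field. */
    void onReferenceStore(long fieldAddress) {
        cards[(int) ((fieldAddress - oldGenStart) >>> CARD_SHIFT)] = 1;
    }

    boolean isDirty(int cardIndex) {
        return cards[cardIndex] != 0;
    }

    /**
     * What a young collection does conceptually: visit only the dirty cards
     * (the partial scan of the old generation), then mark them clean again.
     */
    int countDirtyAndClear() {
        int dirty = 0;
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] != 0) {
                dirty++;
                cards[i] = 0;
            }
        }
        return dirty;
    }
}
```

The barrier costs one shift and one byte store per reference write, and the scan cost grows with the number of dirty cards rather than with the size of the old generation.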

xvilka · 6y ago · 2 in thread

Converting Java code to Kotlin and then compiling it with Kotlin Native [1] is more promising from the performance point of view. Native code is always faster (assuming the compiler is good enough).

[1] https://kotlinlang.org/docs/reference/native-overview.html

haxen · 6y ago

An ahead-of-time compiler doesn't have the advantage of the call profile of polymorphic call sites. The JIT compiler has many more inlining opportunities, and in some cases this results in better performance.

Also, there are cases where manual memory management, which usually boils down to reference counting, has great overheads where a GC-managed runtime has no overhead at all. They involve repeatedly building up and then discarding large data structures. GC algorithms simply don't see the dead objects, whereas refcount-based management must explicitly free the memory of each object.

kllrnohj · 6y ago

> The JIT compiler has many more inlining opportunities

That's largely only true for devirtualization, which tends to not be as much of an issue in AOT compiled languages due to having features that just make reliance on virtual calls less prevalent (think C++ templates as an example in the extreme).

The only other case where JITs can inline more than AOTs is across shared library boundaries, which can be useful but if it is useful in a particular place it's also typically easy to "fix" by just making that function statically linked (or implemented in the header, even) instead.

Otherwise the time constraints of JITs near universally mean they cannot optimize as well as AOTs, even though they do have more runtime information available. Unless you do a multi-tiered JIT approach like WebKit does ( https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/ ), with the last tier being the one that finally lets a full "AOT quality" optimization pass happen because you can finally justify the time spent on the optimizer. But then you also have ridiculous warmup latencies.

> Also, there are cases where manual memory management, which usually boils down to reference counting, has great overheads where a GC-managed runtime has no overhead at all. They involve repeatedly building up and then discarding large data structures. GC algorithms simply don't see the dead objects, whereas refcount-based management must explicitly free the memory of each object.

There's a lot more to this than such a simple claim. GC'd languages also almost always need to pay a zeroing cost in conjunction with freeing memory, which makes the actual free a lot slower, and GC'd languages get slower the larger the object count gets, while manually managed languages are ~constant. There are also more strategies in play for manually managed languages than just ref counting, such as single ownership (std::unique_ptr, Rust's Box<>, etc.).

If you are doing something that involves repeatedly building up and then discarding a data structure, though, then that's where a manually managed language would run circles around a GC'd one. A simple arena allocator is a superb match for that and cannot be beaten in performance. Bump-pointer allocation speed, zero GC pause, zero collection latency, etc. This is what games do for per-frame allocations, for example. Essentially a single-frame GC without a collection pass being needed. Not a lot of things actually do build up and then discard a structure repeatedly, so you don't get to use this trick very often, but when you can it's stupid fast.

ww520 · 6y ago

G1 looks very good. Glad it becomes the default so one less thing to tune for a deployment.

cangencer (OP) · 5y ago

Follow-up post: https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-remat...
