I've been trying to fit big-enough long-running stuff into JVMs for a few years, and have found that minimizing the amount of garbage is paramount. It's a bit like games programming or C programming.
Recent JVM features like 8-bit strings and not having a size-limit on the interned pools etc have been really helpful.
But, for my workloads, the big wastes are still things like java.time.Instant and the overhead of temporary strings (which, these days, copy the underlying data; my code worked better when split strings were just views).
There are collections for much more memory-efficient (and faster) maps and things, and also efficient (and fast) JSON parsing etc. I have evaluated and benchmarked and adopted a few of these kinds of things.
Now, when I examine heap-dumps and try to work out where else I can save bytes to keep GC at bay, I mostly see fragments of Instant and String, which are heavily used in my code.
If only there were a library that did date manipulation and arithmetic with longs instead of Instant :(
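On the split-strings-as-views point, here is a minimal sketch of what a view-based split could look like on a current JDK. `SplitViews` and its `split` method are hypothetical names made up for illustration; the only library call relied on is `CharBuffer.wrap(CharSequence, int, int)`, which wraps the original string rather than copying its characters. Each part is still a small wrapper object, so this cuts the bytes copied, not the object count.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on one delimiter char and returns views, not copies.
final class SplitViews {
    static List<CharSequence> split(String s, char delim) {
        List<CharSequence> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= s.length(); i++) {
            if (i == s.length() || s.charAt(i) == delim) {
                // Read-only buffer over s[start, i); the character data is shared with s.
                parts.add(CharBuffer.wrap(s, start, i));
                start = i + 1;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        for (CharSequence part : split("a,bb,,ccc", ',')) {
            System.out.println(part.length() + ":" + part);
        }
    }
}
```

The catch is that every view keeps the whole source string reachable, which is presumably why the JDK moved to copying in the first place.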
You can always pass around long timestamps and just convert to Instant whenever you need to do any date/time processing. Provided the Instant doesn't escape the method it's allocated in, it should be optimized via inlining and Scalar Replacement so that it doesn't generate garbage. Of course, you'd be adding in the overhead of dividing up your long into seconds/nanos each time.
Note: if this doesn't work on OpenJDK, try GraalVM: its Partial Escape Analysis should do a better job at finding ways of eliding heap allocations.
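A rough sketch of that approach, assuming timestamps are stored as epoch milliseconds in UTC (the class and method names are made up for illustration). Simple arithmetic stays on the long; only the calendar lookup goes through a short-lived Instant that never leaves the method. Whether the JIT actually removes that allocation depends on the JVM and on inlining decisions, so it is worth checking with an allocation profiler rather than assuming it.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helpers for timestamps kept as epoch milliseconds (UTC).
final class LongTimestamps {
    private static final long MILLIS_PER_DAY = 86_400_000L;
    private static final long MILLIS_PER_HOUR = 3_600_000L;

    // Pure long arithmetic: no objects involved. Throws on overflow.
    static long plusDays(long epochMillis, long days) {
        return Math.addExact(epochMillis, Math.multiplyExact(days, MILLIS_PER_DAY));
    }

    // floorMod keeps the result in [0, MILLIS_PER_DAY) even before 1970.
    static int hourOfDayUtc(long epochMillis) {
        return (int) (Math.floorMod(epochMillis, MILLIS_PER_DAY) / MILLIS_PER_HOUR);
    }

    // Calendar logic via temporaries that do not escape this method,
    // which makes them candidates for scalar replacement.
    static int yearUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        long t = 1_000_000_000_000L; // 2001-09-09T01:46:40Z
        System.out.println(yearUtc(t) + " " + hourOfDayUtc(t));
    }
}
```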
That sounds very interesting. Can you provide links to the benchmarks for fast JSON parsing (libraries)? And the fast maps?
For collections, we used to use Trove but migrated to fastutil a few years ago.
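For anyone wondering where the savings come from: a `HashMap<Long, Long>` allocates a node object plus boxed keys and values per entry, while a primitive map keeps everything in flat arrays. The following is a toy open-addressing long-to-long map written only to show the idea. It is not how Trove or fastutil are implemented, `LongLongMap` is a made-up name, and it has no removal support.

```java
// Toy primitive map: linear probing over parallel arrays, no per-entry objects.
final class LongLongMap {
    private long[] keys;
    private long[] values;
    private boolean[] used;
    private int size;

    LongLongMap() { allocate(16); }

    // capacity must be a power of two so that (hash & mask) is a valid index
    private void allocate(int capacity) {
        keys = new long[capacity];
        values = new long[capacity];
        used = new boolean[capacity];
        size = 0;
    }

    private static long mix(long k) {
        k *= 0x9E3779B97F4A7C15L; // spread bits so sequential keys do not cluster
        return k ^ (k >>> 32);
    }

    // Returns the slot holding key, or the first free slot on its probe path.
    private int slot(long key) {
        int mask = keys.length - 1;
        int i = (int) mix(key) & mask;
        while (used[i] && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    void put(long key, long value) {
        if ((size + 1) * 2 > keys.length) grow(); // keep load factor at or below 0.5
        int i = slot(key);
        if (!used[i]) { used[i] = true; keys[i] = key; size++; }
        values[i] = value;
    }

    long getOrDefault(long key, long fallback) {
        int i = slot(key);
        return used[i] ? values[i] : fallback;
    }

    int size() { return size; }

    private void grow() {
        long[] oldKeys = keys; long[] oldValues = values; boolean[] oldUsed = used;
        allocate(oldKeys.length * 2);
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldUsed[j]) put(oldKeys[j], oldValues[j]);
        }
    }
}
```

In real code you would reach for a library type rather than rolling your own, but the layout above is the reason the heap dumps look so different.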
For JSON parsing, we are processing lots of very small messages, so use LazyJson. The biggest downside to LazyJson is it doesn't have cheap iteration of keys; the framework could easily provide it. For larger documents, say over a few MB, libraries like Jackson are faster.
Yeah, perhaps Java isn't the right tool for our job. And yeah, more recent benchmarking and testing might suggest newer, better libraries than those I have just listed.
It's horrific the lengths you have to go to to get good performance from Java for the workloads we have; Python prototypes run much faster with PyPy, and I think that is really about heap management more than code generation.
For those of us who know C/C++, it's kinda uncomfortable when staring at code and thinking "that temporary string there? 40+ bytes just in object overhead!" and things. But, of course, there are advantages to working in memory-safe languages.
C# and its structs, yummy.
Where it gets tricky is in an environment like the JVM where programming in that style was not anticipated, and introducing any optimizations along these lines for the benefit of the proverbial Scala fans needs to be balanced against the obligation not to adversely impact idiomatic Java code.
That said, even without that, it's not necessarily crazy. It's just a value call: Do you believe that more functional code is easier to maintain, and perhaps value that above raw performance? I'm old enough to remember similar debates about how object-oriented C++ code should be, and to have at least encountered Usenet posts from similar debates about how structured C code should be. I don't bring this up by way of trying to weasel in some "historical inevitability" argument - these are legitimate debates, and there are still problem domains where coding guidelines may discourage, or even prohibit, certain structured programming practices. For very good reasons.
We have so many cores now that it tends to be a better trade-off to have many threads doing some wasteful work (copies, extra GC pressure, potentially multiple threads duplicating the same work) than to try to have a perfectly optimized single thread.
It depends how it's implemented. It's possible to get very nice performance with immutability and copying through use of an arena allocator, as your stuff will essentially always be in cache (due to reusing the arena), and allocation/deallocation is just bumping a pointer. Of course, not everything easily fits into this approach, but a surprisingly large amount of code can, if designed with it in mind (and using a language that supports it without too much pain, like C/C++).
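To show what "just bumping a pointer" means, here is a small arena sketch. It is written in Java over a `ByteBuffer` purely for illustration, so it hands out offsets into one block rather than real pointers; `BumpArena` and its methods are made-up names, and a C or Zig version would return raw memory instead.

```java
import java.nio.ByteBuffer;

// Hypothetical bump-pointer arena: one preallocated block, reused after reset().
final class BumpArena {
    private final ByteBuffer block;
    private int top; // offset of the next free byte

    BumpArena(int capacityBytes) {
        block = ByteBuffer.allocate(capacityBytes);
    }

    // Reserves size bytes at an 8-byte-aligned offset and returns that offset.
    int alloc(int size) {
        int start = (top + 7) & ~7;
        if (size < 0 || start > block.capacity() - size) {
            throw new IllegalStateException("arena exhausted");
        }
        top = start + size; // allocation is just moving the offset forward
        return start;
    }

    void putLong(int offset, long value) { block.putLong(offset, value); }

    long getLong(int offset) { return block.getLong(offset); }

    // Frees everything at once; the same memory is handed out again next round.
    void reset() { top = 0; }

    int used() { return top; }

    public static void main(String[] args) {
        BumpArena arena = new BumpArena(1024);
        for (int round = 0; round < 3; round++) {
            int slot = arena.alloc(Long.BYTES);
            arena.putLong(slot, round);
            System.out.println("round " + round + " stored at offset " + slot);
            arena.reset(); // no per-object free, no collection pass
        }
    }
}
```

Because every round reuses the same block, the working set stays small and tends to stay in cache, which is the effect described above.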
The language Zig is particularly interesting in this regard because everything that allocates takes the allocator as a param, and it has built-in arena allocators in the standard lib.
I believe you can express your problem (and solution) better using FP. Once you have it solved, you can zoom in and replace the most demanding segments with iterative programming, or go lower, down to the bare metal.
Unfortunately, the legacy Java semantics of == means that they can't do this proactively. But didn't Java get opt-in value types recently?
We have been experimenting with it in light of the Oracle licensing situation and it does provide an interesting set of options - AOT, various GCs (metronome, gencon, balanced), along with many other differentiators from OpenJDK, like JITServer, which offloads JIT compilation to remote nodes.
https://www.eclipse.org/openj9/docs/gc/
It doesn't get as much coverage as it should - it's production hardened - IBM has used it and still uses it for all their products - and it's fully open source.
You mean the licensing situation where Oracle completed open-sourcing the entire JDK and made Java free of field-of-use restrictions for the first time in its history?
If you're talking about the JDK builds you download from Oracle, then there are two (each linking to the other): one paid, for support customers, and one 100% free and open-source: http://jdk.java.net/
So it makes sense to look for a non-Oracle JDK, and along with OpenJDK, OpenJ9 is a great choice.
Only very specialized workloads won't create many short-lived objects, and for those cases there are alternative non-generational GCs on the JVM (Z, Shenandoah).
[1] https://kotlinlang.org/docs/reference/native-overview.html
Also, there are cases where manual memory management, which usually boils down to reference counting, has great overheads where a GC-managed runtime has no overhead at all. They involve repeatedly building up and then discarding large data structures. GC algorithms simply don't see the dead objects, whereas refcount-based management must explicitly free the memory of each object.
That's largely only true for devirtualization, which tends to not be as much of an issue in AOT compiled languages due to having features that just make reliance on virtual calls less prevalent (think C++ templates as an example in the extreme).
The only other case where JITs can inline more than AOTs is across shared library boundaries, which can be useful but if it is useful in a particular place it's also typically easy to "fix" by just making that function statically linked (or implemented in the header, even) instead.
Otherwise the time constraints of JITs near universally mean they cannot optimize as well as AOTs, even though they do have more runtime information available. Unless you do a multi-tiered JIT approach like WebKit does ( https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/ ), with the last tier being the one that finally lets a full "AOT quality" optimization pass happen because you can finally justify the time spent on the optimizer. But then you also have ridiculous warmup latencies.
> Also, there are cases where manual memory management, which usually boils down to reference counting, has great overheads where a GC-managed runtime has no overhead at all. They involve repeatedly building up and then discarding large data structures. GC algorithms simply don't see the dead objects, whereas refcount-based management must explicitly free the memory of each object.
There's a lot more to this than such a simple claim. GC'd languages also almost always need to pay a zeroing cost in conjunction with freeing memory, which makes the actual free that happens a lot slower, and GC'd languages are slower the larger the object count gets while manual memory managed languages are ~constant. There are also more strategies in play for manual memory managed languages than just ref counting - such as plain single ownership (std::unique_ptr, Rust's Box<>, etc.)
If you are doing something that involves repeatedly building up and then discarding a data structure, though, then that's where manual memory management would run circles around a GC'd one. A simple arena allocator is a superb match for that and cannot be beat in performance. Bump-pointer allocation speed, zero GC pause, zero collection latency, etc... This is what games do for per-frame allocations, for example. Essentially a single-frame GC without a collection pass being needed. Not a lot of things actually do build up and then discard a structure repeatedly, so you don't get to use this trick very often, but when you can it's stupid fast.