Java BitSet performance-flicker mystery (opens in new tab)

(blog.agical.se)

113 pointscospaia4y ago86 comments

86 comments

72 comments · 21 top-level

cospaiaOP4y ago· 19 in thread

I'm stumbling onto a strange Java performance irregularity, especially easy to hit with Docker on Linux. A super straightforward implementation of Eratosthenes sieve using a Java BitSet for storage, sometimes drops in performance by 50%. Even more with JDK8. Any JAVA/JDK/JVM experts here that can shed some light on the mystery for me? I've been at it a while, but it is quite a bit outside my field of knowledge, and I am out of ideas for how to proceed. The blog article linked + the Git repository is an attempt to summarize/minimize things.

Arcuru4y ago

Do you have performance traces of good and bad runs to compare? Instead of trying to come up with more theories you should start by looking at where the time is being spent.

I'm not experienced with doing micro-optimisations in Java, but I assume you can profile it to find out which individual operations are taking up the time just via a sampling trace.

Given it's common on a particular OS setup, I wouldn't be surprised if it's in the System.currentTimeMillis() call that you're doing every loop.

cospaiaOP4y ago

Yes, I should try to profile each run and compare the profiles of the bad and good ones.

Would be quite funny if it's in the System.currentTimeMillis() call this happens. I'd try to figure out a way to remove that from the picture...

1 more reply

cospaiaOP4y ago

I've now made an experiment that removes System.currentTimeMillis() as a suspect: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

dehrmann4y ago

> System.currentTimeMillis() is pretty platform-dependent, so it's a good place to look

benmmurphy4y ago

you could try running it with the linux perf command.

`perf stat -d java ...`

if your CPU supports performance counters it should give you L1 cache misses/branch misses/etc. that might give you some insight into what is different between the runs. someone else mentioned it could be the memory alignment. i think java might allocate with 8 byte alignment and maybe something funny goes on with the L1-caching if the bitset allocation is not 16 byte or 32 byte aligned.

if you are running it within docker you might need to use: ` --security-opt seccomp:unconfined --cap-add SYS_PTRACE --cap-add CAP_SYS_ADMIN`

fulafel4y ago

This would be my first go-to as well. See if you are retiring less instructions, and if yes, see if the counters tell what the CPU is stalling on vs the faster runs.

plasma4y ago

I’m not a Java expert, but potentially due to the CPU/memory OS scheduler?

Try running the workload with pinned CPUs assigned and not across NUMA nodes.

myrryr4y ago

Try running the workload with pinned CPUs assigned ^^^--- it will be this.

I've run into the same weirdness on other things and this always solves it. Some cores are better at some things than others.

1 more reply

rvr_4y ago

I think you Ryzen CPU is being throttled. It is not able to sustain it's burst speed. You should keep an eye on CPU temp and clock.

My benchmarks on hardware that have less single thread power than yours:

Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz - Linux - Java 8

0 - Passes: 3200, count: 78498, Valid: true 1 - Passes: 3197, count: 78498, Valid: true 2 - Passes: 3200, count: 78498, Valid: true 3 - Passes: 3202, count: 78498, Valid: true 4 - Passes: 3207, count: 78498, Valid: true 5 - Passes: 3207, count: 78498, Valid: true 6 - Passes: 3202, count: 78498, Valid: true 7 - Passes: 3205, count: 78498, Valid: true 8 - Passes: 3209, count: 78498, Valid: true 9 - Passes: 3178, count: 78498, Valid: true

Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz - Linux - Java 18

0 - Passes: 3445, count: 78498, Valid: true 1 - Passes: 3443, count: 78498, Valid: true 2 - Passes: 3408, count: 78498, Valid: true 3 - Passes: 3449, count: 78498, Valid: true 4 - Passes: 3439, count: 78498, Valid: true 5 - Passes: 3442, count: 78498, Valid: true 6 - Passes: 3450, count: 78498, Valid: true 7 - Passes: 3445, count: 78498, Valid: true 8 - Passes: 3438, count: 78498, Valid: true 9 - Passes: 3447, count: 78498, Valid: true

cospaiaOP4y ago

UPDATING again with that the solution is found, thanks to user CraigJPerry. The blog post is updated with some details: https://blog.agical.se/en/posts/java-bitset-performance-myst...

rr8084y ago

Up to a few years ago java couldn't see the correct size of memory allocated to the container. https://blog.softwaremill.com/docker-support-in-new-java-8-f...

cospaiaOP4y ago

Updating this to thank you all for this amazing help!

I've been mostly afk, and haven't been able to act on all the clues yet, but I have conducted some of the experiments and tried to summarize them in a reproducible way here: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

secondcoming4y ago

Are you testing on a x64 laptop? It could possibly be CPU throttling? Maybe try run the benchmark written in a non-JVM language to see if you get the same beahviour?

cospaiaOP4y ago

It's a mini-pc. Could be throttling. There are benchmarks written in tons of languages here: https://github.com/PlummersSoftwareLLC/Primes I've so far only seen this behaviour with the JVM involved.

mgaudet4y ago

Would also be interesting to see if this is a Java issue, or a JVM implementation issue: ie, does an OpenJ9 based JVM (Available under the weird IBM Semeru name here [1]) have similar behaviour?

[1]: https://developer.ibm.com/languages/java/semeru-runtimes/

cospaiaOP4y ago

A datapoint I just discovered. On JDK7 the performance is stable at the 0.5X level. It starts to toggle between 0.5X and 1X on JDK8.

toomanydoubts4y ago

Could it be that "0.5x" speed is actually 1x, and you're seeing your algorithm run at 2x sometimes on JDK8+ due to some optimization?

(Not a JVM user, so take with a grain of salt).

2 more replies

simpsond4y ago

Are you certain it’s the jvm? Have you tried this with another lang/vm?

cospaiaOP4y ago

Not certain. But there are a lot of implementations very similar to mine in tons of other languages and vms. None display this afaik. Also I get different behavior with JVM/JDK 7.

jbaiter4y ago· 4 in thread

Have you tried setting up a JMH benchmark? This should allow you to see if the JIT is the cause of your slowdowns. Also, running it under a profiler (I recommend async-profiler[1]) should give you a good idea of where the slowdown occurs which might help you pin it down further.

[1] https://github.com/jvm-profiling-tools/async-profiler

pvg4y ago

This is probably the first thing the author should try. Just a quick look at their own harness, it's sort of trying to guess what 'warmup' is required. The JIT kicks in at invocation counts, among other things, rather than time so the warmup on differently performing hardware or JVM implementations might be doing different things. JMH's purpose is to take that guesswork out of exactly that kind of microbenchmark.

winrid4y ago

Would love to see the output of this if someone does it.

Sometimes things like this are caused by your code being "un-jitted", in the event some criteria is met, like an exception being thrown for example (although BitSet does not look to do that internally).

cospaiaOP4y ago

Thanks! I'll check JMH out now. Some week ago, while I thought the problem was todo with something in Clojure - https://twitter.com/pappapez/status/1486339518902984707?s=20... - I ran async-profiler via clj-async-profiler. It turned out to not be granular enough. OTOH, the problem had different clues then and I should try it again now with the latest Java-only repro.

nqzero4y ago

i've seen similar performance deltas from run to run for both jmh and non-jmh benchmarks, though i don't remember any of them being a 100% delta. very annoying because it means you need to wrap the benchmark in a bash script and average over a dozen+ invocations of java to get semi-meaningful results

Snoozus4y ago· 4 in thread

Would love to reproduce this, please post the full specs of the machine and os!

cospaiaOP4y ago

This machine https://www.dustin.se/product/5011255694/mini-pc-pn51

Ubuntu, latest LTS.

I bought the machine for this purpose. Guessing it was sufficiently similar to the machines running the benchmarks where I stumbled across the issue: https://plummerssoftwarellc.github.io/PrimeView/?rc=30&sc=dt...

rocketChair4y ago

Am I reading this right, you bought a €730 machine just to investigate why 20 lines of code occasionally runs slow?

1 more reply

jstaph4y ago

AMD Ryzen 9 5900X - 64GB PC4-28800 C18

Every run passed through a different core.

  0 - Passes: 5970, count: 78498, Valid: true
  1 - Passes: 5768, count: 78498, Valid: true
  2 - Passes: 5763, count: 78498, Valid: true
  3 - Passes: 5762, count: 78498, Valid: true
  4 - Passes: 5812, count: 78498, Valid: true
  5 - Passes: 5743, count: 78498, Valid: true
  6 - Passes: 5721, count: 78498, Valid: true
  7 - Passes: 5737, count: 78498, Valid: true
  8 - Passes: 5754, count: 78498, Valid: true
  9 - Passes: 5768, count: 78498, Valid: true

cospaiaOP4y ago

What is meant by that the runs are passed through a different core?

renewiltord4y ago· 3 in thread

What happens if you invert the test condition? That is, you run it 10k times and then see how long that took rather than for X time and see how many times you could do it?

You’re using System.getCurrentTimeMillis() which should be w fast. My first thought was if you were using Instant and then sometimes there’s one call to VM.getNanoTimeAdjustment and sometimes two.

Man, this is a tough one. I’ll try when I’m home.

cospaiaOP4y ago

I added this experiment, and the results (spoiler, it isn't getCurrentTimeMillis) here: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

renewiltord4y ago

I suppose it was a bit of an unlikely idea. Thanks for following up.

cospaiaOP4y ago

Thanks! I’ll try this when I’m at the computer.

CraigJPerry4y ago· 2 in thread

My gut reaction is that this[1] looks like a NUMA issue. The flip-flop performance profile immediately reminds me of a very gnarly prod issue i lead the investigation of. I’d want to eliminate NUMA first from my investigation.

You can use “numactl --cpunodebind=0 --membind=0 <your original java benchmark command>” to pin the JVM to a particular NUMA node and immediately discount my notion as bunkum.

[1] all the reasons mentioned that make me think numa:

  1. The machine affected has multiple NUMA domains
  2. The M1 is a single NUMA domain chip and does not exhibit the behaviour 
  3. The real thing making me think this - the consistent performance flip flop is close to the ~40% performance overhead of accessing remote memory vs local that i’ve seen previously. You’re seeing higher overheads but your code is a much tighter loop of memory access than my previous experience so that could explain the difference i think.

cospaiaOP4y ago

This seems to be the issue. Running it 10 times gives me quite consistent performance near the top.

    0 - Passes: 5876, count: 78498, Valid: true
    1 - Passes: 5870, count: 78498, Valid: true
    2 - Passes: 5944, count: 78498, Valid: true
    3 - Passes: 5903, count: 78498, Valid: true
    4 - Passes: 6051, count: 78498, Valid: true
    5 - Passes: 5874, count: 78498, Valid: true
    6 - Passes: 5902, count: 78498, Valid: true
    7 - Passes: 6041, count: 78498, Valid: true
    8 - Passes: 5906, count: 78498, Valid: true
    9 - Passes: 5889, count: 78498, Valid: true

Will run it 100 times now to lower the risk that this is just a coincidence.

CraigJPerry4y ago

Ahh nice!

The thing that blew my mind when i ran into this was that the jvm had actually been NUMA aware for many years at that point but it was all turned off by default. This was jdk8 days so things are probably better now.

After i figured the root cause, I spent about a day logging results in excel as i tested various GC options, the only saving grace was i recall the Oracle documentation for GC options and their NUMA effects was pretty good. That optimisation time was really well spent though, i got meaningful improvements in both service latency and throughput.

1 more reply

nsajko4y ago· 2 in thread

First make sure there are no user or kernel tasks that may be hogging resources sometimes. Maybe even try disabling some peripherals and similar, at the hardware or kernel level.

Lots of stuff you could try then:

* disabling ASLR

* disabling Turbo Boost and similar CPU settings

* changing the CPU performance scaling governor (from "powersave" to "performance"): printf performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor

* run the code under a real-time scheduling policy (like SCHED_FIFO) with the highest priority. If you do try this, you need to also enable actual real-time scheduling by writing "-1" to /proc/sys/kernel/sched_rt_runtime_us.

But modern CPUs are not predictable in their performance, that's why microcontrollers are usually used for hard real time requirements. So I doubt you'll ever be able to get absolutely consistent performance across all benchmark runs.

I played similar benchmarking games myself before, and it turns out that, although I did most of the stuff described above, and my code was C++ (no JIT), big slowdowns do happen with some inevitable but predictable frequency.

exikyut4y ago

On a related note, also try disabling mitigations... if that's actually a relevant thing to do (somewhat out of the loop - maybe they're off by default or maybe the overhead is minimal now, not sure).

nsajko4y ago

Disabling mitigations could improve performance, but I'm not sure whether it would change how the performance distribution looks.

So, in my experience the distribution is a finite set of pulses, equivalently, the performance largely falls into some predefined level, where the fastest/best performance level is usually the most likely, and the levels roughly get less likely the worse/slower they are.

So what I'm saying is that disabling hw-vuln may make all performance levels faster, but I doubt it would eliminate many of them. I'll for sure play around with that some day, though.

Linux hw-vuln docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/

mibsl4y ago· 2 in thread

Just a guess, but it kind of sounds like machine code loop alignment could be the cause. Modern CPUs really like their jump targets 32 byte aligned.

rocketChair4y ago

Aligned branch and jump targets are just for the sake of maximising instruction cache hit-rate. It will always remain a micro-optimisation and will not make a difference in this and most other cases.

mibsl4y ago

To maximize I$ hit rate you'd actually want to disable loop alignment to not inflate code size with padding.

Here's an example of unstable benchmark performance caused by the lack of code alignment in .NET's JIT: https://github.com/dotnet/runtime/issues/43227

kgeist4y ago· 2 in thread

My first thought was that it's a bug in the deoptimizer, i.e. the JIT compiler dynamically switches back to a deoptimized form to throw away invalid optimizations (or so it it thinks) to apply new optimizations which are more relevant to the current load/usage patterns. [0]

I think I've seen a bug report once about this deoptimization/optimization process happening in an infinite loop, but why would it happen only under Docker on Linux, but not Mac? Perhaps, the deoptimizer relies on heuristics which depend on the current environment.

[0] https://stackoverflow.com/a/20542675

richdougherty4y ago

I agree. Or rather, something to do with the JIT generally. I'd look into dumping JIT info or the assembly generated.

kaba04y ago

Maybe try it with the JIT disabled. My guess though is that it is related to which core the process starts on, unrelated to Java itself.

cbsmith4y ago· 2 in thread

It'd be funny if this was just effects from hyperthreading...

hashmash4y ago

I think this might be it. When I run with the epsilon GC, and run with taskset to force just one CPU core, the performance is consistently bad. When I instruct taskset to use two logical cores, the performance is good. This is a single-threaded test, so why would it need two logical CPU cores? I'm testing with a Ryzen 7 1700.

Edit: Upon further testing, changing the taskset after the process launches has no effect. The JVM seems to be looking at the number of CPU cores available when it starts, and then it uses that for making compilation decisions.

cbsmith4y ago

Yes... Have you tried disabling Hyperthreading?

srosenberg4y ago· 2 in thread

So far my attempts to reproduce the alleged performance degradation have not been fruitful. I've written up a fairly detailed gist [1] on how to get CPU performance metrics; appendix also has a dump of C1 and C2 compiled methods (useful for comparison). I also ran on 2-node NUMA; binding cpu and memory to different nodes didn't yield a repro. either.

[1] https://gist.github.com/srosenberg/41611d5f40cfcbad51aa27eb0...

cospaiaOP4y ago

I can't bind cpu and memory different.

  $ numactl --hardware
    available: 1 nodes (0)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  node 0 size: 15395 MB
  node 0 free: 2542 MB
  node distances:
  node   0 
    0:  10 


  $ numactl --cpunodebind=0 --membind=1 java PrimeSieve
  libnuma: Warning: node argument 1 is out of range

cospaiaOP4y ago

I love the detail with which you describe the experiments. Let's see what I get running this on my Linux box!

lokedhs4y ago· 2 in thread

This issue isn't directly related to BitSet. I have observed the same thing in my programming language interpreter that runs on the JVM (well, it's written in Kotlin multiplatform so it runs on JS and Natively as well).

I start the interpreter and measue the time it takes to all all then numbers below 1000000000.

The first time I run it after starting the interpreter it always takes 1.4 seconds (within 0.1 second precision). The second time I measure the time it takes 1.7, and for every invocation following that it takes 2 seconds.

If I stop the interpreter and try again, I get exactly the same result.

I have not been able to explain this behaviour. This is on OpenJDK 11 by the way.

If anyone wants to test this, just run the interpreter from here: https://github.com/lokedhs/array

To run the benchmark, type the following command in the UI:

    time:measureTime { +/⍳1000000000 }

My current best guess is that the optimiser decides to recompile the bytecode to improve performance, but the recompiled version actually ends up being slower.

jffhn4y ago

>the optimiser decides to recompile the bytecode to improve performance, but the recompiled version actually ends up being slower.

I had that happen while developing a fast "exp(double)" implementation: for some reason casting a double into an int can be very slow (at least on some architectures) if the double is not close to int range, so I took care to put each of two (exclusive) occurrences of the statement "int i = (int)foo;" after special cases dealing with huge values, but the JIT reworked the code by factoring them above the special cases, making the code much slower for huge values. My first workaround was to replace one of the "int i = (int)foo;" by "int i = -(int)-foo;", which behaved the same in practice but was not factored out by the JIT.

lokedhs4y ago

Thanks. I'd guess something similar happens here. However, I doubt it's they same case as you, since there are no floating point operations involved here.

How's were you able to figure it out? Where should I start looking?

1 more reply

archi424y ago· 2 in thread

A lot of your time is spent on the branching and on accessing the memory. So on the ASM level, memory access patterns, caching and branch prediction will affect your performance.

My bet is on the branch predictor, since IIRC AMD has a novel branch predictor that's pretty different from the Intel branch predictor (not sure about M1): In C-land you should try loop unrolling (in fact a decent compiler will do that for you). If the JVM has a control for that, force it; else do it manually and pray the JIT does the right thing.

My first intuition was the BitSet's cache & memory behavior, which might also be the case for some ranges of `p`: Internally the BitSet is probably something like a chunk of memory with bitwise indexing. So to change a bit, the machine has to load a bunch of bytes into a register, set the bit, and write that register then back to memory. This is bad(*) if you want to set e.g. bits 0 to 31 in your BitSet, because you now got 64 memory accesses instead of two. This might affect you for small p, but with p >= 64 the access stride will be larger than 64. Thinking about it, in fact, this could be something that might throw of a naive JIT optimizer. I would have to think a little bit on how to improve here were your code written in C, with the JVM I have no idea how you could optimize here. Maybe do two loops, first for p<=64 and the other for p>64.

Regarding cache misses, hm, 1M bits are just shy of 64kByte. On any modern machine that should fit into the cache; I would still monitor for cache misses though.

(*) Short story: I have a light case of PTSD since I once had to reimplement the storage backend for a Java SQL engine university project because a friend was trying to be too clever and did exactly that.

Anyway, interesting question and I wish you best of luck with figuring out what's going on there :)

//edited a few times; sorry, I'm bad at writing a comment in on go...

cospaiaOP4y ago

Thanks. I'll work a bit with unpacking all this. =)

Meanwhile. I just published an experiment where I have switched out the BitSet for an array of booleans. Just to demonstrate my observations that then the performance is perfectly stable. https://github.com/PEZ/ghost-chase-condition/tree/master/tes... Does that affect your analysis in any way?

archi424y ago

I don't know the JIT internals, but just ignoring them I'd point fingers on the setBit(x) = "load byte from mem into reg, set some bit in reg, write back reg to mem". Now it's just a "write byte to memory", which is much faster at the expense of using more storage (factor 32 or 64, probably 64). Though I am surprised caching doesn't mitigate this.

stock_toaster4y ago· 2 in thread

There have been issues[1] with seccomp. Maybe try with seccomp disabled for that container?

  --security-opt seccomp:unconfined

More info here[2].

[1]: https://github.com/docker/for-linux/issues/738

[2]: http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypytho...

viraptor4y ago

seccomp affects syscalls only, which this code doesn't really use. May be worth checking just in case, but it's extremely unlikely that's the issue.

stock_toaster4y ago

I thought it also turned on/off spectre/meltdown mitigations (impacts branch prediction) on vulnerable cpus.

1 more reply

anonu4y ago· 2 in thread

Garbage collection ?

hashmash4y ago

I ran the test myself with different collectors, including epsilon (which does no GC at all) and was able to reproduce the problem in all cases.

cospaiaOP4y ago

Awesome. Would you consider adding this experiment to the repo? https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

Or, if that's a too big ask, just give me some hints how to do it, and i'll update the repo.

cplusplusfellow4y ago· 1 in thread

Got to be honest. This sort of thing is why I dropped Java and went to Golang (with all its warts).

kaba04y ago

From what I can see so far it doesn’t seem related to java itself at all. What other “these sort of things” do you have in mind? Java has the best observability tools, so if anything, you will have a harder time debugging performance regressions with go.

ww5204y ago

Seeing that the benchmarks are running inside Docker, would it be Docker related? Does Docker throttle CPU on different machines differently?

Check the temperature of the CPU. Modern CPU will slow down when it runs too hot. Also does anti-virus got run from time to time? Does expensive backup or expensive telemetry got run during the benchmarks?

Reading the blog again and seeing the results are dropped to the exact same level step wise, it really looks like the CPU has been throttled or some quota limit has been triggered.

MattPalmer10864y ago

I echo the earlier comments to use jmh for benchmarking. There's lots of subtle issues that jmh solves.

One thing I notice is that your sieve run doesn't return any values. Java can optimise out code that produces no effect in some circumstances.

EDIT: Although in that case, you'd expect to see it run insanely fast. Anyway, the point stands, there's lots of non obvious issues in benchmarking.

sk5t4y ago

It could be that something else is contending for L1/L2 cache.

As others have mentioned, take JMH for a spin. Benchmark after a few hundred thousand iterations warmup.

Also as mentioned here, thermal throttling could have a big impact. Maybe you have access to a desktop Xeon or similar?

dnautics4y ago

I ran into a similar problem with docker on the same problem in zig; podman and on-metal had no performance regression but docker did. I kind of gave the fuck up on the plummer's sieve shootout at that point because (of this and other performance regressions I was finding, like CPU throttling) I felt like I was fighting the stupid rules of the contest more than I was discovering things about performance.

Anyways for the authors: try running it in podman and see if it eliminates the perf regression

shadowmatter4y ago

Every other theory listed here is far more likely, but I would try changing your loop from using System.currentTimeMillis() to using System.nanoTime(). It's a higher-resolution time source that has no relation to wall clock time but is more appropriate for timings. Classes like Stopwatch from Google's Guava use it.

biehl4y ago

How do you calculate the time?

Windows had famously low timer resolution on java for a while. What happens if you run each round for increasingly longer periods?

j / k navigate · click thread line to collapse

86 comments

72 comments · 21 top-level

cospaiaOP4y ago· 19 in thread

Arcuru4y ago

Do you have performance traces of good and bad runs to compare? Instead of trying to come up with more theories you should start by looking at where the time is being spent.

I'm not experienced with doing micro-optimisations in Java, but I assume you can profile it to find out which individual operations are taking up the time just via a sampling trace.

Given it's common on a particular OS setup, I wouldn't be surprised if it's in the System.currentTimeMillis() call that you're doing every loop.

cospaiaOP4y ago

Yes, I should try to profile each run and compare the profiles of the bad and good ones.

Would be quite funny if it's in the System.currentTimeMillis() call this happens. I'd try to figure out a way to remove that from the picture...

1 more reply

cospaiaOP4y ago

I've now made an experiment that removes System.currentTimeMillis() as a suspect: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

dehrmann4y ago

> System.currentTimeMillis() is pretty platform-dependent, so it's a good place to look

benmmurphy4y ago

you could try running it with the linux perf command.

`perf stat -d java ...`

if you are running it within docker you might need to use: ` --security-opt seccomp:unconfined --cap-add SYS_PTRACE --cap-add CAP_SYS_ADMIN`

fulafel4y ago

This would be my first go-to as well. See if you are retiring less instructions, and if yes, see if the counters tell what the CPU is stalling on vs the faster runs.

plasma4y ago

I’m not a Java expert, but potentially due to the CPU/memory OS scheduler?

Try running the workload with pinned CPUs assigned and not across NUMA nodes.

myrryr4y ago

Try running the workload with pinned CPUs assigned ^^^--- it will be this.

I've run into the same weirdness on other things and this always solves it. Some cores are better at some things than others.

1 more reply

rvr_4y ago

I think you Ryzen CPU is being throttled. It is not able to sustain it's burst speed. You should keep an eye on CPU temp and clock.

My benchmarks on hardware that have less single thread power than yours:

Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz - Linux - Java 8

Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz - Linux - Java 18

cospaiaOP4y ago

UPDATING again with that the solution is found, thanks to user CraigJPerry. The blog post is updated with some details: https://blog.agical.se/en/posts/java-bitset-performance-myst...

rr8084y ago

Up to a few years ago java couldn't see the correct size of memory allocated to the container. https://blog.softwaremill.com/docker-support-in-new-java-8-f...

cospaiaOP4y ago

Updating this to thank you all for this amazing help!

secondcoming4y ago

Are you testing on a x64 laptop? It could possibly be CPU throttling? Maybe try run the benchmark written in a non-JVM language to see if you get the same beahviour?

cospaiaOP4y ago

It's a mini-pc. Could be throttling. There are benchmarks written in tons of languages here: https://github.com/PlummersSoftwareLLC/Primes I've so far only seen this behaviour with the JVM involved.

mgaudet4y ago

Would also be interesting to see if this is a Java issue, or a JVM implementation issue: ie, does an OpenJ9 based JVM (Available under the weird IBM Semeru name here [1]) have similar behaviour?

[1]: https://developer.ibm.com/languages/java/semeru-runtimes/

cospaiaOP4y ago

A datapoint I just discovered. On JDK7 the performance is stable at the 0.5X level. It starts to toggle between 0.5X and 1X on JDK8.

toomanydoubts4y ago

Could it be that "0.5x" speed is actually 1x, and you're seeing your algorithm run at 2x sometimes on JDK8+ due to some optimization?

(Not a JVM user, so take with a grain of salt).

2 more replies

simpsond4y ago

Are you certain it’s the jvm? Have you tried this with another lang/vm?

cospaiaOP4y ago

Not certain. But there are a lot of implementations very similar to mine in tons of other languages and vms. None display this afaik. Also I get different behavior with JVM/JDK 7.

jbaiter4y ago· 4 in thread

[1] https://github.com/jvm-profiling-tools/async-profiler

pvg4y ago

winrid4y ago

Would love to see the output of this if someone does it.

cospaiaOP4y ago

nqzero4y ago

Snoozus4y ago· 4 in thread

Would love to reproduce this, please post the full specs of the machine and os!

cospaiaOP4y ago

This machine https://www.dustin.se/product/5011255694/mini-pc-pn51

Ubuntu, latest LTS.

rocketChair4y ago

Am I reading this right, you bought a €730 machine just to investigate why 20 lines of code occasionally runs slow?

1 more reply

jstaph4y ago

AMD Ryzen 9 5900X - 64GB PC4-28800 C18

Every run passed through a different core.

  0 - Passes: 5970, count: 78498, Valid: true
  1 - Passes: 5768, count: 78498, Valid: true
  2 - Passes: 5763, count: 78498, Valid: true
  3 - Passes: 5762, count: 78498, Valid: true
  4 - Passes: 5812, count: 78498, Valid: true
  5 - Passes: 5743, count: 78498, Valid: true
  6 - Passes: 5721, count: 78498, Valid: true
  7 - Passes: 5737, count: 78498, Valid: true
  8 - Passes: 5754, count: 78498, Valid: true
  9 - Passes: 5768, count: 78498, Valid: true

cospaiaOP4y ago

What is meant by that the runs are passed through a different core?

renewiltord4y ago· 3 in thread

What happens if you invert the test condition? That is, you run it 10k times and then see how long that took rather than for X time and see how many times you could do it?

Man, this is a tough one. I’ll try when I’m home.

cospaiaOP4y ago

I added this experiment, and the results (spoiler, it isn't getCurrentTimeMillis) here: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

renewiltord4y ago

I suppose it was a bit of an unlikely idea. Thanks for following up.

cospaiaOP4y ago

Thanks! I’ll try this when I’m at the computer.

CraigJPerry4y ago· 2 in thread

You can use “numactl --cpunodebind=0 --membind=0 <your original java benchmark command>” to pin the JVM to a particular NUMA node and immediately discount my notion as bunkum.

[1] all the reasons mentioned that make me think numa:

  1. The machine affected has multiple NUMA domains
  2. The M1 is a single NUMA domain chip and does not exhibit the behaviour 
  3. The real thing making me think this - the consistent performance flip flop is close to the ~40% performance overhead of accessing remote memory vs local that i’ve seen previously. You’re seeing higher overheads but your code is a much tighter loop of memory access than my previous experience so that could explain the difference i think.

cospaiaOP4y ago

This seems to be the issue. Running it 10 times gives me quite consistent performance near the top.

    0 - Passes: 5876, count: 78498, Valid: true
    1 - Passes: 5870, count: 78498, Valid: true
    2 - Passes: 5944, count: 78498, Valid: true
    3 - Passes: 5903, count: 78498, Valid: true
    4 - Passes: 6051, count: 78498, Valid: true
    5 - Passes: 5874, count: 78498, Valid: true
    6 - Passes: 5902, count: 78498, Valid: true
    7 - Passes: 6041, count: 78498, Valid: true
    8 - Passes: 5906, count: 78498, Valid: true
    9 - Passes: 5889, count: 78498, Valid: true

Will run it 100 times now to lower the risk that this is just a coincidence.

CraigJPerry4y ago

Ahh nice!

1 more reply

nsajko4y ago· 2 in thread

First make sure there are no user or kernel tasks that may be hogging resources sometimes. Maybe even try disabling some peripherals and similar, at the hardware or kernel level.

Lots of stuff you could try then:

* disabling ASLR

* disabling Turbo Boost and similar CPU settings

* changing the CPU performance scaling governor (from "powersave" to "performance"): printf performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor

exikyut4y ago

nsajko4y ago

Disabling mitigations could improve performance, but I'm not sure whether it would change how the performance distribution looks.

So what I'm saying is that disabling hw-vuln may make all performance levels faster, but I doubt it would eliminate many of them. I'll for sure play around with that some day, though.

Linux hw-vuln docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/

mibsl4y ago· 2 in thread

Just a guess, but it kind of sounds like machine code loop alignment could be the cause. Modern CPUs really like their jump targets 32 byte aligned.

rocketChair4y ago

Aligned branch and jump targets are just for the sake of maximising instruction cache hit-rate. It will always remain a micro-optimisation and will not make a difference in this and most other cases.

mibsl4y ago

To maximize I$ hit rate you'd actually want to disable loop alignment to not inflate code size with padding.

Here's an example of unstable benchmark performance caused by the lack of code alignment in .NET's JIT: https://github.com/dotnet/runtime/issues/43227

kgeist4y ago· 2 in thread

[0] https://stackoverflow.com/a/20542675

richdougherty4y ago

I agree. Or rather, something to do with the JIT generally. I'd look into dumping JIT info or the assembly generated.

kaba04y ago

Maybe try it with the JIT disabled. My guess though is that it is related to which core the process starts on, unrelated to Java itself.

cbsmith4y ago· 2 in thread

It'd be funny if this was just effects from hyperthreading...

hashmash4y ago

cbsmith4y ago

Yes... Have you tried disabling Hyperthreading?

srosenberg4y ago· 2 in thread

[1] https://gist.github.com/srosenberg/41611d5f40cfcbad51aa27eb0...

cospaiaOP4y ago

I can't bind cpu and memory different.

  $ numactl --hardware
    available: 1 nodes (0)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  node 0 size: 15395 MB
  node 0 free: 2542 MB
  node distances:
  node   0 
    0:  10 


  $ numactl --cpunodebind=0 --membind=1 java PrimeSieve
  libnuma: Warning: node argument 1 is out of range

cospaiaOP4y ago

I love the detail with which you describe the experiments. Let's see what I get running this on my Linux box!

lokedhs4y ago· 2 in thread

I start the interpreter and measue the time it takes to all all then numbers below 1000000000.

If I stop the interpreter and try again, I get exactly the same result.

I have not been able to explain this behaviour. This is on OpenJDK 11 by the way.

If anyone wants to test this, just run the interpreter from here: https://github.com/lokedhs/array

To run the benchmark, type the following command in the UI:

    time:measureTime { +/⍳1000000000 }

My current best guess is that the optimiser decides to recompile the bytecode to improve performance, but the recompiled version actually ends up being slower.

jffhn4y ago

>the optimiser decides to recompile the bytecode to improve performance, but the recompiled version actually ends up being slower.

lokedhs4y ago

Thanks. I'd guess something similar happens here. However, I doubt it's they same case as you, since there are no floating point operations involved here.

How's were you able to figure it out? Where should I start looking?

1 more reply

archi424y ago· 2 in thread

A lot of your time is spent on the branching and on accessing the memory. So on the ASM level, memory access patterns, caching and branch prediction will affect your performance.

Regarding cache misses, hm, 1M bits are just shy of 64kByte. On any modern machine that should fit into the cache; I would still monitor for cache misses though.

Anyway, interesting question and I wish you best of luck with figuring out what's going on there :)

//edited a few times; sorry, I'm bad at writing a comment in on go...

cospaiaOP4y ago

Thanks. I'll work a bit with unpacking all this. =)

archi424y ago

stock_toaster4y ago· 2 in thread

There have been issues[1] with seccomp. Maybe try with seccomp disabled for that container?

  --security-opt seccomp:unconfined

More info here[2].

[1]: https://github.com/docker/for-linux/issues/738

[2]: http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypytho...

viraptor4y ago

seccomp affects syscalls only, which this code doesn't really use. May be worth checking just in case, but it's extremely unlikely that's the issue.

stock_toaster4y ago

I thought it also turned on/off spectre/meltdown mitigations (impacts branch prediction) on vulnerable cpus.

1 more reply

anonu4y ago· 2 in thread

Garbage collection ?

hashmash4y ago

I ran the test myself with different collectors, including epsilon (which does no GC at all) and was able to reproduce the problem in all cases.

cospaiaOP4y ago

Awesome. Would you consider adding this experiment to the repo? https://github.com/PEZ/ghost-chase-condition/tree/master/tes...

Or, if that's a too big ask, just give me some hints how to do it, and i'll update the repo.

cplusplusfellow4y ago· 1 in thread

Got to be honest. This sort of thing is why I dropped Java and went to Golang (with all its warts).

kaba04y ago

ww5204y ago

Seeing that the benchmarks are running inside Docker, would it be Docker related? Does Docker throttle CPU on different machines differently?

Reading the blog again and seeing the results are dropped to the exact same level step wise, it really looks like the CPU has been throttled or some quota limit has been triggered.

MattPalmer10864y ago

I echo the earlier comments to use jmh for benchmarking. There's lots of subtle issues that jmh solves.

One thing I notice is that your sieve run doesn't return any values. Java can optimise out code that produces no effect in some circumstances.

EDIT: Although in that case, you'd expect to see it run insanely fast. Anyway, the point stands, there's lots of non obvious issues in benchmarking.

sk5t4y ago

It could be that something else is contending for L1/L2 cache.

As others have mentioned, take JMH for a spin. Benchmark after a few hundred thousand iterations warmup.

Also as mentioned here, thermal throttling could have a big impact. Maybe you have access to a desktop Xeon or similar?

dnautics4y ago

Anyways for the authors: try running it in podman and see if it eliminates the perf regression

shadowmatter4y ago

biehl4y ago

How do you calculate the time?

Windows had famously low timer resolution on java for a while. What happens if you run each round for increasingly longer periods?

j / k navigate · click thread line to collapse