I'm not experienced with doing micro-optimisations in Java, but I assume you can profile it to find out which individual operations are taking up the time just via a sampling trace.
Given it's common on a particular OS setup, I wouldn't be surprised if it's in the System.currentTimeMillis() call that you're doing every loop.
Would be quite funny if it's in the System.currentTimeMillis() call this happens. I'd try to figure out a way to remove that from the picture...
`perf stat -d java ...`
if your CPU supports performance counters it should give you L1 cache misses/branch misses/etc. that might give you some insight into what is different between the runs. someone else mentioned it could be the memory alignment. i think java might allocate with 8 byte alignment and maybe something funny goes on with the L1-caching if the bitset allocation is not 16 byte or 32 byte aligned.
if you are running it within docker you might need to use: ` --security-opt seccomp:unconfined --cap-add SYS_PTRACE --cap-add CAP_SYS_ADMIN`
Try running the workload with pinned CPUs assigned and not across NUMA nodes.
I've run into the same weirdness on other things and this always solves it. Some cores are better at some things than others.
My benchmarks on hardware that have less single thread power than yours:
Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz - Linux - Java 8
0 - Passes: 3200, count: 78498, Valid: true 1 - Passes: 3197, count: 78498, Valid: true 2 - Passes: 3200, count: 78498, Valid: true 3 - Passes: 3202, count: 78498, Valid: true 4 - Passes: 3207, count: 78498, Valid: true 5 - Passes: 3207, count: 78498, Valid: true 6 - Passes: 3202, count: 78498, Valid: true 7 - Passes: 3205, count: 78498, Valid: true 8 - Passes: 3209, count: 78498, Valid: true 9 - Passes: 3178, count: 78498, Valid: true
Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz - Linux - Java 18
0 - Passes: 3445, count: 78498, Valid: true 1 - Passes: 3443, count: 78498, Valid: true 2 - Passes: 3408, count: 78498, Valid: true 3 - Passes: 3449, count: 78498, Valid: true 4 - Passes: 3439, count: 78498, Valid: true 5 - Passes: 3442, count: 78498, Valid: true 6 - Passes: 3450, count: 78498, Valid: true 7 - Passes: 3445, count: 78498, Valid: true 8 - Passes: 3438, count: 78498, Valid: true 9 - Passes: 3447, count: 78498, Valid: true
I've been mostly afk, and haven't been able to act on all the clues yet, but I have conducted some of the experiments and tried to summarize them in a reproducible way here: https://github.com/PEZ/ghost-chase-condition/tree/master/tes...
[1]: https://developer.ibm.com/languages/java/semeru-runtimes/
(Not a JVM user, so take with a grain of salt).
Sometimes things like this are caused by your code being "un-jitted", in the event some criteria is met, like an exception being thrown for example (although BitSet does not look to do that internally).
Ubuntu, latest LTS.
I bought the machine for this purpose. Guessing it was sufficiently similar to the machines running the benchmarks where I stumbled across the issue: https://plummerssoftwarellc.github.io/PrimeView/?rc=30&sc=dt...
Every run passed through a different core.
0 - Passes: 5970, count: 78498, Valid: true
1 - Passes: 5768, count: 78498, Valid: true
2 - Passes: 5763, count: 78498, Valid: true
3 - Passes: 5762, count: 78498, Valid: true
4 - Passes: 5812, count: 78498, Valid: true
5 - Passes: 5743, count: 78498, Valid: true
6 - Passes: 5721, count: 78498, Valid: true
7 - Passes: 5737, count: 78498, Valid: true
8 - Passes: 5754, count: 78498, Valid: true
9 - Passes: 5768, count: 78498, Valid: trueYou’re using System.getCurrentTimeMillis() which should be w fast. My first thought was if you were using Instant and then sometimes there’s one call to VM.getNanoTimeAdjustment and sometimes two.
Man, this is a tough one. I’ll try when I’m home.
You can use “numactl --cpunodebind=0 --membind=0 <your original java benchmark command>” to pin the JVM to a particular NUMA node and immediately discount my notion as bunkum.
[1] all the reasons mentioned that make me think numa:
1. The machine affected has multiple NUMA domains
2. The M1 is a single NUMA domain chip and does not exhibit the behaviour
3. The real thing making me think this - the consistent performance flip flop is close to the ~40% performance overhead of accessing remote memory vs local that i’ve seen previously. You’re seeing higher overheads but your code is a much tighter loop of memory access than my previous experience so that could explain the difference i think. 0 - Passes: 5876, count: 78498, Valid: true
1 - Passes: 5870, count: 78498, Valid: true
2 - Passes: 5944, count: 78498, Valid: true
3 - Passes: 5903, count: 78498, Valid: true
4 - Passes: 6051, count: 78498, Valid: true
5 - Passes: 5874, count: 78498, Valid: true
6 - Passes: 5902, count: 78498, Valid: true
7 - Passes: 6041, count: 78498, Valid: true
8 - Passes: 5906, count: 78498, Valid: true
9 - Passes: 5889, count: 78498, Valid: true
Will run it 100 times now to lower the risk that this is just a coincidence.The thing that blew my mind when i ran into this was that the jvm had actually been NUMA aware for many years at that point but it was all turned off by default. This was jdk8 days so things are probably better now.
After i figured the root cause, I spent about a day logging results in excel as i tested various GC options, the only saving grace was i recall the Oracle documentation for GC options and their NUMA effects was pretty good. That optimisation time was really well spent though, i got meaningful improvements in both service latency and throughput.
Lots of stuff you could try then:
* disabling ASLR
* disabling Turbo Boost and similar CPU settings
* changing the CPU performance scaling governor (from "powersave" to "performance"): printf performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
* run the code under a real-time scheduling policy (like SCHED_FIFO) with the highest priority. If you do try this, you need to also enable actual real-time scheduling by writing "-1" to /proc/sys/kernel/sched_rt_runtime_us.
But modern CPUs are not predictable in their performance, that's why microcontrollers are usually used for hard real time requirements. So I doubt you'll ever be able to get absolutely consistent performance across all benchmark runs.
I played similar benchmarking games myself before, and it turns out that, although I did most of the stuff described above, and my code was C++ (no JIT), big slowdowns do happen with some inevitable but predictable frequency.
So, in my experience the distribution is a finite set of pulses, equivalently, the performance largely falls into some predefined level, where the fastest/best performance level is usually the most likely, and the levels roughly get less likely the worse/slower they are.
So what I'm saying is that disabling hw-vuln may make all performance levels faster, but I doubt it would eliminate many of them. I'll for sure play around with that some day, though.
Linux hw-vuln docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/
Here's an example of unstable benchmark performance caused by the lack of code alignment in .NET's JIT: https://github.com/dotnet/runtime/issues/43227
I think I've seen a bug report once about this deoptimization/optimization process happening in an infinite loop, but why would it happen only under Docker on Linux, but not Mac? Perhaps, the deoptimizer relies on heuristics which depend on the current environment.
Edit: Upon further testing, changing the taskset after the process launches has no effect. The JVM seems to be looking at the number of CPU cores available when it starts, and then it uses that for making compilation decisions.
[1] https://gist.github.com/srosenberg/41611d5f40cfcbad51aa27eb0...
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 15395 MB
node 0 free: 2542 MB
node distances:
node 0
0: 10
$ numactl --cpunodebind=0 --membind=1 java PrimeSieve
libnuma: Warning: node argument 1 is out of rangeI start the interpreter and measue the time it takes to all all then numbers below 1000000000.
The first time I run it after starting the interpreter it always takes 1.4 seconds (within 0.1 second precision). The second time I measure the time it takes 1.7, and for every invocation following that it takes 2 seconds.
If I stop the interpreter and try again, I get exactly the same result.
I have not been able to explain this behaviour. This is on OpenJDK 11 by the way.
If anyone wants to test this, just run the interpreter from here: https://github.com/lokedhs/array
To run the benchmark, type the following command in the UI:
time:measureTime { +/⍳1000000000 }
My current best guess is that the optimiser decides to recompile the bytecode to improve performance, but the recompiled version actually ends up being slower.I had that happen while developing a fast "exp(double)" implementation: for some reason casting a double into an int can be very slow (at least on some architectures) if the double is not close to int range, so I took care to put each of two (exclusive) occurrences of the statement "int i = (int)foo;" after special cases dealing with huge values, but the JIT reworked the code by factoring them above the special cases, making the code much slower for huge values. My first workaround was to replace one of the "int i = (int)foo;" by "int i = -(int)-foo;", which behaved the same in practice but was not factored out by the JIT.
How's were you able to figure it out? Where should I start looking?
My bet is on the branch predictor, since IIRC AMD has a novel branch predictor that's pretty different from the Intel branch predictor (not sure about M1): In C-land you should try loop unrolling (in fact a decent compiler will do that for you). If the JVM has a control for that, force it; else do it manually and pray the JIT does the right thing.
My first intuition was the BitSet's cache & memory behavior, which might also be the case for some ranges of `p`: Internally the BitSet is probably something like a chunk of memory with bitwise indexing. So to change a bit, the machine has to load a bunch of bytes into a register, set the bit, and write that register then back to memory. This is bad(*) if you want to set e.g. bits 0 to 31 in your BitSet, because you now got 64 memory accesses instead of two. This might affect you for small p, but with p >= 64 the access stride will be larger than 64. Thinking about it, in fact, this could be something that might throw of a naive JIT optimizer. I would have to think a little bit on how to improve here were your code written in C, with the JVM I have no idea how you could optimize here. Maybe do two loops, first for p<=64 and the other for p>64.
Regarding cache misses, hm, 1M bits are just shy of 64kByte. On any modern machine that should fit into the cache; I would still monitor for cache misses though.
(*) Short story: I have a light case of PTSD since I once had to reimplement the storage backend for a Java SQL engine university project because a friend was trying to be too clever and did exactly that.
Anyway, interesting question and I wish you best of luck with figuring out what's going on there :)
//edited a few times; sorry, I'm bad at writing a comment in on go...
Meanwhile. I just published an experiment where I have switched out the BitSet for an array of booleans. Just to demonstrate my observations that then the performance is perfectly stable. https://github.com/PEZ/ghost-chase-condition/tree/master/tes... Does that affect your analysis in any way?
--security-opt seccomp:unconfined
More info here[2].[1]: https://github.com/docker/for-linux/issues/738
[2]: http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypytho...
Or, if that's a too big ask, just give me some hints how to do it, and i'll update the repo.
Check the temperature of the CPU. Modern CPU will slow down when it runs too hot. Also does anti-virus got run from time to time? Does expensive backup or expensive telemetry got run during the benchmarks?
Reading the blog again and seeing the results are dropped to the exact same level step wise, it really looks like the CPU has been throttled or some quota limit has been triggered.
One thing I notice is that your sieve run doesn't return any values. Java can optimise out code that produces no effect in some circumstances.
EDIT: Although in that case, you'd expect to see it run insanely fast. Anyway, the point stands, there's lots of non obvious issues in benchmarking.
As others have mentioned, take JMH for a spin. Benchmark after a few hundred thousand iterations warmup.
Also as mentioned here, thermal throttling could have a big impact. Maybe you have access to a desktop Xeon or similar?
Anyways for the authors: try running it in podman and see if it eliminates the perf regression
Windows had famously low timer resolution on java for a while. What happens if you run each round for increasingly longer periods?