I haven't actually inspected the emitted bytecode, so I was only reasoning from the observed speedup.
Your point about branch prediction is really interesting; it would explain how the cast becomes almost free once the type is stable in the hot path.
I'm learning a lot from this thread -- thank you for pushing on the details!
A) Without the typecast, the compiler can’t prove anything about the type, so it has to assume a fully general type. That leaves a bytecode sequence in the middle of the hot path that's very hard for the JIT to work with - reflective dispatch that can’t be inlined or optimised.
B) With the typecast, the compiler can assume the type, and thus only needs to emit type guards (as suggested in this thread). I’d expect those guards to be hoisted through the function as far as possible - ideally the JIT lifts them all the way out of the loop, so they’re checked once rather than on every iteration. That leaves a much shorter sequence for getting the array length each time around the loop, without a type/class check every time (sketched below).
This would avoid pressuring the branch predictor.
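For concreteness, here's a minimal sketch of the two scenarios in Clojure. The function names and the long-array example are mine, not from the original benchmark:

```clojure
;; Report reflective call sites at compile time, so the two scenarios are easy to tell apart.
(set! *warn-on-reflection* true)

;; Scenario A: no hint on xs, so alength/aget can't be resolved statically;
;; they fall back to reflection on every iteration, and the arithmetic boxes.
(defn sum-unhinted [xs]
  (loop [i 0 acc 0]
    (if (< i (alength xs))
      (recur (inc i) (+ acc (aget xs i)))
      acc)))

;; Scenario B: the ^longs hint lets the compiler resolve alength/aget to their
;; long[] overloads, so the loop body is direct array access the JIT can
;; inline and optimise (including hoisting the guard out of the loop).
(defn sum-hinted [^longs xs]
  (loop [i 0 acc 0]
    (if (< i (alength xs))
      (recur (inc i) (+ acc (aget xs i)))
      acc)))
```

Calling both on the same `(long-array ...)` and comparing them with something like criterium's quick-bench is one way to see the gap; the reflection warnings emitted for the first definition are also a decent proxy for where the slow path is.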
Most JITs have “effort” thresholds that depend on the environment and on how hot a path is measured to be at runtime. The hotter the path, the more effort the JIT applies to optimising it (and usually the wider the scope of what it tries to optimise). But again, without seeing the assembly (not just the bytecode) that the three scenarios produce (unoptimised, optimised-in-test, optimised-in-prod), it’s hard to know for sure what’s going on.
At best we can speculate from experience with what these kinds of compilers tend to do.
But even as speculation, it shouldn't be that surprising that dynamic dispatch and reflection [0] are quite expensive compared to a cast and a direct read of the length property.
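For anyone who does want to look rather than speculate, one way (a sketch - the alias name is made up) is to run the benchmark with HotSpot's diagnostic output turned on, e.g. via a deps.edn alias, and to use `javap -c` on the AOT-compiled classes for the bytecode level:

```clojure
;; deps.edn (illustrative alias): turn on HotSpot's JIT logging.
;; -XX:+PrintAssembly additionally needs the hsdis disassembler installed;
;; without it, -XX:+PrintCompilation and -XX:+PrintInlining still show which
;; methods get compiled and inlined, and at what tier.
{:aliases
 {:jit-log
  {:jvm-opts ["-XX:+PrintCompilation"
              "-XX:+UnlockDiagnosticVMOptions"
              "-XX:+PrintInlining"
              "-XX:+PrintAssembly"]}}}
```

Running the hot loop with that alias enabled would show how the unoptimised and optimised runs differ at the JIT level, rather than inferring it from wall-clock times alone.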
To be honest, this is my first time really digging into performance on a JIT runtime. I learned to code as an astronomy researcher, and the training I received from my mentors was "write Python when possible, and C or Fortran when it needs to be fast." So I spent a lot of time writing C, and I didn't appreciate how aggressively something like HotSpot can optimize.
(I don't mean that as a dig against Python; it's simply the mental model I absorbed.)
The realization that I can have really good performance in a high-level language like Clojure is revolutionary for me.
I'm learning a ton from the comments here. Thanks to everyone sharing their knowledge -- it's genuinely appreciated.
I should try it out sometime. The Lisp family takes a bit of a mental reset to work with, but I've done it before.
> ...the training I received from my mentors was "write Python when possible, and C or Fortran when it needs to be fast."... (I don't mean that as a dig against Python; it's simply the mental model I absorbed.)
Well, you know, I've been using Python for over 20 years, and that really isn't a "dig" at all. Python is famously hard to optimize, even compared to other languages you might expect to be comparable. (Seriously, the current performance of JavaScript engines seems almost magical to me.) PyPy is the "JIT runtime" option there, and you can easily create micro-benchmarks where it beats the pants off the reference implementation (CPython, written in C with fairly dumb techniques). But on average the improvement is... well, still pretty good ("On average, PyPy is about 3 times faster than CPython 3.11. We currently support python 3.11 and 2.7"), but shrinking over time, and it's definitely not going to put you in the performance realm of native-compiled languages.
The problem is there's really just too much that can be changed at runtime. If you look at the differences between Python and competitors like Mojo, or at the restricted subsets and variants of Python used by things like Shedskin and Cython (and RPython, used internally by PyPy), you quickly get a sense of it.