
wahern · 8y ago

> Properly optimized JIT output should be orders of magnitude faster than interpreted bytecode

Say we had code like,

  a + b

Statically compiled code might look like

  load a from stack to register
  load b from stack to register
  add a and b

A VM would look like

  load opcode from state to register
  compute opcode address # nullop or two loads and index
  jump to opcode block
  load a from state to register
  load b from state to register
  add a and b

All of that is easily pipelined, especially by the very latest processors, which speculate through indirect jumps (which is why we have Spectre, etc.). The above is idealized, but I think it reflects fairly well how modern register-based software VMs work.
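As a rough illustration of the dispatch sequence above, here is a minimal sketch of a register-based interpreter loop in Python. The opcode names, the tuple instruction format, and `run` are all made up for this example and are not taken from any real VM.

```python
# Hypothetical opcodes for a toy register VM (illustrative only).
OP_ADD, OP_RET = 0, 1

def run(code, regs):
    """Interpret `code`, a list of (opcode, dst, src1, src2) tuples."""
    pc = 0
    while True:
        op, dst, a, b = code[pc]      # load opcode (and operands) from state
        pc += 1
        if op == OP_ADD:              # dispatch: select the opcode block
            x = regs[a]               # load a from state
            y = regs[b]               # load b from state
            regs[dst] = x + y         # add, then write the result back
        elif op == OP_RET:
            return regs[dst]
        else:
            raise ValueError(f"unknown opcode {op}")

# a + b with a in r0 and b in r1; the result goes to r2.
program = [(OP_ADD, 2, 0, 1), (OP_RET, 2, 0, 0)]
result = run(program, [2, 3, 0])
```

A real VM would typically replace the `if`/`elif` chain with a jump table or computed goto, which is where the "compute opcode address" step comes from.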

But when you have a JIT for a dynamically typed language, the entry and exit points of both interpreted sequences and JIT'd sequences require many more instructions to manage bookkeeping, exploding the cost. JIT'ing only works if you can compile blocks of code large enough that the benefits exceed the bookkeeping costs. But that's a tall order for dynamically typed languages where runtime mutations can invalidate JIT'd blocks at many points in a sequence, such as with prototype-based languages.

Getting "[p]roperly optimized JIT output" is the crux of the problem. It takes significant instrumentation and indirection to create and maintain "[p]roperly optimized JIT output". You can't compare the optimized machine code sequences to the analogous interpreted sequences, independent of the surrounding machinery.

Much of the performance benefit of statically compiled code isn't in execution, per se, but in the data structures. A language like Lua is constantly indexing hash tables[1] for even simple record objects, whereas in C you're usually doing direct memory references. But transforming hash table lookups in a dynamic language into direct memory references a la statically compiled C structs is extremely hard if not impossible. Engines like V8 manage to do it much of the time in the context of loading prototype methods, but for ad hoc runtime data structures I don't think they can optimize that at all.
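To make the contrast concrete, here is a small sketch of the two access patterns. It is only an analogy: Python itself goes through an interpreter for both, so the list with constant indices merely stands in for a C struct with fixed field offsets, and all names are hypothetical.

```python
import sys

# Dynamic-language style: a record is a hash table keyed by interned strings.
KEY_X = sys.intern("x")
KEY_Y = sys.intern("y")
point_table = {KEY_X: 1.5, KEY_Y: 2.5}

def length_sq_dynamic(p):
    # Every field access is a hash table probe on the key.
    return p[KEY_X] * p[KEY_X] + p[KEY_Y] * p[KEY_Y]

# Static style: field positions are known ahead of time, so an access is
# conceptually "base address + constant offset".
X_OFFSET, Y_OFFSET = 0, 1
point_struct = [1.5, 2.5]

def length_sq_static(p):
    return p[X_OFFSET] * p[X_OFFSET] + p[Y_OFFSET] * p[Y_OFFSET]
```

Both functions compute the same value; the difference is in how much work each field access implies once compiled.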

But if your code is primarily operating on, e.g., JSON trees, it wouldn't matter one way or another. If your statically compiled code isn't benefiting from direct memory addressing of data (as is the case with many types of applications) then statically compiled, JIT'd, and interpreted code can have similar runtime profiles, and in many cases you can't even be sure which will be faster in real-world systems.

[1] Lua has opcodes for this so the cost is fixed and small relative to raw C code doing the lookup. And strings in Lua are interned so lookup is usually as simple as a mask and direct index into an array.


arghwhat · 8y ago

> A VM would look like

    load opcode from state to register
    compute opcode address # nullop or two loads and index
    jump to opcode block
    load a from state to register
    load b from state to register
    add a and b

This example means that a simple addition requires several loads, a jump, and, due to the nature of such a VM, a store that you did not mention, so that the next opcode can use the result.

The JIT'ed version might end up as simple as:

    ADD ECX, EDX

Why? The JIT'd version takes care of reducing the inputs to simple platform integers, and can keep the values in registers between all computations within the function. For fully JIT'd methods, args and return values can be passed in registers, never needing a load, and if a parameter is a small constant, it can be inlined into the instruction (e.g. ADD EAX, 3). Function calls can also be inlined, and depending on the engine, arguments may be passed in registers to non-inlined JIT'd functions as well.

While the (extremely simplified!) VM example you gave can surely be pipelined well, it still has extremely suboptimal performance compared to a single "add" instruction (long jumping around is not free!), and wastes resources that could have been used to run the rest of the code. Instruction cache, micro-op cache and instruction decoder time are all very limited and valuable resources that would be wasted by such a VM.

Furthermore, in a real-world scenario, the VM version will be much worse, with potentially overloadable add operators and the like, effectively turning "a + b" into a mess that may involve walking inheritance chains to find an implementation of an "add" method.
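A minimal sketch of what such a generic add can look like, assuming a made-up object model where each object is a dict with `slots` and a `proto` link (none of this mirrors a specific language runtime):

```python
def find_method(obj, name):
    """Walk a hypothetical prototype chain looking for `name`."""
    while obj is not None:
        if name in obj["slots"]:
            return obj["slots"][name]
        obj = obj["proto"]
    return None

def generic_add(a, b):
    # Fast case: both operands are plain numbers.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    # Slow case: look for a user-defined "add" somewhere up the chain.
    method = find_method(a, "add") if isinstance(a, dict) else None
    if method is None:
        raise TypeError("operands do not support add")
    return method(a, b)

# `child` inherits its "add" from `base`.
base = {"slots": {"add": lambda a, b: a["slots"]["value"] + b}, "proto": None}
child = {"slots": {"value": 40}, "proto": base}
```

Here `generic_add(1, 2)` takes the fast case, while `generic_add(child, 2)` has to walk from `child` to `base` before it can do any arithmetic.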

The assumption that the JIT will need to have checks as a significant portion of the runtime for something like this would normally be false, unless the function indeed only implements "a+b". For a method-based JIT (i.e. jitting methods rather than arbitrary loops), any necessary checks will be at the beginning of the method as a one-time cost. These checks are usually quite basic, such as checking that an input is a primitive number (meaning no overloaded functionality or otherwise unexpected behavior), which is a simple compare. Assuming you do some longer math, the check becomes a very insignificant portion of the execution time.
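The placement of the check can be sketched roughly like this. The function names are hypothetical, and in Python both bodies are of course interpreted; the point is only that the guard runs once on entry, after which the body assumes plain integers.

```python
def poly_generic(a, b):
    # Stand-in for the general path that handles any operand types.
    return a * a + 2 * a * b + b * b

def poly_jitted(a, b):
    # Entry guard: one cheap type check per argument.
    if type(a) is not int or type(b) is not int:
        return poly_generic(a, b)   # bail out to the general path
    # From here on, no per-operation type checks are needed.
    return a * a + 2 * a * b + b * b
```

The longer the arithmetic after the guard, the smaller the share of time spent on the check itself.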

I agree that data structures are a big part of performance (less wasted memory access), and that dynamic languages can't get quite as good as static languages here, but V8 certainly does quite well. In JIT'd methods, it will even throw away JS types entirely and just operate on proper primitives if it can (V8's hidden classes are also quite robust, as long as you stick somewhat to the layout created by the method that first returned the object, such as a prototype constructor). However, I think it is incorrect to say that the better performance is not also due to much better use of instructions and registers. With what I do for work, a single branch in the wrong place can lead to a 15% drop in performance, so I certainly do not think that CPU time should be tossed around.
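For readers unfamiliar with hidden classes, the general idea can be sketched as follows. This is a simplified illustration, not V8's actual implementation; `Shape`, `Obj`, and `make_point` are names invented for the example.

```python
class Shape:
    """Hypothetical hidden class: maps property names to slot indices."""
    def __init__(self, offsets=None):
        self.offsets = offsets or {}
        self.transitions = {}          # property name -> next Shape

    def add_property(self, name):
        # Reuse an existing transition so that objects built the same way
        # end up sharing one shape.
        if name not in self.transitions:
            new_offsets = dict(self.offsets)
            new_offsets[name] = len(new_offsets)
            self.transitions[name] = Shape(new_offsets)
        return self.transitions[name]

EMPTY = Shape()

class Obj:
    def __init__(self):
        self.shape = EMPTY
        self.slots = []

    def set(self, name, value):
        if name in self.shape.offsets:
            self.slots[self.shape.offsets[name]] = value
        else:
            self.shape = self.shape.add_property(name)
            self.slots.append(value)

    def get(self, name):
        return self.slots[self.shape.offsets[name]]

def make_point(x, y):
    # Plays the role of a constructor that fixes the initial layout.
    p = Obj()
    p.set("x", x)
    p.set("y", y)
    return p

p1 = make_point(1, 2)
p2 = make_point(3, 4)
```

Because `p1` and `p2` were built in the same order, they share a shape, so code specialized for that shape can read `x` and `y` at fixed slot indices. An object that adds the same properties in a different order gets a different shape.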

I agree that navigating a large tree structure is mostly memory bound, but even then, a VM will show some performance degradation compared to natively compiled code.
