Say we had code like,
    a + b
Statically compiled code might look like
    load a from stack to register
    load b from stack to register
    add a and b
A VM would look like
    load opcode from state to register
    compute opcode address # nullop or two loads and index
    jump to opcode block
    load a from state to register
    load b from state to register
    add a and b
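To make the VM sequence above concrete, here is a minimal sketch of a register-based dispatch loop in C. The instruction layout, the `OP_ADD`/`OP_HALT` opcodes, and the `VM` struct are hypothetical names invented for illustration; they are not taken from any particular VM.

```c
#include <stdint.h>

/* Hypothetical opcodes and three-operand instruction format. */
enum { OP_ADD, OP_HALT };

typedef struct {
    uint8_t op;   /* opcode */
    uint8_t dst;  /* destination register index */
    uint8_t a;    /* first source register index */
    uint8_t b;    /* second source register index */
} Instr;

typedef struct {
    const Instr *pc;  /* next instruction to execute */
    int64_t regs[8];  /* VM registers live in memory (the "state") */
} VM;

/* Runs until OP_HALT and returns the value left in register 0.
 * Returns -1 on an unknown opcode; a real VM would raise an error. */
int64_t vm_run(VM *vm) {
    for (;;) {
        Instr in = *vm->pc++;   /* load opcode from state */
        switch (in.op) {        /* compute opcode address, jump to opcode block */
        case OP_ADD:
            /* load a, load b, add: the same work as the static version */
            vm->regs[in.dst] = vm->regs[in.a] + vm->regs[in.b];
            break;
        case OP_HALT:
            return vm->regs[0];
        default:
            return -1;
        }
    }
}
```

Running a two-instruction program (an `OP_ADD` of registers 1 and 2 into register 0, then `OP_HALT`) goes once around the loop per instruction. The only work beyond the static version is the instruction fetch and the indirect jump that the `switch` typically compiles to.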
All of that is easily pipelined, especially by the very latest processors which speculate through indirect jumps (which is why we have Spectre, etc.). The above is idealized but well reflects, I think, how modern register-based software VMs work.
But when you have a JIT for a dynamically typed language, the entry and exit points of both interpreted sequences and JIT'd sequences require many more instructions to manage bookkeeping, exploding the cost. JIT'ing only works if you can compile blocks of code large enough that the benefits exceed the bookkeeping costs. But that's a tall order for dynamically typed languages where runtime mutations can invalidate JIT'd blocks at many points in a sequence, such as with prototype-based languages.
Getting "[p]roperly optimized JIT output" is the crux of the problem. It takes significant instrumentation and indirection to create and maintain "[p]roperly optimized JIT output". You can't compare the optimized machine code sequences to the analogous interpreted sequences, independent of the surrounding machinery.
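As a rough illustration of that surrounding machinery, here is a sketch of a specialized add that has to be guarded before it can run. Everything here is hypothetical: the `Value` tagging scheme, the `world_epoch` counter standing in for "some runtime mutation invalidated compiled code", and the fallback to a generic path. Real JITs differ a lot in how they do this.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { T_INT, T_DOUBLE } Tag;

typedef struct {
    Tag tag;
    union { int64_t i; double d; } as;
} Value;

/* Hypothetical counter bumped whenever a runtime mutation
 * (e.g. a prototype change) invalidates compiled code. */
uint32_t world_epoch = 0;

typedef struct {
    uint32_t compiled_epoch;  /* epoch the fast path was specialized under */
} CompiledAdd;

/* Generic path: handles any combination of operand types. */
Value add_generic(Value a, Value b) {
    Value r;
    if (a.tag == T_INT && b.tag == T_INT) {
        r.tag = T_INT;
        r.as.i = a.as.i + b.as.i;
    } else {
        double x = (a.tag == T_INT) ? (double)a.as.i : a.as.d;
        double y = (b.tag == T_INT) ? (double)b.as.i : b.as.d;
        r.tag = T_DOUBLE;
        r.as.d = x + y;
    }
    return r;
}

/* Specialized path for int + int. The guards are the bookkeeping:
 * they run on every entry, and a failed guard bails out to the
 * generic path. */
Value add_specialized(const CompiledAdd *code, Value a, Value b, bool *deopted) {
    if (code->compiled_epoch != world_epoch || a.tag != T_INT || b.tag != T_INT) {
        *deopted = true;
        return add_generic(a, b);
    }
    *deopted = false;
    Value r = { .tag = T_INT, .as = { .i = a.as.i + b.as.i } };
    return r;
}
```

The add itself is one instruction; the three comparisons and the bail-out path are what you pay to be allowed to run it. That only amortizes if the guarded block is much bigger than a single add.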
Much of the performance benefit of statically compiled code isn't in execution, per se, but in the data structures. A language like Lua is constantly indexing hash tables[1] for even simple record objects, whereas in C you're usually doing direct memory references. But transforming hash table lookups in a dynamic language into direct memory references a la statically compiled C structs is extremely hard if not impossible. Engines like V8 manage to do it much of the time in the context of loading prototype methods, but for ad hoc runtime data structures I don't think they can optimize that at all.
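To show the difference in the data structures, here is a sketch contrasting a C struct field access with the same record stored in a hash table keyed by field name. The `Table` type is a hypothetical toy (open addressing, fixed size, never resized), not how Lua or any real engine lays out its tables.

```c
#include <stddef.h>
#include <string.h>

/* Static record: p->y compiles to a load at a fixed offset from p. */
typedef struct { double x, y; } Point;

double point_y(const Point *p) { return p->y; }

/* Dynamic record: a toy open-addressing table keyed by field name. */
#define TABLE_SLOTS 8  /* must be a power of two */

typedef struct { const char *key; double value; } Slot;
typedef struct { Slot slots[TABLE_SLOTS]; } Table;

static size_t hash_str(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Inserts or overwrites. Assumes the table never fills up. */
void table_set(Table *t, const char *key, double value) {
    size_t i = hash_str(key) & (TABLE_SLOTS - 1);
    while (t->slots[i].key && strcmp(t->slots[i].key, key) != 0)
        i = (i + 1) & (TABLE_SLOTS - 1);
    t->slots[i].key = key;
    t->slots[i].value = value;
}

/* Returns 1 and stores the value in *out if found, else returns 0.
 * Every access hashes the key, probes, and compares strings. */
int table_get(const Table *t, const char *key, double *out) {
    size_t i = hash_str(key) & (TABLE_SLOTS - 1);
    while (t->slots[i].key) {
        if (strcmp(t->slots[i].key, key) == 0) {
            *out = t->slots[i].value;
            return 1;
        }
        i = (i + 1) & (TABLE_SLOTS - 1);
    }
    return 0;
}
```

`point_y` is a single load. `table_get` is a hash, at least one probe, and a string compare, and no amount of speeding up the instruction dispatch around it changes that.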
But if your code is primarily operating on, e.g., JSON trees, it wouldn't matter one way or another. If your statically compiled code isn't benefiting from direct memory addressing of data (as is the case with many types of applications) then statically compiled, JIT'd, and interpreted code can have similar runtime profiles, and in many cases you can't even be sure which will be faster in real-world systems.
[1] Lua has opcodes for this so the cost is fixed and small relative to raw C code doing the lookup. And strings in Lua are interned so lookup is usually as simple as a mask and direct index into an array.
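A sketch of why interning makes the lookup cheap, under my own simplifying assumptions: the `IString` pool, `intern`, and `ITable` below are hypothetical names and a toy layout, not Lua's actual implementation. The point is only that once strings are interned, the hash is computed once up front and key comparison becomes a pointer compare.

```c
#include <stdint.h>
#include <string.h>

/* An interned string carries its hash, computed once at intern time. */
typedef struct { const char *text; uint32_t hash; } IString;

#define POOL_MAX 16
static IString pool[POOL_MAX];
static int pool_len = 0;

static uint32_t hash_text(const char *s) {
    uint32_t h = 2166136261u;  /* FNV-1a */
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Returns the canonical IString for s, or NULL if the pool is full.
 * Linear scan for brevity; the caller's string must outlive the pool. */
const IString *intern(const char *s) {
    for (int i = 0; i < pool_len; i++)
        if (strcmp(pool[i].text, s) == 0) return &pool[i];
    if (pool_len == POOL_MAX) return NULL;
    pool[pool_len].text = s;
    pool[pool_len].hash = hash_text(s);
    return &pool[pool_len++];
}

#define ISLOTS 8  /* must be a power of two */
typedef struct { const IString *key; double value; } ISlot;
typedef struct { ISlot slots[ISLOTS]; } ITable;

/* Inserts or overwrites. Assumes the table never fills up. */
void itable_set(ITable *t, const IString *k, double v) {
    uint32_t i = k->hash & (ISLOTS - 1);
    while (t->slots[i].key && t->slots[i].key != k)
        i = (i + 1) & (ISLOTS - 1);
    t->slots[i].key = k;
    t->slots[i].value = v;
}

/* Returns 1 and stores the value in *out if found, else returns 0. */
int itable_get(const ITable *t, const IString *k, double *out) {
    uint32_t i = k->hash & (ISLOTS - 1);  /* mask and direct index */
    while (t->slots[i].key) {
        if (t->slots[i].key == k) {       /* pointer compare, no strcmp */
            *out = t->slots[i].value;
            return 1;
        }
        i = (i + 1) & (ISLOTS - 1);
    }
    return 0;
}
```

In the common case with no collision, `itable_get` is a mask, an index, and one pointer comparison. That is still more than a struct field load, but it is a small fixed cost.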