However, if you leave register allocation to the source->VM compiler and don't do it in the VM->machine compiler (JIT), you cannot perform machine-specific register allocation (8 registers on IA-32, 16 on AMD64 and ARM, 32 on Aarch64). So register-based VMs typically work with large register sets, and JITs allocate them to the smaller register sets of real machines.
I have not read the present paper yet, but earlier papers I have read were about reducing interpretation overhead by performing fewer VM instructions (if VM instruction dispatch is expensive; dynamic superinstructions make it cheap, but not everyone implements them).