so, in the space of a single 16-bit thumb or rv32c instruction, you can fit 1.8–2.8 stack bytecode instructions, and that makes a big difference
i compiled some code for rv32ec. here are 8 instructions of it occupying 22 bytes. what would this look like in a stack bytecode?
    addr  encoding  rv32ec                stack bytecode         bytecode bytes
    38:   00158693  addi a3,a1,1          a1 1 +                 3
    3c:   43d8      lw a4,4(a5)           a5 4 + @               4
    3e:   08e6fa63  bgeu a3,a4,d2 <.L15>  < if <.L15>            3 (including jump target)
    42:   439c      lw a5,0(a5)           a5 @                   2
    44:   058a      slli a1,a1,0x2        a1 2 lshift            3
    46:   7179      addi sp,sp,-48        literal 48 stackframe  3
    48:   97ae      add a5,a5,a1          +                      1
    4a:   0007a303  lw t1,0(a5)           @                      1
we can see that in a stack instruction set this would use about 20
bytes, barely less, because the instructions being smaller is offset
by their being more numerous.
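as a sanity check on that tally, here's a minimal python sketch that adds up both columns. the hex encodings and per-instruction bytecode counts are copied from the listing above; the variable names are just mine, and the bytecode counts are still my hand estimates, not the output of any real compiler.

```python
# (rv32ec encoding in hex, estimated stack-bytecode bytes) per instruction,
# copied from the listing above
listing = [
    ("00158693", 3),  # addi a3,a1,1          -> a1 1 +
    ("43d8",     4),  # lw a4,4(a5)           -> a5 4 + @
    ("08e6fa63", 3),  # bgeu a3,a4,d2 <.L15>  -> < if <.L15>
    ("439c",     2),  # lw a5,0(a5)           -> a5 @
    ("058a",     3),  # slli a1,a1,0x2        -> a1 2 lshift
    ("7179",     3),  # addi sp,sp,-48        -> literal 48 stackframe
    ("97ae",     1),  # add a5,a5,a1          -> +
    ("0007a303", 1),  # lw t1,0(a5)           -> @
]

# two hex digits encode one byte
riscv_bytes = sum(len(encoding) // 2 for encoding, _ in listing)
stack_bytes = sum(estimate for _, estimate in listing)

print(f"rv32ec: {riscv_bytes} bytes, stack bytecode: {stack_bytes} bytes")
```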
on the other hand, some of the code is occupied with doing
things like allocating a stack frame, which usually isn’t necessary on
a stack machine

but as far as i can tell on the x87 (which i have never programmed, i'm just going by p. a-1 (176/258) et seq. of http://bitsavers.trailing-edge.com/components/intel/80386/23...) all the instructions are at least two bytes, so i don't see where you get any extra code density
for what it's worth, the subroutine that the above was taken from compiles to 62 instructions and 156 bytes for rv32ec, 61 instructions and 189 bytes for amd64, and 52 instructions and 138 bytes for arm cortex-m4. i'll be compiling it for my own stack-based virtual machine this year but i don't have even a prototype compiler backend yet
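to put those three on a common footing, this small python sketch works out average bytes per instruction from the counts above. the dictionary keys are just labels i picked, and an average over one subroutine says nothing about other code.

```python
# (instruction count, byte count) for the same subroutine, as quoted above
sizes = {
    "rv32ec": (62, 156),
    "amd64": (61, 189),
    "cortex-m4": (52, 138),
}

# average encoded size of one instruction on each target
bytes_per_insn = {isa: nbytes / insns for isa, (insns, nbytes) in sizes.items()}

for isa, avg in bytes_per_insn.items():
    print(f"{isa}: {avg:.2f} bytes/instruction")
```

that works out to about 2.52, 3.10, and 2.65 bytes per instruction respectively.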
https://www.usenix.org/legacy/events%2Fvee05%2Ffull_papers/p... [2005]
(Hey, I seem to remember that an Anton Ertl posts to comp.compilers.)
Spoiler: they claim that with their sophisticated translation from stack to register code, they eliminated 47% of the instructions, and the resulting code is still around 25% larger than the byte code. The size advantage goes to stack-based byte code, but it may not necessarily be as large as you might think.
So more or less in line with your findings or intuition?
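One rough way to compare, sketched in Python: if 47% of the instructions go away and the code still ends up 25% larger, each register instruction must be correspondingly bigger on average. This assumes both figures describe the same code counted the same way, which the summary above does not specify (the instruction figure could be an executed count rather than a static one), so treat the result as a back-of-the-envelope number only. The variable names are mine.

```python
# Figures as quoted above; everything is relative to the stack bytecode.
insns_eliminated = 0.47  # fraction of stack instructions removed by translation
size_growth = 0.25       # register code is about 25% larger than the bytecode

register_insns = 1.0 - insns_eliminated  # register instructions per stack instruction
register_bytes = 1.0 + size_growth       # register-code bytes per bytecode byte

# average register-instruction size in units of the average stack-instruction size
size_ratio = register_bytes / register_insns

print(f"a register instruction is about {size_ratio:.2f}x the size of a stack instruction")
```

Under that assumption the ratio comes out near 2.36, which falls inside the 1.8–2.8 range mentioned at the top of the thread.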