What percent of the die is an ARM instruction decoder?
ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.
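To make the "trivial" part concrete, here is a minimal sketch of finding instruction boundaries in a fixed-width ISA. The function name `split_fixed_width` is made up for illustration; the two encodings used are the A64 `NOP` (0xD503201F) and `RET` (0xD65F03C0). There is no length logic at all: every decoder lane knows its start offset before looking at a single byte.

```python
import struct

def split_fixed_width(window):
    """Split a fetch window into little-endian 32-bit instruction words.

    Each lane's start offset is simply 4 * lane, so all lanes can begin
    decoding at the same time without waiting on each other.
    """
    count = len(window) // 4
    return list(struct.unpack(f"<{count}I", window[:count * 4]))

# Three A64 NOPs followed by a RET, as they would sit in memory.
window = bytes.fromhex("1f2003d5" "1f2003d5" "1f2003d5" "c0035fd6")
words = split_fixed_width(window)
print([hex(w) for w in words])
```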
What nearly everyone uses is a 16-byte buffer, aligned to the program counter, feeding the first decode stage. Yes, this first stage has to look at each byte offset as if it could be the start of a new instruction, but it doesn't have to do a full decode; it only finds instruction lengths. From there you feed the length information forward and do full decode only at the byte offsets that are actual instruction boundaries. That's how you end up with x86 cores with '4 wide decode' despite needing to initially look at each byte.
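A rough software model of that two-stage scheme, as a sketch only. The encoding is a made-up toy (the low two bits of the first byte give a length of 1 to 4 bytes), not x86, and `toy_length`, `length_stage` and `pick_boundaries` are hypothetical names. In hardware the first stage runs at all 16 offsets at once; the list comprehension stands in for that.

```python
WINDOW = 16  # bytes in the fetch buffer
WIDTH = 4    # number of full decoders

def toy_length(first_byte):
    # Toy encoding (an assumption, not x86): low 2 bits + 1 = length in bytes.
    return (first_byte & 0x3) + 1

def length_stage(window):
    # Stage 1: speculatively compute a length at EVERY byte offset,
    # as if each byte could be the start of an instruction.
    return [toy_length(b) for b in window]

def pick_boundaries(lengths, width=WIDTH):
    # Stage 2: chain the lengths from offset 0 to find the real starts,
    # stopping once every full decoder has an instruction.
    starts, offset = [], 0
    while offset < len(lengths) and len(starts) < width:
        if offset + lengths[offset] > len(lengths):
            break  # instruction spills past the window; leave it for next cycle
        starts.append(offset)
        offset += lengths[offset]
    return starts, offset  # offset = bytes consumed this cycle

window = bytes([3, 0, 0, 0,  0,  1, 0,  2, 0, 0,  0, 0, 0, 0, 0, 0])
starts, consumed = pick_boundaries(length_stage(window))
print(starts, consumed)
```

Only the offsets in `starts` would be handed to the full decoders; the lengths computed at the other offsets are simply thrown away.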
Now for the efficiencies. The length decoders at each byte offset aren't symmetric. Only the length decoder at offset 0 in the buffer has to handle everything. The other length decoders can simply flag "I can't handle this"; the buffer then won't be shifted down past their offset on the next cycle, and the byte 0 decoder can fix up any goofiness. Because of this, they
* can be stripped of instructions that aren't really used much anymore, if that helps them
* can be stripped of weird cases like handling crazy usages of prefix bytes
* don't have to handle instructions bigger than their portion of the decode buffer. For instance, a length decoder starting at byte 12 can't handle more than a 4 byte instruction anyway, so that can simplify its logic considerably. That means that the simpler length decoders end up feeding into the higher stack up full decoder selection, so some of the overhead cancels out in a nice way.
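The asymmetry can be sketched the same way. Everything here is a toy assumption rather than real x86: `PREFIX` is an invented prefix byte, lengths come from the low two bits of the opcode byte, and `full_length`, `simple_length` and `mark_boundaries` are hypothetical names. The point is only the bail-out behaviour: a simple decoder returns `None`, the window stops being consumed at that offset, and on the next cycle those bytes sit at offset 0 where the full-featured decoder sees them.

```python
PREFIX = 0xF0  # invented prefix byte for this toy encoding

def simple_length(window, off):
    # Cheap decoder used at offsets > 0: no prefix handling, and it gives up
    # on anything that does not fit in its remaining slice of the window.
    first = window[off]
    if first == PREFIX:
        return None  # "I can't handle this"
    n = (first & 0x3) + 1
    return n if off + n <= len(window) else None

def full_length(window, off):
    # Full decoder used only at offset 0: skips any run of prefix bytes.
    n = 0
    while off + n < len(window) and window[off + n] == PREFIX:
        n += 1
    if off + n >= len(window):
        return None  # even the full decoder needs more bytes
    total = n + (window[off + n] & 0x3) + 1
    return total if off + total <= len(window) else None

def mark_boundaries(window, width=4):
    starts, off = [], 0
    while off < len(window) and len(starts) < width:
        length = full_length(window, off) if off == 0 else simple_length(window, off)
        if length is None:
            break  # do not shift past this offset; retry it at offset 0
        starts.append(off)
        off += length
    return starts, off

cycle1 = bytes([1, 0, PREFIX, 2, 0, 0] + [0] * 10)
print(mark_boundaries(cycle1))  # stalls at the prefixed instruction
cycle2 = cycle1[2:] + bytes(2)  # buffer shifted down by the 2 bytes consumed
print(mark_boundaries(cycle2))  # offset-0 decoder now handles the prefix
```

The cost of the bail-out is a narrower decode on that one cycle, which is cheap as long as the stripped-out cases really are rare.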
On top of that, I think that 5% includes pieces like the microcode ROMs. Modern ARM cores almost certainly have (albeit much smaller) microcode ROMs as well to handle the more complex state transitions.
Once again, totally agreed with your main point, but it's closer than what the general public consensus says.
This is also not a good security property since it means you can hide secret instructions in a program by jumping into the middle of innocuous ones.
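A small illustration of what that looks like, using real x86 encodings but a deliberately tiny decoder that only knows four opcodes (`B8 imm32` = `mov eax, imm32`, `31 C0` = `xor eax, eax`, `90` = `nop`, `C3` = `ret`). The function `decode` is a hypothetical helper, nowhere near a real disassembler. The same five bytes give two different instruction streams depending on where you start:

```python
def decode(code, start):
    """Linear-sweep decode of a tiny x86 subset, beginning at byte `start`."""
    out, i = [], start
    while i < len(code):
        op = code[i]
        if op == 0xB8 and i + 5 <= len(code):
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov eax, {imm:#x}")
            i += 5
        elif op == 0x31 and i + 2 <= len(code) and code[i + 1] == 0xC0:
            out.append("xor eax, eax")
            i += 2
        elif op == 0x90:
            out.append("nop")
            i += 1
        elif op == 0xC3:
            out.append("ret")
            i += 1
        else:
            out.append(f"db {op:#04x}")  # byte this toy decoder doesn't know
            i += 1
    return out

code = bytes.fromhex("b831c090c3")
print(decode(code, 0))  # the innocuous-looking view
print(decode(code, 1))  # the view after jumping one byte in
```

Starting at offset 0 the bytes are a single `mov` with an odd-looking immediate; starting at offset 1 the immediate itself decodes as `xor eax, eax; nop; ret`.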
> ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.
Also, A64 doesn't have a Thumb equivalent, and supporting A32/T32 is optional.