NortySpock 5y ago

> x86 instruction decoder may be only about ~5% of the die

What percent of the die is an ARM instruction decoder?

duskwuff 5y ago

Much less. x86 instruction decoding is complicated by the fact that instructions are variable-width and are byte-aligned (i.e. any instruction can begin at any address). This makes decoding more than one instruction per clock cycle complicated -- I believe the silicon has to try decoding instructions at every possible offset within the decode buffer, then mask out the instructions which are actually inside another instruction.

ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.
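A minimal sketch of why fixed-width, word-aligned decoding parallelizes so easily: every instruction boundary is known before a single byte is examined. The function names are hypothetical, and the code only models boundary finding, not a real A64 decoder.

```python
def fixed_width_boundaries(buffer_len, width=4):
    """Start offsets of every instruction in a fixed-width, aligned buffer."""
    return list(range(0, buffer_len - width + 1, width))


def split_fixed(buf, width=4):
    """Cut the buffer into instruction words (little-endian).

    No slot depends on any other slot, so in hardware each word
    could go to its own decoder in the same cycle.
    """
    return [int.from_bytes(buf[i:i + width], "little")
            for i in fixed_width_boundaries(len(buf), width)]


if __name__ == "__main__":
    print(fixed_width_boundaries(16))  # boundaries depend only on the length
```

The boundaries are a function of the buffer length alone, which is the property the variable-width x86 encoding lacks.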

monocasa 5y ago

I totally agree with the core of your argument (aarch64 decoding is inherently simpler and more power efficient than x86), but I'll throw out there that it's not quite as bad as you say on x86, as there are some nonobvious efficiencies (I've been writing a parallel x86 decoder).

What nearly everyone uses is a 16 byte buffer aligned to the program counter being fed into the first stage decode. Yes, this first stage has to look at each byte offset as if it could be a new instruction, but it doesn't have to do full decode. It only finds instruction length information. From there you feed this length information in and do full decode on the byte offsets that represent actual instruction boundaries. That's how you end up with x86 cores with '4 wide decode' despite needing to initially look at each byte.
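The two stages described above can be sketched as follows. This is an illustration only: `toy_length` is a made-up length rule (low two bits of the first byte, plus one), not x86 length decoding, and all names are hypothetical.

```python
def toy_length(buf, off):
    """Hypothetical length rule (NOT x86): low two bits of the first byte, plus one."""
    return (buf[off] & 0b11) + 1


def stage1_lengths(buf):
    """Stage 1: speculatively treat every byte offset as an instruction start.

    Each entry is independent, so hardware could compute them all in parallel.
    """
    return [toy_length(buf, off) for off in range(len(buf))]


def stage2_select(lengths, max_decode=4):
    """Stage 2: walk from offset 0 and keep only the real boundaries.

    Stops after max_decode instructions or when one would cross the buffer end.
    """
    starts, off = [], 0
    while off < len(lengths) and len(starts) < max_decode:
        if off + lengths[off] > len(lengths):
            break
        starts.append(off)
        off += lengths[off]
    return starts


if __name__ == "__main__":
    buf = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 1, 2, 3] + [0] * 6)
    print(stage2_select(stage1_lengths(buf)))
```

Only the offsets chosen in stage 2 would be handed to full decoders; the lengths computed at the other offsets are simply discarded.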

Now for the efficiencies. The length decoders at the different byte offsets aren't symmetric. Only the length decoder at offset 0 in the buffer has to handle everything; the other length decoders can simply flag "I can't handle this", in which case the buffer won't be shifted down past where they were on the next cycle and the byte 0 decoder can fix up any goofiness. Because of this, the other length decoders

* can be stripped of support for instructions that aren't really used much anymore, if that helps them

* can be stripped of weird cases like handling crazy usages of prefix bytes

* don't have to handle instructions bigger than their portion of the decode buffer. For instance, a length decoder starting at byte 12 can't handle more than a 4 byte instruction anyway, so that can simplify its logic considerably. That means that the simpler length decoders end up feeding into the full decoder selection higher up the stack, so some of the overhead cancels out in a nice way.
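A rough model of the asymmetry described above, under stated assumptions: the encoding is a toy one (not x86), the byte `0xFF` stands in for a hypothetical "rare" 6 byte form that only the offset 0 decoder understands, and every function name is invented for illustration.

```python
def full_length(window):
    """Offset-0 decoder: handles every case in this toy encoding."""
    if window[0] == 0xFF:            # hypothetical rare 6-byte form
        return 6
    return (window[0] & 0b11) + 1


def simple_length(window, off):
    """Decoder at an offset > 0: allowed to give up by returning None."""
    if window[off] == 0xFF:
        return None                  # rare form: leave it to the offset-0 decoder
    n = (window[off] & 0b11) + 1
    return n if off + n <= len(window) else None   # would cross the buffer end


def decode_cycle(window, max_decode=4):
    """Return (instruction starts, bytes consumed) for one decode cycle."""
    lengths = [full_length(window)]
    lengths += [simple_length(window, o) for o in range(1, len(window))]
    starts, off = [], 0
    while off < len(window) and len(starts) < max_decode:
        n = lengths[off]
        if n is None or off + n > len(window):
            break                    # stop here; next cycle this byte is at offset 0
        starts.append(off)
        off += n
    return starts, off


if __name__ == "__main__":
    w = bytes([0x01, 0x00, 0x00, 0xFF, 0, 0, 0, 0, 0, 0x00] + [0] * 6)
    print(decode_cycle(w))           # stalls at the 0xFF byte
    print(decode_cycle(w[3:]))       # shifted window: 0xFF is now at offset 0
```

The cost of a simplified decoder giving up is a shorter decode group in that cycle, not a wrong result, since the same byte is retried at offset 0 on the next cycle.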

On top of that, I think that 5% includes pieces like the microcode ROMs. Modern ARM cores almost certainly have (albeit much smaller) microcode ROMs as well to handle the more complex state transitions.

Once again, totally agreed with your main point, but it's closer than what the general public consensus says.

ant6n 5y ago

I wonder whether a modern byte-sized instruction encoding would sort of look like Unicode, where every byte is self synchronizing... I guess it can be even weaker than that, probably only every second or fourth byte needs to synchronize.
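One way to picture the idea, as a sketch only: assume a hypothetical encoding where every continuation byte has its top bit set (a simplification of UTF-8's `10xxxxxx` continuation marker). Instruction starts can then be found with a purely local test on each byte, with no walk from offset 0.

```python
def is_start(byte):
    """Hypothetical rule: start bytes have the top bit clear, continuation bytes have it set."""
    return (byte & 0x80) == 0


def boundaries(buf):
    """Every instruction start, found by looking at each byte in isolation."""
    return [i for i, b in enumerate(buf) if is_start(b)]


def resync(buf, pos):
    """Back up from an arbitrary position to the start of the enclosing instruction."""
    while pos > 0 and not is_start(buf[pos]):
        pos -= 1
    return pos


if __name__ == "__main__":
    buf = bytes([0x01, 0x80, 0x81, 0x02, 0x03, 0xFF])
    print(boundaries(buf), resync(buf, 2))
```

The trade-off in this toy scheme is encoding density: one bit of every byte is spent on synchronization. Marking only every second or fourth byte, as suggested above, would reduce that cost.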

astrange 5y ago

> x86 instruction decoding is complicated by the fact that instructions are variable-width and are byte-aligned (i.e. any instruction can begin at any address).

This is also not a good security property since it means you can hide secret instructions in a program by jumping into the middle of innocuous ones.
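A small illustration of that point, using a toy disassembler that knows only four 32-bit-mode x86 encodings (`B8 imm32` = `mov eax, imm32`, `31 C0` = `xor eax, eax`, `C3` = `ret`, `90` = `nop`). The disassembler and the byte string are constructed for this example; anything outside that table is reported as unknown.

```python
def disasm(code, start):
    """Linear sweep over `code` from `start`, using a four-entry toy opcode table."""
    out, i = [], start
    while i < len(code):
        op = code[i]
        if op == 0xB8 and i + 5 <= len(code):
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov eax, {imm:#x}")
            i += 5
        elif code[i:i + 2] == b"\x31\xc0":
            out.append("xor eax, eax")
            i += 2
        elif op == 0xC3:
            out.append("ret")
            i += 1
        elif op == 0x90:
            out.append("nop")
            i += 1
        else:
            out.append(f"(byte {op:#04x} not in toy table)")
            i += 1
    return out


if __name__ == "__main__":
    code = bytes.fromhex("b831c0c390c3")
    print(disasm(code, 0))  # the "innocuous" view: a mov with an odd-looking constant
    print(disasm(code, 1))  # entering one byte later: the immediate is itself code
```

Starting at offset 0, the bytes `31 C0 C3 90` are just the immediate of the `mov`; starting at offset 1, the same bytes decode as a separate instruction sequence.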

> ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.

Also, A64 doesn't have a Thumb equivalent, and supporting A32/T32 is optional.