
pbsd · 5y ago

The x86 decoder is not running all the time; the uops cache and the LSD (loop stream detector) exist precisely to avoid this. With instructions fed from the decoders you can only sustain 4 instructions per cycle; to get to 5 or 6, your instructions need to come from either the uops cache or the LSD. In the case of Zen 3, the cache can deliver 8 uops per cycle to the pipeline (but the overall throughput is limited elsewhere to 6)!

Furthermore, the high-performance ARM designs, starting with the Cortex-A77, began using the same trick: the 6-wide execution happens only when instructions are being fed from the decoded macro-op cache.


8 comments · 3 top-level

ant6n · 5y ago

How can you run 8 instructions at the same time if you only have 16 general purpose registers? You’d have to either be doing float ops or constantly spilling. So in integer code, how many of those instructions are just moving data between memory and registers (push/pop)?

I’d say ARM has a big advantage for instruction level parallelism with 32 registers.

mhh__ · 5y ago

ant6n · 5y ago

Okay, fair. But the bigger subject is the inherent performance advantage of the architecture. You don’t just want to decode many instructions per cycle, you also want to issue them. So decoding width and issue width are related.

And it seems to me that ARM has an advantage here. If you want to execute 8 instructions in parallel, you've got to actually have 8 independent things that need to get executed. I guess you could have a giant out-of-order buffer, and include stack locations in your register renaming scheme, but it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent. That is much easier if you have more registers: the compiler can then help the CPU keep all those execution units fed.

wtallis · 5y ago

In practice, it appears that even though Apple is using the ARM instruction set, they are still relying on truly massive reorder buffers.

d110af5ccf · 5y ago

You seem to have several fairly fundamental misunderstandings about CPUs at a low level.

> include stack locations in your register renaming scheme

Registers aren't related to the stack. "The" stack is just RAM being accessed in a specific cache friendly pattern, with additional optimizations (if you use specific registers) from the hardware in the form of the stack engine. The compiler explicitly loads and stores to and from the registers named by the ISA. Register renaming has absolutely nothing to do with the stack.

When the CPU can tell that a later instruction doesn't depend on the previous value of a register, it's free to rename it. The result is that two independent registers get used even though only one was ever directly referenced. In reality, there are a _huge_ number of registers available on modern processors. Estimates place Skylake, Zen, and Cortex-X1 at 200+, with the M1 at 600+. The ISA just doesn't provide a way to access them directly. (If you want to read about this, the term to look up is reorder buffer.)
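To make the renaming idea above concrete, here is a minimal sketch in Python. It is a toy model, not how any real core implements renaming: the function name `rename`, the `p0`, `p1`, ... physical register names, and the unbounded supply of physical registers are all assumptions made for illustration.

```python
from itertools import count

def rename(instructions, arch_regs=("eax", "ebx")):
    """Map architectural registers to fresh physical registers (toy model).

    Each instruction is (dest, sources). Every write allocates a new
    physical register, so a later write to the same architectural name
    does not conflict with earlier readers of the old value.
    """
    fresh = count()  # hypothetical: an unlimited pool of physical registers
    # Each architectural register starts out mapped to one physical register.
    mapping = {r: f"p{next(fresh)}" for r in arch_regs}
    renamed = []
    for dest, sources in instructions:
        phys_sources = tuple(mapping[s] for s in sources)  # read current mapping
        mapping[dest] = f"p{next(fresh)}"                   # allocate on write
        renamed.append((mapping[dest], phys_sources))
    return renamed

program = [
    ("eax", ()),        # load a value into eax
    ("ebx", ("eax",)),  # ebx = f(eax)
    ("eax", ()),        # load a second, unrelated value into eax
    ("ebx", ("eax",)),  # ebx = g(eax)
]
for dest, srcs in rename(program):
    print(dest, "<-", srcs)
```

After renaming, the first pair of instructions and the second pair touch disjoint physical registers, so in this model nothing forces them to run one after the other even though the program only ever names `eax` and `ebx`.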

Also, there is a giant out of order buffer for stores waiting to be written back to L1. That buffer does indeed have to keep track of cache locations, which directly map to memory addresses, which sometimes happen to refer to stack locations. So in a sense, what you suggested already exists. (If you want to read about this, the term to look up is store buffer.)

> it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent

That would indeed make things simpler in some cases. However, many operations such as loading a value into a register (e.g. mov eax, [addr]) or zeroing it (e.g. xor eax, eax) explicitly break the dependency chain by definition. Cases where the CPU fails to properly account for this are documented as false dependencies.

> the compiler can then help the cpu keeping all those instruction units fed

The "compiler handles ordering" thing was tried with Itanium. It seems it didn't go so well.

The CPU is free to simultaneously load two different pieces of data into the "same" register and execute two independent instruction streams on that "single" register thanks to renaming. Speculative execution helps when the CPU can't be completely certain that there isn't a dependency.

For particularly complicated sequences, the compiler spilling due to running out of named registers could indeed pose an issue. However, the CPU is free to elide a store followed by a load if it determines that the address is the same. (If you want to read about this, terms to look up include store-to-load forwarding and load-hit-store.)
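As a rough illustration of store-to-load forwarding, the sketch below models a store buffer as a list of pending writes that loads check before going to memory. The class `StoreBuffer`, its methods, and the dict standing in for memory are hypothetical names for this toy; real hardware also has to handle partial overlaps, ordering, and speculation, none of which is modeled here.

```python
class StoreBuffer:
    """Toy store buffer: stores that have not been written back yet."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for L1/RAM (assumption)
        self.pending = []      # (address, value) pairs, oldest first

    def store(self, address, value):
        # The store is queued; memory is not touched yet.
        self.pending.append((address, value))

    def load(self, address):
        # Search youngest-first so the load sees the most recent store.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, "forwarded"
        return self.memory.get(address, 0), "memory"

    def drain(self):
        # Commit pending stores to memory in program order.
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

sb = StoreBuffer(memory={0x1000: 7})
sb.store(0x2000, 42)     # a spill to a stack slot
print(sb.load(0x2000))   # the reload is served from the pending store
print(sb.load(0x1000))   # no pending store, so this falls through to memory
```

In this model the spill/reload pair never waits on memory: the reload gets its value straight from the queued store, which is the effect described above.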


mhh__ · 5y ago

The decoder might not be running strictly all the time, but I would wager that for some applications at least it doesn't make much of a difference. For HPC or DSP or whatever where you spend a lot of time in relatively dense loops the uop cache is probably big enough to ease the strain on the decoder, but for sparser code (Compilers come to mind, lots of function calls and memory bound work) I wouldn't be surprised if it didn't make as much difference.

I have VTune installed, so I guess I could investigate this if I dig out the right PMCs.

pbsd (OP) · 5y ago

I agree; compiler-type code will miss the cache most of the time. A simple test with clang++ compiling some nontrivial piece of C++:

                 0      lsd_uops                                                    
     1,092,318,746      idq_dsb_uops                                                  ( +-  0.49% )
     4,045,959,682      idq_mite_uops                                                 ( +-  0.06% )

The LSD is disabled in this chip (Skylake) due to errata, but we can see only about 1/5th of the uops come from the uops cache. However, the more relevant experiment in terms of power is how many cycles the cache is active instead of the decoders:

                 0      lsd_cycles_active                                           
       378,993,057      idq_dsb_cycles                                                ( +-  0.18% )
     1,616,999,501      idq_mite_cycles                                               ( +-  0.07% )

The ratio is similar: the regular decoders are idle only around 1/5th of the time.

In comparison, gzipping a 20M file looks a lot better:

                 0      lsd_cycles_active                                           
     2,900,847,992      idq_dsb_cycles                                                ( +-  0.07% )
       407,705,985      idq_mite_cycles                                               ( +-  0.33% )
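For anyone who wants to check the ratios, here is a small sketch that recomputes them from the counter values quoted above. The helper `share` is a hypothetical name, and treating the DSB and MITE counts as adding up to the whole is a simplifying assumption (the two cycle counters are not guaranteed to be mutually exclusive).

```python
def share(part, *others):
    """Fraction of the total contributed by `part`."""
    return part / (part + sum(others))

# Counter values copied from the perf output above (the LSD counters were 0).
clang_uops = {"dsb": 1_092_318_746, "mite": 4_045_959_682}
clang_cycles = {"dsb": 378_993_057, "mite": 1_616_999_501}
gzip_cycles = {"dsb": 2_900_847_992, "mite": 407_705_985}

runs = [
    ("clang uops", clang_uops),
    ("clang cycles", clang_cycles),
    ("gzip cycles", gzip_cycles),
]
for label, counters in runs:
    fraction = share(counters["dsb"], counters["mite"])
    print(f"{label}: {fraction:.1%} from the uops cache")
```

Under that assumption the uops cache accounts for roughly 21% of uops and 19% of cycles in the clang run, versus roughly 88% of cycles in the gzip run.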

api · 5y ago

The LSD would have to be handling at least half the instruction stream for this to make a big dent, and it doesn't.

Forget Bitcoin mining... how many tons of CO2 are released annually decoding the X86 instruction set?
