Furthermore, high-performance ARM designs, starting with the Cortex-A77, use the same trick: the 6-wide execution happens only when instructions are being fed from the decoded macro-op cache.
I’d say ARM has a big advantage for instruction-level parallelism with 32 registers.
And it seems to me that ARM has an advantage here. If you want to execute 8 instructions in parallel, you gotta actually have 8 independent things that need to get executed. I guess you could have a giant out-of-order buffer, and include stack locations in your register renaming scheme, but it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent. Which is much easier if you have more registers - the compiler can then help the CPU keep all those instruction units fed.
> include stack locations in your register renaming scheme
Registers aren't related to the stack. "The" stack is just RAM being accessed in a specific cache friendly pattern, with additional optimizations (if you use specific registers) from the hardware in the form of the stack engine. The compiler explicitly loads and stores to and from the registers named by the ISA. Register renaming has absolutely nothing to do with the stack.
When the CPU can tell that a later instruction doesn't depend on the previous value of a register, it's free to rename it. The result is that two independent registers get used even though only one was ever directly referenced. In reality, there are a _huge_ number of registers available on modern processors. Estimates place Skylake, Zen, and Cortex-X1 at 200+, with the M1 at 600+. The ISA just doesn't provide a way to access them directly. (If you want to read about this, the term to look up is reorder buffer.)
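To make the renaming idea concrete, here is a minimal Python sketch. It is a toy model, not how any particular core does it: the `rename` function, the tuple format for instructions, and the unlimited supply of physical registers are all assumptions made for illustration (real hardware has a finite physical register file and a free list).

```python
from itertools import count

def rename(instructions):
    """Map architectural register names to fresh physical registers.

    Each instruction is (dest, [sources]). Every write gets a new
    physical register, so reusing a name such as "eax" no longer ties
    unrelated instructions together.
    """
    fresh = count()   # assumption: unlimited physical registers
    mapping = {}      # architectural name -> current physical register id
    renamed = []
    for dest, sources in instructions:
        phys_sources = []
        for src in sources:
            if src not in mapping:
                # Value that was already live before this sequence started.
                mapping[src] = next(fresh)
            phys_sources.append(mapping[src])
        # The destination always gets a brand-new physical register.
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], phys_sources))
    return renamed

# Two independent chains that both go through the name "eax".
program = [
    ("eax", []),        # mov eax, [a]
    ("ebx", ["eax"]),   # ebx computed from the first eax
    ("eax", []),        # mov eax, [b]   (same name, unrelated value)
    ("ecx", ["eax"]),   # ecx computed from the second eax
]
renamed = rename(program)
```

After renaming, the first pair of instructions only touches physical registers 0 and 1 and the second pair only touches 2 and 3, so the two chains share nothing and could execute in parallel even though the program only ever named one `eax`.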
Also, there is a giant out of order buffer for stores waiting to be written back to L1. That buffer does indeed have to keep track of cache locations, which directly map to memory addresses, which sometimes happen to refer to stack locations. So in a sense, what you suggested already exists. (If you want to read about this, the term to look up is store buffer.)
> it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent
That would indeed make things simpler in some cases. However, many operations such as loading a value into a register (e.g. mov eax, [addr]) or zeroing it (e.g. xor eax, eax) explicitly break the dependency chain by definition. Cases where the CPU fails to properly account for this are documented as false dependencies.
> the compiler can then help the cpu keeping all those instruction units fed
The "compiler handles ordering" thing was tried with Itanium. It seems it didn't go so well.
The CPU is free to simultaneously load two different pieces of data into the "same" register and execute two independent instruction streams on that "single" register thanks to renaming. Speculative execution helps when the CPU can't be completely certain that there isn't a dependency.
For particularly complicated sequences, the compiler spilling due to running out of named registers could indeed pose an issue. However, the CPU is free to elide a store followed by a load if it determines that the address is the same. (If you want to read about this, terms to look up include store-to-load forwarding and load-hit-store.)
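A rough Python sketch of the store buffer and store-to-load forwarding mentioned above. The `StoreBuffer` class and its method names are hypothetical, and the model ignores everything that makes the real thing hard (access sizes, partial overlaps, addresses that are not yet known, memory ordering rules).

```python
class StoreBuffer:
    """Toy store buffer: stores wait here before being written to memory."""

    def __init__(self, memory):
        self.memory = memory   # dict address -> value, stands in for L1/RAM
        self.pending = []      # queued (address, value) pairs, oldest first

    def store(self, address, value):
        # A store is only queued; memory is not updated yet.
        self.pending.append((address, value))

    def load(self, address):
        # Search youngest-to-oldest so the most recent matching store wins.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, "forwarded"
        # No pending store to this address: read memory (0 if never written).
        return self.memory.get(address, 0), "memory"

    def drain(self):
        # Write queued stores back to memory in program order.
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

memory = {0x1000: 7}
sb = StoreBuffer(memory)
sb.store(0x2000, 42)             # e.g. a register spilled to a stack slot
spill_reload = sb.load(0x2000)   # served from the buffer, not from memory
other_load = sb.load(0x1000)     # no pending store, so it goes to memory
sb.drain()
```

The point of the sketch is only that a spill followed by a reload of the same address can be satisfied from the queue of pending stores, without waiting for the store to reach the cache.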
I have vTune installed so I guess I could investigate this if I dig out the right PMCs
0 lsd_uops
1,092,318,746 idq_dsb_uops ( +- 0.49% )
4,045,959,682 idq_mite_uops ( +- 0.06% )
The LSD is disabled in this chip (Skylake) due to errata, but we can see that only about 1/5th of the uops come from the uop cache. However, the more relevant experiment in terms of power is how many cycles the uop cache is active instead of the decoders:
0 lsd_cycles_active
378,993,057 idq_dsb_cycles ( +- 0.18% )
1,616,999,501 idq_mite_cycles ( +- 0.07% )
The ratio is similar: the regular decoders are idle only around 1/5th of the time.
In comparison, gzipping a 20M file looks a lot better:
0 lsd_cycles_active
2,900,847,992 idq_dsb_cycles ( +- 0.07% )
407,705,985 idq_mite_cycles ( +- 0.33% )
Forget Bitcoin mining... how many tons of CO2 are released annually decoding the X86 instruction set?
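For anyone who wants to check the "1/5th" figures against the counters quoted above, a small Python sketch of the arithmetic. The `dsb_share` helper is made up for this post, and it assumes the DSB and MITE counters can simply be added to get a total, which is a simplification.

```python
def dsb_share(dsb, mite, lsd=0):
    """Fraction of the total attributed to the uop cache (DSB)."""
    return dsb / (dsb + mite + lsd)

# Counter values copied from the runs quoted above.
first_run_uops = dsb_share(dsb=1_092_318_746, mite=4_045_959_682)
first_run_cycles = dsb_share(dsb=378_993_057, mite=1_616_999_501)
gzip_cycles = dsb_share(dsb=2_900_847_992, mite=407_705_985)

for label, value in [
    ("first run, uops from DSB", first_run_uops),
    ("first run, cycles on DSB", first_run_cycles),
    ("gzip, cycles on DSB", gzip_cycles),
]:
    print(f"{label}: {value:.1%}")
```

That works out to roughly 21% of uops and 19% of cycles from the uop cache in the first workload, versus roughly 88% of cycles for gzip.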