Narrator: It didn't solve anything.
Perhaps they do a good job on popular platforms like x86, because we can encode decades of experience, but not so great on brand new ones.
Our modern CPU cores have hundreds of instructions in flight at any one moment because of how deep their out-of-order (OoO) execution goes. You can only go that deep on OoO if your branch prediction is accurate enough not to choke it.
Yep. For example, on this die shot of a Skylake-X core,[0] you can see the branch predictor is about the same area as a single vector execution port (about 8% of the non-cache area).
[0]: https://twitter.com/GPUsAreMagic/status/1256866465577394181
Zen combines an L1 perceptron and an L2 TAGE[0] predictor[1]. TAGE in particular requires an immense amount of silicon, but it has something like 99.7% prediction accuracy, which is... crazy. The perceptron predictor is almost as good: 99.5%.
I wrote a software TAGE predictor, but sadly it didn't perform as well as predicted (heh) by the paper's authors.
[0]: https://doi.org/10.1145/2155620.2155635 [1]: https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-...
I'd call the state of the field quite bad. For example, compilers do embarrassingly little to help you with the two main bottlenecks we've had for a long time: concurrency and data layout optimization. And even for the naive model (1 CPU / free memory) there is so much potentially automatable manual toil in doing semantics-based transformations in perf work that it's not even funny.
A large part is that we use languages that don't support these kinds of optimizations. It's not just that "C compiler improvements hit a wall"; the sentence continues "...and we didn't develop & migrate to languages whose semantics allow these optimizations". (There's a little of this in the GPU world, but the proprietary infighting there has produced a dev experience and app platform so bad that very few apps outside games venture there.)
There's a whole alternative path of processor history not taken, in which VLIW panned out instead of failing because of over-optimism about compiler optimizations.
LLVM and GCC both have models of out-of-order pipelines, but other than making sure they don't stall the instruction decoder, it's really hazy whether those models actually do anything.
The processors themselves are designed around the code the compilers emit and vice versa.