"We made it wider and deeper".
Gosh. Why didn't anyone think about doing that before?
Some examples of very interesting, non-obvious content:
* Even if store ports are kept fixed (2 in his example), adding store address generators (up to 4 in his example) actually improves performance, because it frees up load port dependencies.
* Within the same core, they use two different styles of load/store address contention mechanisms, which he describes as two tables, one with explicit "allows" and the other with explicit "denies" -- which of course end up converging (I understand it refers to two different encodings which vary in what is stored).
* Between cores, they have completely separate teams which reach different designs for things like this.
* It was interesting to me to discover how isolated the different core design teams work (which makes sense).
* It was interesting to me to picture the load/store address contention subsystem, which must be quite complex and needs to be really fast.
I'll stop listing here, but there is more: the discussion of different types of workloads, gaming workloads being similar to DB workloads (and the two being more similar to each other than to SPEC benchmarks), and so on.
Just go read the interview if you're interested in CPU design!
[1] mostly automated: at least the dialog name labels seem to be hand-edited, as one of them has a typo
What made the transcription "cringe"? I'd like to believe it's accurate.
And the last generation was wider and deeper than the one before it, also costing power and area.
The question that should be asked ... but which would never be answered ... is "What was it that you changed that REQUIRED and ALLOWED you to go wider and deeper?"
It's not a new process node every time.
There's no NEED to have a massive reorder buffer unless you can decode and dispatch that number of instructions in the time it takes for a load to arrive from whichever level of memory hierarchy you're optimising for. And there's no POINT if you're often going to get a misprediction in that number of instructions. OK, so wider decode is one component of that. Is there a difference in memory latency as well? Wider decode past 3 or 4 instructions increasingly means that you can't just end your packet of decoded instructions at the first branch -- as you get wider you're increasingly going to have to both parse past a conditional branch, and then have to predict more than one branch in the same decode cycle. You'll also get into branches that jump to other instructions in the same decode group (either forward or backward).
There are all kinds of complications there, with no doubt interesting solutions, that go far beyond "we went wider and deeper".
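To put rough numbers on the reorder-buffer argument, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names, the widths, the 40-cycle load latency, the 20% branch fraction and the misprediction rates are made up, not taken from the interview or from any real core, and treating branches as independent is a simplification.

```python
def rob_entries_to_cover(width: int, latency_cycles: int) -> int:
    """Instructions the front end can supply while one load is outstanding.

    This is the "NEED" side: a reorder buffer much larger than this cannot
    be filled before the load comes back.
    """
    return width * latency_cycles


def prob_no_mispredict(window: int, branch_fraction: float,
                       mispredict_rate: float) -> float:
    """Chance that a window of instructions holds no mispredicted branch.

    This is the "POINT" side. Assumes every instruction is independently a
    branch with probability branch_fraction, and every branch independently
    mispredicts with probability mispredict_rate (a simplification).
    """
    per_instruction = branch_fraction * mispredict_rate
    return (1.0 - per_instruction) ** window


if __name__ == "__main__":
    latency = 40  # hypothetical load latency in cycles
    for width in (4, 8):
        window = rob_entries_to_cover(width, latency)
        for rate in (0.01, 0.05):  # hypothetical misprediction rates
            p = prob_no_mispredict(window, 0.2, rate)
            print(f"width={width} window={window} "
                  f"mispredict={rate:.0%} useful={p:.2f}")
```

Under these made-up numbers, doubling the width doubles the window you could fill, but the chance that the whole window is on the correct path drops quickly unless the predictor improves too, which is why "wider and deeper" cannot be the whole story.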
I asked ChatGPT to give a contentful summary of the interview, in case anyone is interested. It seems to be more or less accurate, albeit surface-level.
It gets the "why" but not the "how". Maybe someone here can prompt it further to speculate on the "how". I don't think I'll be able to verify its output well enough to do that.
Not a lot of novel information either.