I used it in undergrad to run benchmarks with different cache sizes and cache coherence strategies to see which were more effective. I'm sure Intel and AMD have much more advanced simulation tools though. Most likely multiple, or at least multiple levels of granularity (so you could do stuff like, simulate these potential branch predictor designs at a gate level, and then turn around and simulate the entire CPU at a higher level of abstraction.)
It's also important to note that decisions like this are hugely workload-specific. There's no single best processor for all applications. In extreme examples: almost every transistor on a vector SIMD unit is wasted when trying to optimize for a client JavaScript benchmark; streaming symmetric encryption gets no benefit from L3 cache (which is like half the chip these days!); etc...
*which I swear every company has their own version of
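To make the "different cache sizes" experiment concrete, here is a toy sketch of the idea in Python. It is not the simulator mentioned above, just a hypothetical set-associative LRU model; `simulate_cache`, `looping_trace` and the 64-byte line size are all my own assumptions for illustration.

```python
from collections import OrderedDict


def simulate_cache(addresses, num_sets, ways, line_bytes=64):
    """Return the hit rate of a set-associative LRU cache over an address trace."""
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        cache_set = sets[index]
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)  # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits / total if total else 0.0


def looping_trace(working_set_bytes, passes=4, stride=64):
    """Hypothetical workload: sweep a buffer of the given size several times."""
    return [a for _ in range(passes) for a in range(0, working_set_bytes, stride)]


if __name__ == "__main__":
    trace = looping_trace(64 * 1024)
    for size_kib in (16, 32, 64, 128):
        num_sets = size_kib * 1024 // (64 * 8)  # 8-way, 64-byte lines
        print(size_kib, "KiB:", simulate_cache(trace, num_sets, ways=8))
```

With this particular trace the hit rate jumps from zero to 75% once the cache is as large as the 64 KiB working set, which is the workload-specific cliff being described: a different trace would put the cliff somewhere else entirely.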
Is anyone publishing things like that these days? I mean pages like these:
https://www.realworldtech.com/haswell-cpu/4/ https://www.anandtech.com/show/6355/intels-haswell-architect...
(I noticed that Agner Fog's chapter on Ryzen is conspicuously missing a "Literature" section.)
One of the problems is that the market for this kind of review is very much a niche. And just like all forms of free media, if there aren't enough page views they stop doing it.
I have always thought some of these outlets would consolidate. I mean, I only ever read Anandtech, Servethehome and some Ars, and that is about it. I have RSS headline feeds from a few other sources such as Tom's Hardware and Engadget, but if Anandtech covers the same topic I always go there first.
Not only has that not happened, most of these websites manage to stay afloat by catering to different markets. But I have no idea how the market segmentation works. I can tell a site like Wccftech is pretty much a 100% rumours site with very little, if any, technical knowledge in its writing. And yet it gathers a huge audience.
While others like Tom's Hardware seem to have retained enough of their news readers to stay sustainable.
They tend to bang the drum when it comes to Intel but in AMD reviews you'll get things like "Due to bad luck and timing issues we have not been able to test the latest Intel and AMD servers CPU in our most demanding workloads". It's a lot like reviewing a Ferrari but due to bad luck you could only test it in city traffic.
2 years ago they forgot to cover the Threadripper launch for 2 weeks while the front page was flooded with dozens of uninteresting half page articles about Intel motherboards being launched around the same time. I love a good tech article regardless of which brand they're talking about but bias will always kill the experience for me. YMMV I guess.
https://www.servethehome.com/amd-epyc-7002-series-rome-deliv...
A few years ago they seemed to have additional sources of information (they'd talk about things like instruction-to-port assignments and penalties for moving data between integer and FP domains).
There's also just the trend of modern designs being tricky enough it's harder to infer as much about them and harder to write accessibly about what you do know; it's not super easy to figure out and describe, say, modern branch predictors simply because they're all layering a lot of strategies on each other.
For example, from Haswell on, Agner Fog essentially said Intel's large-core branch predictors are good at lots of things but there's not much he can say about how they work (p29 at https://www.agner.org/optimize/microarchitecture.pdf). Writing code to beat Cortex-A76 prefetchers, AT's Andrei Frumusanu had difficulty fooling them with anything other than essentially-random access patterns and compared them to "black magic" (https://twitter.com/andreif7/status/1102230575522430977). These aren't just random folks saying "wow, CPUs are complicated"; they successfully figured out lots of stuff about past generations of CPUs.
AMD did reference the TAGE family of branch predictors, which there's lots about in public literature. There might be some broadly interesting stuff in the vendors' contributions to gcc/LLVM (machine models and arch-specific optimizations).
Maybe ARM implementors talk a little more about their stuff? That might have something to do with the dynamics of the relatively open/diverse market for ARM SoCs versus the long-running one-on-one-ish x86 rivalry.
Hard to boil all that down to a single point, but if AMD and Intel want to talk more about the guts of their products, I'm sure plenty of grateful wonks would lap it up. :)
All the documentation I've seen is quite light on the branch prediction improvements. Going by the slides, they improved its accuracy by 1/3; I'd be curious to know how. Side note: if your superscalar core is big enough (yeah, those registers use power), couldn't you just get rid of branch prediction at no performance cost (doing something else while waiting for the data)?
My only grudge against Zen (as a consumer) is that the AM4 socket is intended for both APUs and CPUs. While this is a good thing, I have a couple utterly useless video outputs on my motherboard. I would have liked AMD to include some display driver circuitry on every chip. Maybe in the I/O die, if they use such a thing in all of their designs going forward? I mean, I would be quite content with using software rendering when I need to drive a screen, or even spare a bit of memory bandwidth and CPU cycles to drive an extra display from my desktop's graphics card.
In one of the pictures in the article, it says the new architecture uses the TAGE branch predictor. This is likely based on the work of André Seznec. There are many articles on the implementation (but they can be difficult to understand if you are not already familiar with his work).
I've implemented the bare-bones predictor for a computer architecture course; you can see an abridged version of my presentation slides here[1]. Note this only describes the bare-bones predictor. In more recent work André Seznec added a loop predictor and a statistical corrector to increase the accuracy.
There is some work using TAGE with perceptrons in the statistical corrector.
[1] https://docs.google.com/presentation/d/1aUrwD-ENYPB7pMrCoYmE...
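For anyone who wants the gist without reading the papers, here is a heavily simplified sketch of the bare-bones idea in Python: a bimodal base table plus a few tagged tables indexed with geometrically growing history lengths, where the longest matching table provides the prediction. This is not AMD's design and not Seznec's reference code; `TinyTage`, the table sizes, the hash functions and the allocation policy are assumptions of mine, and the loop predictor and statistical corrector are left out.

```python
class TinyTage:
    """Toy TAGE-like predictor (illustrative only; all parameters are assumptions)."""

    def __init__(self, history_lengths=(4, 8, 16, 32), index_bits=8, tag_bits=9):
        self.hist_lens = history_lengths
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.size = 1 << index_bits
        self.base = [0] * self.size  # counters in [-2, 1]; >= 0 predicts taken
        # Tagged entries are [tag, counter in [-4, 3], useful in [0, 3]] or None.
        self.tables = [[None] * self.size for _ in history_lengths]
        self.history = 0  # global outcome history, newest outcome in bit 0

    @staticmethod
    def _fold(value, bits):
        """XOR-fold an arbitrarily long history down to `bits` bits."""
        folded, mask = 0, (1 << bits) - 1
        while value:
            folded ^= value & mask
            value >>= bits
        return folded

    def _index_and_tag(self, pc, t):
        h = self.history & ((1 << self.hist_lens[t]) - 1)
        index = (pc ^ self._fold(h, self.index_bits)) & (self.size - 1)
        tag = (pc >> self.index_bits) ^ self._fold(h, self.tag_bits)
        tag ^= self._fold(h, self.tag_bits - 1) << 1
        return index, tag & ((1 << self.tag_bits) - 1)

    def _lookup(self, pc):
        pred = alt_pred = self.base[pc & (self.size - 1)] >= 0
        provider = None
        for t in range(len(self.tables)):  # shortest to longest history
            index, tag = self._index_and_tag(pc, t)
            entry = self.tables[t][index]
            if entry is not None and entry[0] == tag:
                alt_pred, pred, provider = pred, entry[1] >= 0, t
        return pred, alt_pred, provider

    def predict(self, pc):
        return self._lookup(pc)[0]

    def update(self, pc, taken):
        """Train on the real outcome; returns what had been predicted."""
        pred, alt_pred, provider = self._lookup(pc)
        if provider is None:
            i = pc & (self.size - 1)
            self.base[i] = min(1, self.base[i] + 1) if taken else max(-2, self.base[i] - 1)
        else:
            index, _ = self._index_and_tag(pc, provider)
            entry = self.tables[provider][index]
            entry[1] = min(3, entry[1] + 1) if taken else max(-4, entry[1] - 1)
            if pred != alt_pred:  # entry is "useful" only if it beat the alternative
                entry[2] = min(3, entry[2] + 1) if pred == taken else max(0, entry[2] - 1)
        if pred != taken:
            # On a misprediction, try to allocate in a longer-history table.
            start = 0 if provider is None else provider + 1
            slots = [(t, *self._index_and_tag(pc, t)) for t in range(start, len(self.tables))]
            for t, index, tag in slots:
                entry = self.tables[t][index]
                if entry is None or entry[2] == 0:
                    self.tables[t][index] = [tag, 0 if taken else -1, 0]
                    break
            else:
                for t, index, _ in slots:  # nothing free: age the candidates
                    self.tables[t][index][2] -= 1
        mask = (1 << max(self.hist_lens)) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
        return pred


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly for a single branch address."""
    return sum(predictor.update(pc, o) == o for o in outcomes) / len(outcomes)
```

A branch that repeats taken, taken, taken, not-taken is a decent sanity check: a plain bimodal counter gets the not-taken case wrong every time (75% accuracy), while `accuracy(TinyTage(), 0x40, [True, True, True, False] * 100)` should end up well above that, because the history-indexed entry learns the loop exit.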
It is nice to see research being applied to new mainstream chips relatively quickly. In complement to your slides, there is a short overview here [1] (this is actually the first search result).
Edit: the picture from AMD in this review makes me think it can hit 16 memory channels with the two socket version. Does anyone know if this is true?
Yes, if the motherboard provides all the necessary slots. The inter-socket communication is achieved by re-purposing CPU pins used for PCIe, not pins used for DRAM. Each CPU has the full 8 DRAM channels of its own.
"There are a total of eight DDR4 memory controllers on this hub chip, the same number in total that were on the Naples complex; both support one DIMM per channel and have two channels per controller, but Rome memory runs slightly faster – 3.2 GHz versus 2.67 GHz – and therefore with all memory slots filled, yields a maximum of 410 GB/sec of peak memory bandwidth per socket. That’s 45 percent higher than the Cascade Lake Xeon SP processor, which has six memory controllers for a total of 282 GB/sec of memory bandwidth running at 2.93 GHz and 21 percent higher than the 340 GB/sec that Naples turns in running that 2.67 GHz DRAM. (Those are ratings for two-socket servers.)"
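A quick back-of-the-envelope check of those figures, as a sketch. I'm assuming the "GHz" numbers in the quote are really transfer rates (DDR4-3200, DDR4-2933, DDR4-2666) and that each channel is 64 bits wide, i.e. 8 bytes per transfer; `peak_bandwidth_gb_s` is just a hypothetical helper name.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, sockets=1, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal), ignoring all overheads."""
    megabytes_per_s = channels * mega_transfers_per_s * bytes_per_transfer * sockets
    return megabytes_per_s / 1000


if __name__ == "__main__":
    # Assumed configurations: (channels per socket, MT/s), each for two sockets.
    systems = {
        "Rome": (8, 3200),
        "Cascade Lake": (6, 2933),
        "Naples": (8, 2666),
    }
    for name, (channels, rate) in systems.items():
        print(name, round(peak_bandwidth_gb_s(channels, rate, sockets=2), 1), "GB/s")
```

Under those assumptions Rome comes out at 409.6 GB/s with 16 channels, i.e. the quoted 410 GB/sec only works as a two-socket number (about 205 GB/s per socket), which matches the parenthetical at the end of the quote. Cascade Lake lands near 282 GB/s and Naples near 341 GB/s the same way.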
I would say though that if you are bringing this up to the code standards of today then this should really be wrapped up in some kind of unit test (https://github.com/sstephenson/bats) for it to pass the PR. That would make the code a bit more maintainable, and the test can be integrated as a stage in your CI/CD pipeline.
If we do that then the intent would be clarified by the input and the expected output of the test. Then the code would at least be maintainable and the readability problem becomes less of an issue when it comes to technical debt.
I've done this plenty of times with my teams and it's certainly helped.