A Deep Dive into AMD’s Rome Epyc Architecture

(nextplatform.com)

135 points · lamchob · 6y ago · 45 comments

mmrezaie · 6y ago

There must be simulation tools for this kind of architecture, to find the best combination of sizes and components while keeping the design practical! Does anyone know of something like that? A tool to min-max these choices and estimate whether a design can be built with the resources they have.

positr0n · 6y ago

http://gem5.org/Main_Page is an open source CPU simulator.

I used it in undergrad to run benchmarks with different cache sizes and cache coherence strategies to see which were more effective. I'm sure Intel and AMD have much more advanced simulation tools though. Most likely multiple, or at least multiple levels of granularity (so you could do stuff like, simulate these potential branch predictor designs at a gate level, and then turn around and simulate the entire CPU at a higher level of abstraction.)
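To give a flavour of the kind of experiment described above, here is a minimal sketch of a trace-driven cache model in Python. It is not gem5 and models far less than gem5 does; the function name `hit_rate`, the 64-byte line, the 4-way LRU policy, and the synthetic access trace are all assumptions chosen for illustration.

```python
from collections import OrderedDict

def hit_rate(addresses, size_bytes, line_bytes=64, ways=4):
    """Hit rate of a set-associative LRU cache over a list of byte addresses."""
    num_sets = size_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]  # one LRU-ordered dict per set
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        lru = sets[line % num_sets]
        if line in lru:
            hits += 1
            lru.move_to_end(line)        # mark as most recently used
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return hits / len(addresses)

# Synthetic workload (an assumption): 10 sequential passes over a 32 KiB array.
trace = [offset for _ in range(10) for offset in range(0, 32 * 1024, 64)]

for size_kib in (16, 32, 64):
    print(f"{size_kib:>3} KiB: hit rate {hit_rate(trace, size_kib * 1024):.2f}")
```

With this particular trace the 16 KiB configuration thrashes under LRU and never hits, while 32 KiB and larger hit on every access after the first pass. Real studies swap in recorded traces and sweep many more parameters.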

ajross · 6y ago

Tools like that are a core part of the design process. You write that software along with the choice of parametrization of the design. It's not an off the shelf thing. But yes, that's how it works.

It's also important to note that decisions like this are hugely workload-specific. There's no single best processor for all applications. In extreme examples: almost every transistor on a vector SIMD unit is wasted when trying to optimize for a client JavaScript benchmark; streaming symmetric encryption gets no benefit from L3 cache (which is like half the chip these days!); etc.

mmrezaie · 6y ago

Are people like Jim Keller the experts at finding the right balance? Is that why he and his team are so important?

zazagura · 6y ago

Maybe the time has come for application-specific CPU variations?

One optimized for node.js tasks, one for databases, ...

Symmetry · 6y ago

Oh yes. That's been the dominant approach at least since Computer Architecture: A Quantitative Approach came out in '89. The search space is pretty complicated, though, given the physical interactions as well as the logical ones.

sroussey · 6y ago

Sounds like a lot of parameters for some ML setup

willis936 · 6y ago

I’m pretty sure they roll their own. I would be surprised if simulation tools were not the most well guarded secrets of these companies.

chrisseaton · 6y ago

They have many levels of simulation for their processor designs, yes, but they’re proprietary.

repolfx · 6y ago

Yeah but they're often trying to predict where the software industry will go years in advance. You can see that in the disagreements between AMD and Intel on AVX-512: there's a chicken and egg situation where it's not always clear what's right to optimise for, as it depends on changing workloads and software platforms. For instance the big chip companies were caught out by the need for low precision maths for AI inferencing. That all came out of Google.

deepnotderp · 6y ago

So yes, plenty of simulators exist, many internal ones as well as gem5*, but automated search space exploration, say with MCMC, isn't in widespread use yet iirc, although plenty of academic papers have explored the topic.

*which I swear every company has their own version of
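As a rough illustration of what MCMC-flavoured design-space exploration could look like, here is a toy Metropolis-style random walk over cache size and associativity. Everything in it is hypothetical: the candidate grids, the `toy_cost` objective, and the temperature are made up, and a real flow would call a simulator instead of a closed-form cost.

```python
import math
import random

SIZES_KIB = [16, 32, 64, 128, 256, 512]   # candidate cache sizes (hypothetical)
WAYS = [1, 2, 4, 8, 16]                   # candidate associativities (hypothetical)

def toy_cost(size_kib, ways):
    """Made-up objective: a miss term that shrinks with capacity and
    associativity plus an area term that grows with both."""
    miss = 1.0 / math.sqrt(size_kib) + 0.05 / ways
    area = 0.001 * size_kib + 0.005 * ways
    return miss + area

def metropolis_search(steps=2000, temperature=0.01, seed=0):
    """Random walk over the grid with the Metropolis acceptance rule.
    Returns (best cost, size in KiB, ways) seen during the walk."""
    rng = random.Random(seed)
    i = j = 0
    cur = toy_cost(SIZES_KIB[i], WAYS[j])
    best = (cur, SIZES_KIB[i], WAYS[j])
    for _ in range(steps):
        # Propose a neighbouring grid point, clamped to the grid edges.
        ni = min(max(i + rng.choice((-1, 0, 1)), 0), len(SIZES_KIB) - 1)
        nj = min(max(j + rng.choice((-1, 0, 1)), 0), len(WAYS) - 1)
        new = toy_cost(SIZES_KIB[ni], WAYS[nj])
        # Always accept improvements; accept regressions with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / temperature):
            i, j, cur = ni, nj, new
            best = min(best, (cur, SIZES_KIB[i], WAYS[j]))
    return best

print(metropolis_search())
```

On a 30-point grid a brute-force sweep is obviously cheaper; the random walk only starts to pay off when each evaluation is an expensive simulation and the grid has many more dimensions.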

brosenlof · 6y ago

http://gem5.org

mjw1007 · 6y ago

Up until around 2012, realworldtech.com and anandtech.com used to publish rather more detailed descriptions of the microarchitecture inside each core.

Is anyone publishing things like that these days? I mean pages like these:

https://www.realworldtech.com/haswell-cpu/4/ https://www.anandtech.com/show/6355/intels-haswell-architect...

(I noticed that Agner Fog's chapter on Ryzen is conspicuously missing a "Literature" section.)

ksec · 6y ago

AnandTech still does that, it's just no longer written by Anand himself (he works at Apple now). So the writing isn't as good, even though the technical details are still there.

One of the problems is that the market for this kind of review is very much a niche. And just like all forms of free media, if there aren't enough page views they stop doing it.

I have always thought some of these outlets would consolidate. I only ever read AnandTech, ServeTheHome and some Ars, and that is about it. I have RSS headline feeds from a few other sources such as Tom's Hardware and Engadget, but if AnandTech covers the same topic I always go there first.

Not only has that not happened, most of these websites manage to stay afloat by catering to different markets. But I have no idea how the market segmentation works. I can tell that a site like Wccftech is more or less a 100% rumours site with very little, if any, technical knowledge in its writing. And yet it gathers a huge audience.

Meanwhile others like Tom's Hardware seem to have retained enough of their readership to be sustainable.

close04 · 6y ago

Unfortunately while AT still has some great deep-dives for mobile SoCs (top marks to Andrei Frumusanu), the x86 articles have become a bit shallow. And if that wasn't bad enough, they also suggest some bias.

They tend to bang the drum when it comes to Intel but in AMD reviews you'll get things like "Due to bad luck and timing issues we have not been able to test the latest Intel and AMD servers CPU in our most demanding workloads". It's a lot like reviewing a Ferrari but due to bad luck you could only test it in city traffic.

2 years ago they forgot to cover the Threadripper launch for 2 weeks while the front page was flooded with dozens of uninteresting half page articles about Intel motherboards being launched around the same time. I love a good tech article regardless of which brand they're talking about but bias will always kill the experience for me. YMMV I guess.

ENOTTY · 6y ago

Wikichip is my go-to for these things.

throwaway2048 · 6y ago

The servethehome review of Rome is a pretty detailed look at the architecture.

https://www.servethehome.com/amd-epyc-7002-series-rome-deliv...

deepnotderp · 6y ago

Wikichip is nice, but tbh a lot of this stuff ends up in analyst reports nowadays. Look at conference proceedings if you're interested.

rrss · 6y ago

https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...

mjw1007 · 6y ago

That, and the servethehome review, seem to be basically putting the presentation slides into words.

A few years ago they seemed to have additional sources of information (they'd talk about things like instruction-to-port assignments and penalties for moving data between integer and FP domains).

twotwotwo · 6y ago

I think the tech press still tells us what they can, and stuff like execution ports, reorder windows, etc. is still publicly disclosed. AT talked about what was publicly said about Zen 2 (https://www.anandtech.com/show/14525/amd-zen-2-microarchitec...) and Sunny Cove (https://www.anandtech.com/show/14514/examining-intels-ice-la...). And their reviews do try to report the top observable results (memory latencies, relative performance on different kinds of task, power/clock info) and all that's arguably of more practical importance to lots of folks anyway.

There's also just the trend of modern designs being tricky enough it's harder to infer as much about them and harder to write accessibly about what you do know; it's not super easy to figure out and describe, say, modern branch predictors simply because they're all layering a lot of strategies on each other.

For example, from Haswell on, Agner Fog essentially said Intel's large-core branch predictors are good at lots of things but there's not much he can say about how they work (p29 at https://www.agner.org/optimize/microarchitecture.pdf). Writing code to beat Cortex-A76 prefetchers, AT's Andrei Frumusanu had difficulty fooling them with anything other than essentially-random access patterns and compared them to "black magic" (https://twitter.com/andreif7/status/1102230575522430977). These aren't just random folks saying "wow, CPUs are complicated"; they successfully figured out lots of stuff about past generations of CPU.

AMD did reference the TAGE family of branch predictors, which there's lots about in public literature. There might be some broadly interesting stuff in the vendors' contributions to gcc/LLVM (machine models and arch-specific optimizations).

Maybe ARM implementors talk a little more about their stuff? That might have something to do with the dynamics of the relatively open/diverse market for ARM SoCs versus the long-running one-on-one-ish x86 rivalry.

Hard to boil all that down to a single point, but if AMD and Intel want to talk more about the guts of their products, I'm sure plenty of grateful wonks would lap it up. :)

Quequau · 6y ago

I despair that the market is more interested in things like mobile apps and LED equipped RAM than serious in-depth technical reporting on microprocessor internals.

MayeulC · 6y ago

> “We like features that improve both power and performance,” Clark elaborated. “Being on the right path more often is important because the worst use of power is executing instructions that you are just going to throw away. We are not throwing work away after we figure out dynamically that we were wrong to do it. This definitely burns more power on the front end, but it pays dividends on the back end.”

Every documentation I've seen is quite light on the branch prediction improvements. Going by the slides, they improved its accuracy by 1/3; I'd be curious to know how. Side note: if your superscalar is big enough (yeah, those registers use power), couldn't you just get rid of branch prediction at no performance cost (doing something else while waiting for the data)?

My only grudge against Zen (as a consumer) is that the AM4 socket is intended for both APUs and CPUs. While this is a good thing, I have a couple utterly useless video outputs on my motherboard. I would have liked AMD to include some display driver circuitry on every chip. Maybe in the I/O die, if they use such a thing in all of their designs going forward? I mean, I would be quite content with using software rendering when I need to drive a screen, or even spare a bit of memory bandwidth and CPU cycles to drive an extra display from my desktop's graphics card.

piadodjanho · 6y ago

> Every documentation I've seen is quite light on the branch prediction improvements.

In one of the pictures in the article, it says the new architecture uses a TAGE branch predictor. This is likely based on the work of André Seznec. There are many papers on the implementation (but they can be difficult to understand if you are not already familiar with his work).

I've implemented the bare-bones predictor in a computer architecture course; you can see an abridged version of my presentation slides here [1]. Note this only describes the bare-bones predictor. In more recent work, André Seznec added a loop predictor and a statistical corrector to increase the accuracy.

There is some work using TAGE with perceptrons in the statistical corrector.

[1] https://docs.google.com/presentation/d/1aUrwD-ENYPB7pMrCoYmE...
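For readers who want a feel for the bare-bones structure, here is a heavily simplified TAGE-style sketch in Python. It is not AMD's design and not the code from the slides: the class name `ToyTage`, the table sizes, the history lengths (4, 8, 16), the XOR-fold hashes, and the allocation/aging policy are all assumptions picked to keep it short. It keeps the core idea only: a bimodal base table, plus tagged tables indexed with progressively longer global history, where the longest matching table provides the prediction.

```python
class ToyTage:
    """Toy TAGE-style predictor: a bimodal base table plus tagged tables
    indexed with geometrically increasing global-history lengths."""

    def __init__(self, hist_lengths=(4, 8, 16), table_bits=10, tag_bits=8):
        self.hist_lengths = hist_lengths
        self.table_bits, self.tag_bits = table_bits, tag_bits
        self.index_mask = (1 << table_bits) - 1
        self.history = 0                         # global history, newest outcome in bit 0
        self.base = [2] * (1 << table_bits)      # 2-bit counters, 2 = weakly taken
        # Tagged entries are [tag, signed 3-bit counter, useful]; tag None = empty.
        self.tables = [[[None, 0, 0] for _ in range(1 << table_bits)]
                       for _ in hist_lengths]

    @staticmethod
    def _fold(value, bits):
        """XOR-fold an integer down to `bits` bits (illustrative hash only)."""
        out = 0
        while value:
            out ^= value & ((1 << bits) - 1)
            value >>= bits
        return out

    def _slot(self, pc, t):
        h = self.history & ((1 << self.hist_lengths[t]) - 1)
        index = (pc ^ self._fold(h, self.table_bits)) & self.index_mask
        tag = (pc ^ self._fold(h, self.tag_bits)) & ((1 << self.tag_bits) - 1)
        return index, tag

    def _lookup(self, pc):
        """Return (prediction, alternate prediction, provider table or -1)."""
        pred = alt = self.base[pc & self.index_mask] >= 2
        provider = -1
        for t in range(len(self.tables)):
            index, tag = self._slot(pc, t)
            entry = self.tables[t][index]
            if entry[0] == tag:                  # longest matching history wins
                alt, pred, provider = pred, entry[1] >= 0, t
        return pred, alt, provider

    def predict(self, pc):
        return self._lookup(pc)[0]

    def update(self, pc, taken):
        pred, alt, provider = self._lookup(pc)
        if provider < 0:                         # the base table made the prediction
            i = pc & self.index_mask
            self.base[i] = min(3, self.base[i] + 1) if taken else max(0, self.base[i] - 1)
        else:
            entry = self.tables[provider][self._slot(pc, provider)[0]]
            entry[1] = min(3, entry[1] + 1) if taken else max(-4, entry[1] - 1)
            if pred != alt:                      # useful = "did better than the alternate"
                entry[2] = min(3, entry[2] + 1) if pred == taken else max(0, entry[2] - 1)
        if pred != taken:                        # on a mispredict, try a longer history
            for t in range(provider + 1, len(self.tables)):
                index, tag = self._slot(pc, t)
                entry = self.tables[t][index]
                if entry[2] == 0:
                    entry[:] = [tag, 0 if taken else -1, 0]   # allocate, weakly biased
                    break
                entry[2] -= 1                    # age entries that blocked allocation
        max_len = max(self.hist_lengths)
        self.history = ((self.history << 1) | int(taken)) & ((1 << max_len) - 1)


def accuracy(predictor, outcomes, pc=0x40):
    """Fraction of correct predictions for one branch at a hypothetical PC."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)

# A loop branch: taken 7 times, then not taken once, repeated.
loop = ([True] * 7 + [False]) * 200
print(f"accuracy on the loop pattern: {accuracy(ToyTage(), loop):.3f}")
```

On this pattern a lone 2-bit counter mispredicts every loop exit (7 of 8 correct), whereas the table with 8 bits of history can tell the exit apart from the other iterations. The loop predictor and statistical corrector mentioned above are not modelled here.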

MayeulC · 6y ago

Thank you, I hadn't realized those branch predictors were actually documented, and thought that they were referring to internal names.

It is nice to see research being applied to new mainstream chips relatively quickly. In complement to your slides, there is a short overview here [1] (this is actually the first search result).

shaklee3 · 6y ago

This didn't really seem like a deep dive compared to the AnandTech article. I was hoping for some memory bandwidth benchmarks, since this should be the first chip that has 8 channels without caveats (looking at you, POWER9). It's also not clear if it's 16 channels with 2S, but I suspect not.

Edit: the picture from AMD in this review makes me think it can hit 16 memory channels with the two socket version. Does anyone know if this is true?

wtallis · 6y ago

> the picture from AMD in this review makes me think it can hit 16 memory channels with the two socket version. Does anyone know if this is true?

Yes, if the motherboard provides all the necessary slots. The inter-socket communication is achieved by re-purposing CPU pins used for PCIe, not pins used for DRAM. Each CPU has the full 8 DRAM channels of its own.

LargoLasskhyfv · 6y ago

Somewhere at 50 to 60% down in the article:

"There are a total of eight DDR4 memory controllers on this hub chip, the same number in total that were on the Naples complex; both support one DIMM per channel and have two channels per controller, but Rome memory runs slightly faster – 3.2 GHz versus 2.67 GHz – and therefore with all memory slots filled, yields a maximum of 410 GB/sec of peak memory bandwidth per socket. That’s 45 percent higher than the Cascade Lake Xeon SP processor, which has six memory controllers for a total of 282 GB/sec of memory bandwidth running at 2.93 GHz and 21 percent higher than the 340 GB/sec that Naples turns in running that 2.67 GHz DRAM. (Those are ratings for two-socket servers.)"
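The quoted figures can be roughly reproduced with the usual peak-bandwidth arithmetic: channels × transfers per second × 8 bytes per transfer (a 64-bit DDR4 channel), summed over two sockets. The sketch below assumes the standard DDR4-3200, DDR4-2933 and DDR4-2666 speed grades stand behind the rounded "GHz" numbers in the quote; the function name is just for illustration.

```python
def peak_bandwidth_gb_s(channels, mt_per_s, sockets=2, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return channels * mt_per_s * bytes_per_transfer * sockets / 1000

rome = peak_bandwidth_gb_s(8, 3200)          # assumed DDR4-3200
cascade_lake = peak_bandwidth_gb_s(6, 2933)  # assumed DDR4-2933
naples = peak_bandwidth_gb_s(8, 2666)        # assumed DDR4-2666

print(f"Rome:         {rome:.1f} GB/s")
print(f"Cascade Lake: {cascade_lake:.1f} GB/s")
print(f"Naples:       {naples:.1f} GB/s")
print(f"Rome vs Cascade Lake: {rome / cascade_lake - 1:+.0%}")
print(f"Rome vs Naples:       {rome / naples - 1:+.0%}")
```

This gives about 410, 282 and 341 GB/s, so the 410 GB/sec in the quote corresponds to two sockets (16 channels), or roughly 205 GB/s per socket. The unrounded Rome-over-Naples gain comes out nearer 20 percent; the article's 21 percent follows from its rounded 340 GB/sec figure.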

thinkersilver · 6y ago

The poster is holding a line of bash to the standard of code, illustrating that readability should be the goal and showing a way of bringing bash commands up to a standard of readability suitable for something like a PR. Readability is really there to show _intent_.

I would say though that if you are bringing this up to the code standards of today, then it should really be wrapped in some kind of unit test (https://github.com/sstephenson/bats) for it to pass the PR. That would make the code a bit more maintainable, and the test can be integrated as a stage in your CI/CD pipeline.

If we do that, then the intent is clarified by the input and the expected output of the test. Then the code is at least maintainable, and the readability problem becomes less of an issue when it comes to technical debt.

I've done this plenty of times with my teams and it's certainly helped.

insulanus · 6y ago

Are you replying to this thread? https://news.ycombinator.com/item?id=20724679

thinkersilver · 6y ago

Yes I was. I've posted the comment to the correct story now. I don't know how that happened.

ramshanker · 6y ago

My gut feeling is that Intel also lays out / develops the IO block and cores separately. It's just that they are all put on a single piece of silicon.

chx · 6y ago

But separate silicon is what gives AMD an almost insurmountable cost advantage. They can bin each chiplet separately, their yields are much higher because each die is smaller and the cherry on top is the different, cheaper process for the I/O die.
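The yield part of that argument can be sketched with the simple Poisson yield model, Y = exp(-D × A), where D is the defect density and A the die area. The defect density and die areas below are hypothetical round numbers, not AMD or foundry data, and the model ignores binning, packaging cost and the cheaper I/O process.

```python
import math

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under the simple Poisson yield model."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)

DEFECT_DENSITY = 0.2  # defects per cm^2 (hypothetical)

monolithic = poisson_yield(600, DEFECT_DENSITY)  # one hypothetical 600 mm^2 die
chiplet = poisson_yield(75, DEFECT_DENSITY)      # one hypothetical 75 mm^2 chiplet

print(f"monolithic die yield: {monolithic:.1%}")
print(f"single chiplet yield: {chiplet:.1%}")
```

Under these assumptions roughly 30% of the large dies are defect-free versus about 86% of the small ones. Since a defect only costs one small chiplet rather than a whole 600 mm² die, far less good silicon gets thrown away.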
