It also has a bandwidth problem: if 64 threads are vying for access, you either build it with few access ports and it gets choked, or with many access ports, which is costly in area, power, & speed.
Two separate peer caches automatically have twice the bandwidth of one similar double-size cache, for the price of NUMA & cache coherency challenges.
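The port-limited bandwidth argument can be sketched as a toy model (the numbers and function below are illustrative assumptions, not from the text): peak throughput is capped by the number of access ports, so splitting one cache into two peers doubles the aggregate port count and thus the peak bandwidth.

```python
def effective_bandwidth(threads, ports, per_port_bw=1.0):
    """Toy model: at most one access per port per cycle,
    so peak bandwidth is min(threads, ports) * per-port rate."""
    return min(threads, ports) * per_port_bw

# One double-size cache with 2 ports, 64 threads contending:
single = effective_bandwidth(threads=64, ports=2)

# Two peer caches, each with 2 ports, threads split between them:
split = 2 * effective_bandwidth(threads=32, ports=2)

# The split configuration has twice the peak bandwidth --
# the cost (NUMA, coherency traffic) doesn't appear in this model.
```

This ignores everything that makes the trade-off hard in practice (coherency misses, uneven thread placement), which is exactly the point: the raw bandwidth win is structural, and the costs show up elsewhere.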
There is no one right answer here. Bandwidth is far more important and coherency much easier in a small L1; as you go down the hierarchy, bandwidth needs shrink and coherency grows more expensive.