https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...
https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...
Your 642 GB/s figure would be for a single Golden Cove core, and at that rate it should take only 3 Golden Cove cores to saturate the 1.6 TB/sec HBM2e in Xeon Max. Yet when measured, internal bottlenecks prevented even 56 Golden Cove cores from reaching the 642 GB/s read bandwidth you predicted a single core could achieve: peak read bandwidth was 590 GB/sec with all 56 cores reading.
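To make the "3 cores" arithmetic explicit, here is the division, using the 1.6 TB/s HBM2e figure and the claimed 642 GB/s per-core rate:

```python
import math

hbm_bw_gb_s = 1600   # Xeon Max HBM2e peak bandwidth (1.6 TB/s)
per_core_gb_s = 642  # claimed per-core Golden Cove read bandwidth

cores_needed = math.ceil(hbm_bw_gb_s / per_core_gb_s)
print(cores_needed)  # 3
```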
According to the slides, peak read bandwidth for a single Golden Cove core in the Sapphire Rapids CPU that they tested is theoretically 23.6 GB/sec and was measured at 22 GB/sec.
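The way McCalpin-style single-core limits are usually derived is Little's Law: sustained bandwidth equals in-flight cache lines times line size divided by memory latency. A sketch with illustrative numbers (the buffer count and latency below are my assumptions, not figures from the slides; they happen to reproduce the 23.6 GB/s bound):

```python
# Little's Law for a concurrency-limited core:
# bandwidth = outstanding cache lines * line size / memory latency
line_bytes = 64    # cache line size
in_flight = 48     # assumed outstanding misses a single core can sustain
latency_ns = 130   # assumed HBM2e load latency

bw_gb_s = in_flight * line_bytes / latency_ns  # bytes/ns equals GB/s
print(f"{bw_gb_s:.1f} GB/s")  # 23.6 GB/s
```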
Chips and Cheese did read bandwidth measurements on a non-HBM2e version of Sapphire Rapids:
https://chipsandcheese.com/p/a-peek-at-sapphire-rapids
They do not give an exact figure for multithreaded L3 cache bandwidth, but judging from their chart, it is around what TACC measured for HBM2e. For single-threaded reads, it is about 32 GB/sec from L3 cache, which is not much better than reads from HBM2e and is presumably the effect of L3 cache's lower latency. The Chips and Cheese chart also shows Sapphire Rapids reaching around 450 GB/sec single-threaded read bandwidth from L1 cache. That is also significantly below your 642 GB/sec prediction.
The 450 GB/sec bandwidth out of L1 cache is likely a side effect of low-latency L1 accesses, which are the real purpose of L1 cache. Reaching that level of bandwidth out of L1 cache is unlikely to be very useful, since bandwidth-limited operations work on far more memory than fits in cache, especially L1 cache. Even when L1 cache bandwidth does count, the speed boost lasts a maximum of about 180ns, which is negligible.
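One plausible reading of the ~180ns figure (my interpretation, not stated above): stream out the entire ~80 KB of Golden Cove L1 (48 KB data plus 32 KB instruction) at the ~450 GB/s single-threaded L1 read bandwidth from the chart:

```python
# Time to drain the whole L1 at full read bandwidth (assumed sizes)
l1_bytes = 80 * 1024    # 48 KB L1D + 32 KB L1I on Golden Cove
bw_bytes_per_ns = 450   # 450 GB/s is 450 bytes per nanosecond

drain_ns = l1_bytes / bw_bytes_per_ns
print(f"{drain_ns:.0f} ns")  # ~182 ns
```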
What bandwidth CPU cores should be able to achieve based on loads/stores per clock and what bandwidth they actually achieve are rarely in agreement. The difference is often called the Von Neumann bottleneck.
Correct.
> That is also significantly below your 642 GB/sec prediction.
Not exactly a prediction. It's an extract from one of the Chips and Cheese articles, specifically the one that covers the architectural details of the Golden Cove core rather than the Sapphire Rapids core. See https://chipsandcheese.com/p/popping-the-hood-on-golden-cove
From that article, their experiment shows that a Golden Cove core was able to sustain 642 GB/s from L1 cache with AVX-512.
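A back-of-the-envelope check on that number, assuming Golden Cove issues two 64-byte AVX-512 loads per cycle and the test ran near a 5.2 GHz boost clock (the clock is my assumption, not a figure from the article):

```python
loads_per_cycle = 2   # two 64-byte AVX-512 loads per cycle
bytes_per_load = 64
clock_ghz = 5.2       # assumed boost clock during the test

peak_gb_s = loads_per_cycle * bytes_per_load * clock_ghz
print(peak_gb_s)  # 665.6 GB/s theoretical; 642 GB/s measured is ~96% of it
```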
> They do not give an exact figure for multithreaded L3 cache bandwidth,
They quite literally do: it's in the graph in the "Multi-threaded Bandwidth" section. A 32-core Xeon Platinum 8480 instance was able to sustain 534 GB/s from L3 cache.
> The Chips and Cheese chart also shows that Sapphire Rapids reaches around 450 GB/sec single threaded read bandwidth for L1 cache.
If you look closely at the comment you're referring to, you will see that I explicitly referred to the Golden Cove core and not the Sapphire Rapids core. I am not being pedantic here; they're actually different things.
And yes, Sapphire Rapids reaches 450 GB/s from L1 for AVX-512 workloads. But the SPR core is also clocked at 3.8 GHz, which is much lower than the Golden Cove core's 5.2 GHz, and this is where the difference of ~200 GB/s comes from.
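The clock-scaling claim checks out numerically, assuming L1 bandwidth scales linearly with core clock:

```python
golden_cove_gb_s = 642.0  # measured Golden Cove L1 bandwidth at 5.2 GHz
gc_clock_ghz = 5.2
spr_clock_ghz = 3.8

scaled = golden_cove_gb_s * spr_clock_ghz / gc_clock_ghz
print(f"{scaled:.0f} GB/s")  # ~469 GB/s, close to the ~450 GB/s measured on SPR
```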
> Reaching that level of bandwidth out of L1 cache is not likely to be very useful, since bandwidth limited operations will operate on far bigger amounts of memory than fit in cache, especially L1 cache
With that said, both Intel and AMD are limited by system memory bandwidth, and both sit somewhere in the range of ~100ns per memory access. The actual bandwidth value will depend on the number of cores per chip, but the bandwidth is roughly the same since it depends heavily on the DDR interface and its speed.
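To illustrate how the DDR interface sets the ceiling, here is the per-socket peak for two contemporary configurations (channel counts and speeds are the commonly published ones: 8 channels of DDR5-4800 on Sapphire Rapids, 12 on Genoa):

```python
# Peak DRAM bandwidth from the DDR interface alone:
# channels * transfers/sec * bytes per transfer (64-bit channel = 8 bytes)
def ddr_bw_gb_s(channels, mt_per_s, bytes_per_transfer=8):
    return channels * mt_per_s * bytes_per_transfer / 1000

print(ddr_bw_gb_s(8, 4800))   # 307.2 GB/s (Sapphire Rapids, 8ch DDR5-4800)
print(ddr_bw_gb_s(12, 4800))  # 460.8 GB/s (Genoa, 12ch DDR5-4800)
```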
Does that mean that both Intel and AMD have basically the same compute capabilities for workloads that do not fit into CPU cache?
And AMD just spent 7 years of engineering effort implementing what now looks like a superior CPU cache design and vectorized (SIMD) execution capabilities, only for it to be applicable to the very few (mostly unimportant in the grand scheme of things) workloads that actually fit into the CPU cache?
I'm not sure I follow this reasoning, but if true, then AMD and Intel have nothing to compete on, since by the logic of CPU caches being limited in applicability, their designs are equally good for the most $$$ workloads.