* actually, not even that recent, Zen planted this hope in my brain.
Real-world examples are ScyllaDB and Redpanda, both built on the Seastar framework (C++, https://seastar.io/message-passing/).
And for Rust there is glommio https://www.datadoghq.com/blog/engineering/introducing-glomm...
[0] https://github.com/bytedance/monoio/blob/master/docs/en/benc...
I don’t know what I’d do if I had an old Zen machine, maybe map an MPI process to each chiplet.
My impression is that on first-generation Zen machines, the cost of communicating from one chiplet to another was really quite significant, but they’ve made good enough progress there that it is now only something the really hardcore folks care about.
https://parallella.org/2015/05/25/how-the-do-i-program-the-p...
"Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores"
Performance degradation would greatly depend on how much data was actually touched by the workload outside the server and not solely by the fact that 75% of the memory was attached through CXL, no?
NUMA latency I measured last time on a dual-socket Xeon (Haswell) system was around 130ns for non-local memory access and 90ns for local memory access. OTOH some numbers I found seem to imply that the CXL latency is ~200ns.
This means that CXL latency (~200ns) is roughly 55% higher than remote NUMA access (130ns) and more than double local access (90ns), so I think only 10% performance degradation is not realistic unless most of your workload fits into L1/L2/L3 cache plus that 25% of local memory, or your workload is more CPU-bound than memory-bound.
And on architectures where some cores share faster paths than others, gradations could be scheduled that way.
Having a super short lightweight protocol like CXL.mem to talk over such fast fabric has so much killer potential.
These graphs are always such a delight to see. It's a network map of how well connected cores are, and they reveal so many particular advantages and disadvantages of the greater system architecture.
Exciting to see that capability becoming more standardized with CXL.
Edit: phrasing.
Am I seeing that none of these processors implement a toroidal communication path? I thought that was considered basic cluster topology these days so I’m surprised that multi core chips don’t implement it.
If your chip is fabricated on the surface of a torus, or on a rectangle in highly curved space, then it would be a very natural architecture. But I am not aware of any chips that are.
The middle of the chip could contain logic.
>If Pentium could run at 3 GHz and the FSB got a proportional clock speed increase, core to core latency would be just over 20 ns.
Ran the test against my closest equivalent.
    CPU: Intel(R) Celeron(R) G5905T CPU @ 3.30GHz
    Num cores: 2
    Num iterations per samples: 5000
    Num samples: 300

    1) CAS latency on a single shared cache line

             0       1
        0
        1   25±0

    Min latency: 25.3ns ±0.2 cores: (1,0)
    Max latency: 25.3ns ±0.2 cores: (1,0)
    Mean latency: 25.3ns
Just wish I had a dual socket Pentium for the last 40 years.