* actually, not even that recent, Zen planted this hope in my brain.
Real-world examples are ScyllaDB and Redpanda, both built on the Seastar framework (C++, https://seastar.io/message-passing/).
And for Rust there is glommio https://www.datadoghq.com/blog/engineering/introducing-glomm...
[0] https://github.com/bytedance/monoio/blob/master/docs/en/benc...
I don’t know what I’d do if I had an old Zen machine, maybe map an MPI process to each chiplet.
My impression is that on first-generation Zen machines, the cost of communicating from one chiplet to another was really quite significant, but they’ve made good enough progress there that it is now only something the really hardcore folks care about.
https://parallella.org/2015/05/25/how-the-do-i-program-the-p...
"Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores"
Performance degradation would greatly depend on how much data was actually touched by the workload outside the server and not solely by the fact that 75% of the memory was attached through CXL, no?
NUMA latency I measured last time on a dual-socket Xeon (Haswell) system was around 130ns for non-local memory access and 90ns for local memory access. OTOH some numbers I found seem to imply that the CXL latency is ~200ns.
This means that CXL latency (~200ns) is roughly 55% higher than remote NUMA access (130ns) and more than double local access (90ns), so I think only 10% performance degradation is not realistic unless most of your workload fits into L1/L2/L3 cache plus that 25% of local memory, or your workload is more CPU-bound than memory-bound.
And on architectures where some cores share faster paths than others, gradations could be scheduled that way.
Having a super short lightweight protocol like CXL.mem to talk over such fast fabric has so much killer potential.
These graphs are always such a delight to see. It's a network map of how well connected cores are, and they reveal so many particular advantages and disadvantages of the greater system architecture.
Exciting to see that capability becoming more standardized with CXL.
Edit: phrasing.
Am I seeing that none of these processors implement a toroidal communication path? I thought that was considered basic cluster topology these days so I’m surprised that multi core chips don’t implement it.
If your chip is fabricated on the surface of a torus, or on a rectangle in highly curved space, then it would be a very natural architecture. But I am not aware of any chips that are.
The middle of the chip could contain logic.
>If Pentium could run at 3 GHz and the FSB got a proportional clock speed increase, core to core latency would be just over 20 ns.
Ran the test against my closest equivalent.
    CPU: Intel(R) Celeron(R) G5905T CPU @ 3.30GHz
    Num cores: 2
    Num iterations per samples: 5000
    Num samples: 300

    1) CAS latency on a single shared cache line

             0       1
        0
        1   25±0

    Min latency: 25.3ns ±0.2 cores: (1,0)
    Max latency: 25.3ns ±0.2 cores: (1,0)
    Mean latency: 25.3ns
Just wish I had a dual socket Pentium for the last 40 years.