
dragontamer 3y ago

> The explicit sync model is more representative of non coherent caches, which are not really common as they are hard to use.

Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.

But yeah, cache coherence is just assumed on modern CPUs. Your clarification on store queues and load queues is helpful: even if the caches are coherent, the store queue and load queue can still introduce an invalid reordering. So it sounds like your point is that the various sync() instructions are more about these queues?


gpderetta 3y ago

> Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.

Yes, I was specifically referring to general-purpose CPUs; I'm quite unfamiliar with GPUs, but I don't think anybody has ever accused them of being easy to program. Also, I understand that GPUs (and CPU-GPU links) are an area where remote atomics already exist.

> So it sounds like your point is that the various sync() instructions are more about these queues?

For the most part, yes. Specifically, fences enforce ordering on any operation that can execute out of order (even on in-order CPUs, memory ops can be reordered), but only up to the coherence layer (i.e. L1). Ordering from the coherence layer on is enforced by the coherence protocol. You could of course have a model where fences are needed for global coherence, but it would be too slow (having to flush the whole cache), too hard to use (as you would need to specify which lines need to be sync'd), or both.

You could see something like the store buffer as a non-coherent cache (as reads can be fulfilled from it), with fences restoring the coherence, but I don't think it is a terribly useful model.
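To make the store-buffer point concrete, here is a minimal sketch of the classic "store buffering" litmus test in C++. It is an illustration only, not something from the thread: the helper names `both_loads_saw_zero` and `count_forbidden_outcomes` are made up for this example. Each thread stores to one variable and then loads the other. Without the fences, both loads may return 0, because each store can still be sitting in its core's store buffer when the other core's load executes. With a sequentially consistent fence between the store and the load, the C++ memory model forbids that outcome.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs one round of the store-buffering litmus test.
// Returns true if both threads read 0, which the seq_cst fences forbid.
bool both_loads_saw_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        // Orders the store above before the load below; on typical CPUs
        // this means waiting for the store buffer to drain.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: repeats the test and counts forbidden outcomes.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_loads_saw_zero()) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

Calling `count_forbidden_outcomes(500)` should always return 0. Removing the two fences makes the both-zero outcome legal, and it can then show up occasionally even on x86, although spawning fresh threads each round makes it rare. Note that the fences do nothing about cache coherence here; they only order each thread's own store and load.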
