> Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.
Yes, I was specifically referring to general-purpose CPUs; I'm quite unfamiliar with GPUs, but I don't think anybody has ever accused them of being easy to program. Also, I understand that GPUs (and CPU-GPU links) are an area where remote atomics already exist.
> So it sounds like your point is that the various sync() instructions are more about these queues?
For the most part, yes. Specifically, fences enforce ordering on any operation that can execute out of order (even on in-order CPUs, memory ops can be reordered), but only up to the coherence layer (i.e. L1). From the coherence layer on, ordering is enforced by the coherence protocol. You could of course have a model where fences are needed for global coherence, but it would be too slow (having to flush the whole cache), too hard to use (as you would need to specify which lines need to be sync'd), or both.
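To make that concrete, here is a minimal sketch in portable C++ of the usual message-passing pattern. The names (`payload`, `ready`, `run_message_passing`) are made up for illustration. The fences only constrain the order in which each thread's own accesses become visible; nothing in the code flushes or invalidates a cache, because the coherence protocol is assumed to handle that part.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data (hypothetical name)
std::atomic<bool> ready{false};  // flag used to publish the data (hypothetical name)

// Runs one producer/consumer exchange and returns the value the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;
        // Release fence: the write to payload may not be reordered
        // after the flag store below.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: the read of payload may not be reordered
        // before the flag load above.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // guaranteed by the release/acquire fence pairing
    return seen;
}
```

Note there is no notion of "which lines to sync" anywhere: the fences say nothing about addresses, only about ordering.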
You could see something like the store buffer as a non-coherent cache (as reads can be fulfilled from it), with fences restoring the coherence, but I don't think it is a terribly useful model.
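The classic store-buffering litmus test shows the effect in question. This is only a sketch, written with C++ atomics; `count_both_zero` and its parameters are names I made up. Without the fence, each thread's store can still be sitting in its store buffer when it loads the other variable, so both threads may read 0. With a sequentially consistent fence between the store and the load, the C++ memory model forbids that outcome.

```cpp
#include <atomic>
#include <thread>

// Returns how many of `iters` runs ended with both threads reading 0.
// (Hypothetical helper, for illustration only.)
int count_both_zero(bool use_fence, int iters) {
    int both_zero = 0;
    for (int i = 0; i < iters; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            if (use_fence) {
                // Full fence: the store must be globally visible
                // before the load below is performed.
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            if (use_fence) {
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            r2 = x.load(std::memory_order_relaxed);
        });

        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

With `use_fence` set, the count is always 0. Without it, a nonzero count is allowed but not guaranteed to show up; spawning threads per iteration makes the window small, so a real litmus harness would keep the threads running and synchronize them more tightly.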