> the only part that could theoretically make sense is that the compiler could emit non-temporal store instructions to bypass the cache. I know compilers currently don't do that for volatile, but I don't know why.
Two reasons:
First, using nontemporal accesses would break programs that mix volatile and non-volatile accesses to the same memory: nontemporal stores are weakly ordered, so without an explicit fence, plain loads elsewhere could observe them out of order. Mixing accesses like that is not defined by the C standard, but some programs rely on it anyway.
Second, more importantly: why would they?
- If the address you’re accessing points to hardware registers, the page table entry should be marked non-cacheable, which makes nontemporal accesses unnecessary. And if for some reason it’s not marked properly, nontemporal accesses wouldn’t be sufficient to guarantee that things work anyway, because nontemporal is just a hint which the hardware may not respect. In any case, at least on x86, nontemporal loads only exist for 128-bit and wider accesses (MOVNTDQA and friends); MOVNTI can do 32- or 64-bit nontemporal stores, but that still wouldn’t cover hardware registers, which generally require you to use a specific access size for both reads and writes.
- If the address you’re using points to regular memory, on the other hand, volatile is probably being used to implement atomics, in which case bypassing the cache is unnecessary and also slow. In theory, compilers could compile volatile into accesses surrounded by memory barrier instructions, which would enforce a stronger memory ordering (while being faster than bypassing the cache entirely), especially useful on architectures with weaker memory models than x86. In fact, that’s what volatile does in Java. But in C, it’s pretty long-established that volatile accesses should just compile to regular load/store instructions, and any necessary barriers must be inserted manually. People writing high-performance code wouldn’t be happy if the compiler started inserting unnecessary barrier instructions for them… In any case, usage of volatile for atomics is deprecated in favor of C/C++11 atomics, which do insert barriers for you.