The more commonly used language specs, IME, are C99 and C89!
That's 20 and 30 years of 'stability'.
Yuck.
Just because it works on x86 doesn't mean it works on ARM, MIPS, POWER, or RISC-V. CPUs other than x86 can reorder stores with other stores and loads with other loads. That can cause the CPU to perform the store that starts DMA before the stores that set up the length and address are done!
Or just use the C11 or C++11 memory model. Although that is still not available in too many cases; the curse of having to use an ancient compiler...
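A minimal sketch of what the C11/C++11 model buys you for the DMA case, with the device registers simulated as plain globals (names made up for illustration; real code would use volatile pointers to actual memory-mapped addresses, and the fence may need to be a stronger device barrier on some platforms):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical memory-mapped DMA registers, simulated as globals here.
volatile uint32_t dma_addr;
volatile uint32_t dma_len;
volatile uint32_t dma_start;

void kick_dma(uint32_t addr, uint32_t len) {
    dma_addr = addr;
    dma_len  = len;
    // Release fence: the setup stores above may not be reordered
    // past the "go" store below, on any architecture.
    std::atomic_thread_fence(std::memory_order_release);
    dma_start = 1;
}
```

The fence is what volatile alone cannot give you: volatile constrains the compiler, the fence also constrains the CPU.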
Atomics may be implemented with locks, which makes them unsuitable for signal handlers. The only guaranteed lock-free type is `std::atomic_flag` which is not very useful.
`volatile sig_atomic_t` still seems like the better choice for signals.
It should not be overused, because as the article mentions it makes for slower and more confusing code, but it's not quite something to be afraid of either.
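A minimal sketch of the `volatile sig_atomic_t` pattern for signals (function names are made up; SIGINT is used because it is in ISO C):

```cpp
#include <csignal>

// The only thing a strictly conforming handler should write is a
// volatile sig_atomic_t flag (or a lock-free atomic).
static volatile sig_atomic_t got_signal = 0;

extern "C" void on_sig(int) {
    got_signal = 1;  // async-signal-safe: a single store, no locks
}

void demo() {
    std::signal(SIGINT, on_sig);
    std::raise(SIGINT);  // synchronous delivery: handler runs here
}
```

The `volatile` keeps the compiler from caching the flag in a register across the main loop's repeated checks.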
It is slower to use volatile, and bad form:
https://github.com/torvalds/linux/blob/master/Documentation/...
https://www.mjmwired.net/kernel/Documentation/volatile-consi...
https://www.kernel.org/doc/html/latest/process/volatile-cons...
int* p; // pointer to int
int volatile* p_to_vol; // pointer to volatile int
int* volatile vol_p; // volatile pointer to int
int volatile* volatile vol_p_to_vol; // volatile pointer to volatile int
This method always starts with the most basic type, then adds modifiers sequentially. The modifier binds to everything left of it.

> "Side note: although at first glance this code looks like it fails to account for the case where TCNT1 overflows from 65535 to 0 during the timing run, it actually works properly for all durations between 0 and 65535 ticks."
From example 1, ignoring the device- and setup-specific question of what to do when TCNT1 overflows, it actually works properly for all ticks: both "first" and "second" are unsigned (therefore the behaviour is defined), and the delta between them is always between 0 and 65535, no matter what values they may have, and correct in all cases. E.g.:

timeDelta = timeStampNow - timeStampLast = 0 - 65535 = 1

For issue #5, a possible solution not mentioned could be to write inline assembly, no? It would keep the array non-volatile and should be portable.
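A quick sketch of the unsigned timer-delta arithmetic discussed above (16-bit width mirrors TCNT1; the subtraction is defined to wrap mod 65536, so the delta survives an overflow):

```cpp
#include <cstdint>

// Modular arithmetic on uint16_t: correct even when `now` has
// wrapped past 65535 while `last` has not.
uint16_t tick_delta(uint16_t now, uint16_t last) {
    return static_cast<uint16_t>(now - last);
}
```

For example, tick_delta(0, 65535) gives 1 and tick_delta(10, 65530) gives 16, exactly the elapsed tick counts across the wrap.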
C++ atomics are no good here, because they are not guaranteed to be lock free or address free.
In POSIX's case, it's up to POSIX operating systems to define reasonable semantics on the memory, using constructs like PTHREAD_PROCESS_SHARED and "robust" pthread mutexes.
That's not right; you can still use std::memory_order to get the memory barriers that are required. Those are obviously going to be lock-free, since they deal with memory ordering: what you were trying to achieve with volatile, but in the general case.
See: https://en.cppreference.com/w/cpp/atomic/atomic/store
Effectively, std::atomic stores and loads generate volatile-style accesses plus the memory barriers required to get the desired behavior.
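A minimal sketch of the release/acquire pairing described above (names made up for illustration): the release store in the producer synchronizes with the acquire load in the consumer, so the payload write is guaranteed visible before the flag.

```cpp
#include <atomic>
#include <thread>

int data = 0;                    // plain, non-atomic payload
std::atomic<bool> ready{false};  // flag that carries the ordering

void producer() {
    data = 42;                                     // 1: write payload
    ready.store(true, std::memory_order_release);  // 2: publish flag
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) // 3: wait for flag
        ;
    return data;                                   // 4: guaranteed to see 42
}
```

Run producer and consumer on two threads; the acquire load establishes happens-before, so consumer always returns 42 with no lock anywhere.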
And yes, that's not stupid. It's actually faster than all the register "optimizations" for practical use cases in fast VMs. Register saving across calls and at the GC is much more expensive; mem2reg is mostly an antipattern.
I think a good way of summarizing volatile is this slide from my parallel architectures class [1]:
> Class exercise: describe everything that might occur during the
> execution of this statement
> volatile int x = 10
>
> 1. Write to memory
>
> Now describe everything that might occur during the execution of
> this statement
> int x = 10
>
> 1. Virtual address to physical address conversion (TLB lookup)
> 2. TLB miss
> 3. TLB update (might involve OS)
> 4. OS may need to swap in page to get the appropriate page
> table (load from disk to physical address)
> 5. Cache lookup (tag check)
> 6. Determine line not in cache (need to generate BusRdX)
> 7. Arbitrate for bus
> 8. Win bus, place address, command on bus
> 9. All caches perform snoop (e.g., invalidate their local
> copies of the relevant line)
> 10. Another cache or memory decides it must respond (let’s
> assume it’s memory)
> 11. Memory request sent to memory controller
> 12. Memory controller is itself a scheduler
> 13. Memory controller checks active row in DRAM row buffer.
> (May need to activate new DRAM row. Let’s assume it does.)
> 14. DRAM reads values into row buffer
> 15. Memory arbitrates for data bus
> 16. Memory wins bus
> 17. Memory puts data on bus
> 18. Requesting cache grabs data, updates cache line and tags,
> moves line into exclusive state
> 19. Processor is notified data exists
> 20. Instruction proceeds
> * This list is certainly not complete, it’s just
> what I came up with off the top of my head.
It's also worth mentioning that this assumes a uniprocessor model, so out-of-order execution is still possible, which leads to complications in any sort of multithreaded or networked system (see #5, 6, 7, 8 in the OP article).

I think a lot of the confusion stems from the illusion that a uniprocessor + in-order execution model creates for programmers who have never dealt with system-level code. I think in the future, performant software will require a bit more understanding of the underlying hardware on the part of your average software developer, especially when you care about any sort of parallelism. It doesn't help that almost all common CS curricula ignore parallelism until the 3rd year or more.
[1] http://www.cs.cmu.edu/~418/lectures/12_snoopimpl.pdf - the last 2 slides
Looking at the formatting on the actual slides, I think the 1st is meant to be a question, and the 2nd is the answer. That the first contains the word "volatile" and the second doesn't looks to me like an editing error; they probably both said "volatile" at one time (or didn't) and the proof failed to update one when updating the other.
Isn't it sobering to think that a university slide could have a minor error like that, someone could read it and internalise it as being very important, and then go off and ask interview questions about it (as suggested on the slide!!!!) for the rest of their career!
(Not the fault of the student in this thread, of course.)
The slide is admittedly a bit vague, the point is mostly to convey "lots of complicated things that you probably haven't considered are going on in the background to speed up memory accesses in a uniprocessor model." Keep in mind the class is exploring parallel architectures, and that lecture is about snooping-based cache coherence.
If you want to force the write to actually reach RAM, then perhaps you'd need a memory barrier. This is not my area though. Wrong? Right?
volatile int x;
int y;
int z;
x = 10;
x = 20;
y = x;
z = x;
Answer:
the constant 10 is written to x
the constant 20 is written to x
the contents of x are read and written into y
the contents of x are read and written into z
Now, what happens with this code?

int x;
int y;
int z;
x = 10;
x = 20;
y = x;
z = x;
One answer is the same as the above. Another valid answer is:

the constant 20 is written to x
the constant 20 is written to y
the constant 20 is written to z
Why? Because x is not used between the two assignments, so the first will never be seen. Also, x is not used between its assignment and the assignment to y, so the compiler can do constant propagation.

All volatile does is tell the compiler "all writes must happen, and no caching of reads".
Fun fact: either swap + loopback devices + FUSE/network filesystems, or userfaultfd, means arbitrary userspace code execution, including IO to remote machines, might occur.
That has many benefits, among them the ability to store its value in registers.
If you want lock-free what do you suggest we use instead of volatile?
> which is usually much worse than holding a lock mutex in registers
How can you hold a mutex in a register? That doesn't make any sense.
Lock-free in Java is usually worse than what the JVM can pull off with lock elision.
For a trivial example, see this code:
int f() {
int sum = 0;
for (int i = 0; i < 10; i++) sum += i;
return sum;
}
As you can see from [1], a smart compiler will calculate the sum at compile time and make the function simply return the resulting number (i.e., no loop is generated). If you make "sum" volatile, the compiler is forced to do the loop[2].
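For comparison, a sketch of the volatile variant described above: marking "sum" volatile forces every read and write to actually happen, so the compiler must emit the loop instead of folding the whole function to a constant (the result is still 45 either way).

```cpp
int f_volatile() {
    volatile int sum = 0;    // every access to sum must really occur
    for (int i = 0; i < 10; i++)
        sum += i;            // load, add, store on each iteration
    return sum;              // 0 + 1 + ... + 9 = 45
}
```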