
nkurz 10y ago

I understand that most others share your position, and I mostly agree when it comes to volatile variables. I'd also agree with you if removing "volatile" caused the code to break. But I think it can be necessary for performance, and I don't think there are true downsides. I believe that if you are using assembly, it is because you don't want the compiler to attempt any further optimizations. When I drop to assembly, it's because I've already decided on the register allocation and instruction ordering I want, and I will verify the assembly that is generated.

My goal is to "lock in" an established level of performance once I've achieved it, so that compiler upgrades or changes don't result in performance drops. I often compare the output of multiple compilers with a matrix of optimization flags, choose the best blocks from each, and then hand-optimize from there while cross-referencing Agner's handbooks with Likwid's performance reports. If I've chosen to use inline assembly, the chances that the compiler will succeed in further optimizing my code are very low.

I realize it's not a popular view, but I think that using volatile with __asm is usually the correct approach. If you don't need "volatile", you probably should be using an intrinsic instead. I think the alternative (which may in fact be the better solution) is dropping to straight assembly for the entire function or distributing binary code.


klodolph 10y ago

Yes, that is really a better solution: to write the whole function in assembly. "volatile" is just a poor substitute for that.

nkurz (OP) 10y ago

Other than the "code smell", what do you see as the main dangers of using "__asm volatile" rather than just "__asm"? Assuming that there are cases where I do get significantly better performance from specifying the exact ordering of instructions, what can I do to minimize these dangers while keeping the better performance?

klodolph 10y ago

The first danger is that "asm volatile" is basically a hack to get the output you want from the compiler. But the compiler is a rather complicated piece of software, and there is no guarantee that future versions of the compiler will still give you the desired output. Perhaps it works correctly now, but if you change your optimization settings, are you sure that something unexpected won't happen? Remember that "asm volatile" can still be moved around. From the GCC manual[1]:

> Do not expect a sequence of asm statements to remain perfectly consecutive after compilation, even when you are using the volatile qualifier. If certain instructions need to remain consecutive in the output, put them in a single multi-instruction asm statement.

The second danger is that "asm volatile" hides incorrect operand specification. You might examine the compiler output, see the wrong assembly, and find that adding "volatile" fixes it. However, the incorrect operand specification can still cause problems in other parts of the code, and those are harder to diagnose. Stack Overflow is littered with questions from people who specify asm operands wrong and add "volatile" to fix the assembly, only to find that other things are still broken. My general procedure is to work with asm blocks at -O2 or higher without using volatile, and make sure I'm getting the desired results that way (unless I'm writing some synchronization primitives).
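
As a minimal sketch of what a fully specified operand list can look like: the helper name `rotl32_asm` is hypothetical, and the asm path assumes GCC or Clang on x86-64, with a plain C fallback elsewhere. The rotated value is declared read-write with "+r", the count is pinned to `cl` with "c", and the flags are listed as clobbered, so nothing here should need "volatile".

```c
#include <stdint.h>

/* Hypothetical helper: rotate x left by n bits (n is taken modulo 32). */
static inline uint32_t rotl32_asm(uint32_t x, uint8_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* "+r": x is both read and written. Listing it as an input only would
     * let the compiler assume the register still holds the old value.
     * "c":  the variable shift count must live in %cl.
     * "cc": ROL modifies the flags register. */
    __asm__ ("roll %%cl, %[x]"
             : [x] "+r" (x)
             : "c" (n)
             : "cc");
    return x;
#else
    /* Portable fallback for other targets (an assumption of this sketch). */
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
#endif
}
```

Because the result is a declared output that the caller uses, the compiler has no reason to delete the block, and it remains free to drop it when the result really is unused.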

Yet it is just so damn easy to write larger, multi-statement asm blocks. With larger blocks, the intent of the programmer is clear. It becomes obvious to both the reader and to the compiler that the assembly should be emitted as-is, rather than moved or reordered.
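
Here is a rough sketch of the single multi-instruction statement the GCC manual recommends. The function name `double_then_add` is made up for illustration, and the asm path assumes GCC or Clang on x86-64 (other targets fall back to plain C). The two instructions sit in one asm statement, so the compiler cannot place anything between them.

```c
#include <stdint.h>

/* Hypothetical helper: returns 2 * a + b. */
static inline uint64_t double_then_add(uint64_t a, uint64_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t acc = a;
    /* One statement, two instructions: they are emitted back to back.
     * "+&r": acc is read-write and early-clobbered, because it is written
     *        before b is read; without "&" the compiler could place b in
     *        the same register when a and b hold the same value.
     * "cc":  ADD modifies the flags register. */
    __asm__ ("addq %[acc], %[acc]\n\t"   /* acc = 2 * a     */
             "addq %[b], %[acc]"         /* acc = 2 * a + b */
             : [acc] "+&r" (acc)
             : [b] "r" (b)
             : "cc");
    return acc;
#else
    /* Portable fallback for other targets (an assumption of this sketch). */
    return 2 * a + b;
#endif
}
```

Splitting this into two separate asm statements would give the compiler permission to schedule other code between them, with or without "volatile".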

Finally, you can often get the results you want with the auto-vectorizer, restrict, and __builtin_assume_aligned. Whenever that is possible I'd prefer it.
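
A small sketch of that approach, assuming GCC or Clang; the function name `scale_add` and the 32-byte alignment are illustrative assumptions, not something from the discussion above. `restrict` promises the arrays don't overlap, and `__builtin_assume_aligned` promises the alignment, which is usually what the auto-vectorizer needs to emit aligned vector loads and stores at -O2 or -O3.

```c
#include <stddef.h>

/* Hypothetical kernel: dst[i] += k * src[i] for i in [0, n).
 * The caller must pass non-overlapping, 32-byte-aligned arrays;
 * violating either promise is undefined behavior. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    /* Tell the compiler what alignment it may assume. */
    dst = __builtin_assume_aligned(dst, 32);
    src = __builtin_assume_aligned(src, 32);

    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Whether the loop actually gets vectorized still depends on the compiler version and flags, so it is worth checking the generated assembly just as you would with inline asm.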

[1]: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
