As the paper notes, there are plenty of alternative C compilers available to choose from. GCC and LLVM ended up with overwhelming market share simply because they produce the fastest code, because, at the end of the day, that is what users want.
If you want to blame someone, blame the designers of the C language for doing things like making int the natural idiom to iterate over arrays even when size_t would be better. The fact that C programmers continue to write "for (int i = 0; i < n; i++)" to iterate over an array is why signed overflow is undefined, and it is absolutely a critical optimization in practice.
No, I think it's more because they are free.
In my experience, ICC can be much better at instruction selection while also not being so crazy with exploiting UB.
Besides, it's irrelevant: there are lots of free C compilers that don't exploit UB, and also rarely get used.
That might mean the differences have mostly disappeared, but that may depend on what the front end (icx vs clang) does.
Well, size_t is unsigned and has defined overflow, so you'd lose the optimization if you switched to it. (Specifically, there's cases where defining overflow means a loop is possibly infinite, which blocks all kinds of optimizations.)
Many languages try to fix this by defaulting to wrap on overflow, but that was a mistake because you rarely actually want that. A better solution is to have a loop iteration statement that doesn't have an explicit "int i" or "i++" written out.
size_t obviates the need for this optimization.
As for the optimization, it is based on a misunderstanding of C semantics. The only place where the sign extend makes a difference is where pointers are longer than "ints" AND where the iterator can overflow, and in that case, sign extend only makes a difference if the loop is incorrectly coded so that the end condition is never true. The code should just provoke a warning and then omit the sign extend (and it almost certainly doesn't make much of a difference, since sign extend is highly optimized and has essentially zero cost in a pipelined processor).
That said, I believe the set of undefined behaviours in our current standards is much, much too large - most of these should rightly be filed in the implementation-defined category instead. It is no longer the 70s and the very same modern compilers that perform more and more extreme optimisations year on year really do not need to account for a giant zoo of quirky experimental architectures; over the decades we've basically settled on a consensus on how pointers, integers, floating-point numbers etc ought to work.
There is a large population of C "real programmers" who, when they write a C program that unsurprisingly doesn't work, conclude this must be somebody else's fault. After all, as a real programmer they certainly meant for their program to work, and so the fact it doesn't can't very well be their fault.
Such programmers tend to favour very terse styles, because if you don't write much then it can't be said to be your fault when, invariably, it doesn't do what you intended. It must instead be somebody else's fault for misunderstanding. The compiler is wrong, the thing you wanted was the obviously and indeed only correct interpretation and the compiler is willfully misinterpreting your program as written.
Such programmers of course don't want an error diagnostic. Their program doesn't have an error, it's correct. The compilers are wrong. Likewise newer better languages are unsuitable because the terse, meaningless programs won't compile in such languages. New languages often demand specificity, the real programmer is obliged to spell out what they meant, which introduces the possibility that they're capable of mistakes because what they meant was plain wrong.
At least the warnings are getting a bit better for some of these.
In Fortran the aliasing rules are even stricter: given two arrays passed in as arguments the compiler can assume that they do not overlap, for example. I remember messing that up as a student long ago and getting strange results. The Fortran rule was to enable vectorization, which has been done for many decades.
In any case, for people who want to write OS or cryptography or embedded systems or arithmetic libraries or ... in C, this is not a relevant point.
How much slower would removing the UB footgun make scientific code? Do you have benchmarks? Because we have endless firsthand accounts that it makes standard C useless.
Due to this, it's impossible to write C without using a ton of non-standard attributes and compiler options just to make it do the correct thing.
That is what it does. "Undefined behavior" is a lack of semantics, so it is preserving semantics when it leaves those paths out. You can make a "defined C" with `-fsanitize-trap=undefined`, but C was never a high level assembler, and performance is critical for C users too.
This gets more complicated when an inlined function calls another inlined function. The whole inner function may be in the middle of dead code, and thus everything it calls, too, and so all be elided. This sort of thing happens all the time. These elisions don't typically make code obviously much faster, cycle-wise, but they do make it smaller, with a smaller cache footprint. Cache footprint has a huge effect on overall performance.
In principle, the compiler could arrange to put the dead code in some other page, with a conditional jump there that never happens, consuming just another branch prediction slot. But branch prediction slots are cache, and the branch, though never taken, nonetheless consumes a slot. A processor ISA could, in principle, have a conditional branch instruction that is assumed by branch prediction always to go one way, and so not need to consume any branch prediction cache. But I don't know of any. RISC-V does not seem to be entertaining plans to specify such a branch.
Several extant chips do accept "hint" prefixes for branches, but I think current chips ignore them, experience having shown the hints to be wrong more often than not. This is unfortunate, as sometimes the most frequently taken direction is not the one you want to be fastest. E.g., when spinning while watching an atomic flag, you want the looping branch to be assumed not taken, to minimize latency once the flag is clear, even though it most frequently is taken in recent history. (High-frequency trading code often has to resort to trickery to get the desired behavior.)
(There is a famous story about engineers at Digital Equipment Corporation, whence we got PDP-11s and Vaxen, and thus, indirectly, Unix. A comprehensive cycle count showed that a huge proportion of the instructions executed in their OS kernel were just one instruction. They moved heaven and earth to optimize this one instruction, but the kernel with the new instruction was exactly 0% faster. It turned out the instruction was used only in the idle loop that ran while waiting until something useful could be done. This really happened, and the same mistake is made again and again, to this day: e.g., people will sincerely swear that "rotl" and "popcnt" instructions are not important, on about the same basis.)
Intel recently implemented an atomic test-and-loop instruction that just stalls execution until an identified cache line is touched, thus avoiding the branch prediction stall getting out of the loop. I have not heard of anybody using it.
The reason they claim program optimizations aren't important is that you can do them by hand for a specific architecture fairly easily, but you'll still want them when porting to a new one, e.g. if it wants loop counters to go in the opposite direction.
Does it? Honest question; my impression was that template-heavy code can tend to produce deep call trees, but not necessarily outright unreachable code unless you count instantiations ruled out by SFINAE/std::enable_if/tag dispatching, for which UB-based analyses are not necessary.
In addition, I thought template (meta)programming relied very heavily on compile-time knowledge, which seems to obviate the need for UB-based analyses in many cases.
I'm not particularly experienced, though, so maybe there's a gaping hole I'm missing.
But C code is often heavily inlined, too, and similarly pruned.
The submitter may have left out "development" to fit the title in the character limit.
Leave these optimizations for the C++ folks. C doesn't need all of that; just leave it as the "high level assembler" that it was in the old days, where if I write an instruction I can picture the assembler output in my mind.
In my view, optimizers should not change the code's semantics. Unfortunately, with gcc it's impossible to rely on optimizations, so the only safe option is to turn them off entirely (-O0).
Or even define the behavior and get your compiler writer to implement it.
ps: If I index past the end of the array... what behaviour are you going to define?
Yes, it does. In fact C needs it more, because of the "for (int i = 0; i < n; i++)" idiom. At least idiomatic C++ code uses iterators.
I know people recommend Rust for this kind of thing, but Rust really isn't appropriate in a lot of cases, especially when dealing with microcontrollers not supported by llvm (e.g. PIC and 8051, off the top of my head).
This may be changing, but I was also under the impression that Rust can't easily produce as small binaries as C can.
We'll wait...
The problem is that C is sufficiently primitive that optimizing it is effectively trying to infer structure backwards, because the language doesn't specify it.
For-loops are a great example. Why are we even discussing a "loop index"? Because we need to iterate over an array, string, etc., and "length" isn't something stored with the data structure, so the compiler can't do the loop for you.
The real question is probably more along the lines of "Should the C standard start pinning things down more than it does?"
The answer is probably yes, but it's not straightforward to do. Look at just how much verbiage it took to create a decent memory model. And, still, using atomics in C (as opposed to C++) is brutal and the compiler can't help you very much.
My favourite is how "x + 1 < x" is optimized away for signed ...