Everything in C is undefined behavior

(blog.habets.se)

505 points · lycopodiopsida · 1mo ago · 718 comments

277 comments · 73 top-level

muvlon 1mo ago · 36 in thread

Yes there is tons of surprising and weird UB in C, but this article doesn't do a great job of showcasing it. It barely scratches the surface.

Here's a way weirder example:

  volatile int x = 5;
  printf("%d in hex is 0x%x.\n", x, x);

This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.

So in common parlance, a "data race" is any concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!

thomashabets2 1mo ago

Author here.

> It barely scratches the surface.

I agree. The point of the post is not to enumerate and explain the implications of all 283 uses of the word "undefined" in the standard. Nor enumerate all the things that are undefined by omission.

The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

The (one!) exploitable flaw found by Mythos in OpenBSD was an impressive endorsement of the OpenBSD developers, and yet as the post says, I pointed it at the simplest of their code and found a heap of UB.

Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned error? (not reported) I can't imagine an architecture or compiler where it would be, no.
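A minimal sketch of the ordering at issue, assuming POSIX. `child_exit_code` is a hypothetical helper for illustration, not the `find` source; the point is only that `status` is read after `waitpid()` has reported success, never before.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then return
 * the exit code observed by the parent, or -1 on any failure. */
int child_exit_code(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;                  /* fork failed */
    if (pid == 0)
        _exit(code);                /* child: exit immediately */

    int status;                     /* not written until waitpid() succeeds */
    if (waitpid(pid, &status, 0) == -1)
        return -1;                  /* error: status was never set, do not read it */

    /* Only now is it safe to inspect status. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `child_exit_code(7)` should hand back 7 on a system where `fork()` succeeds.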

FTA:

> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C and C++ code has UB.

muvlon 1mo ago

Fair enough!

> And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

And I 100% agree. UB is way overused by these standards for how dangerous it is, and as a consequence using C (and C++) for anything nontrivial amounts to navigating a minefield.

lelanthran 1mo ago

> The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

What are you talking about? UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen". It was "If you do this, that will happen".

saghm 1mo ago

> if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL

It's fair to blame the programmer for the choice of programming in a language like this, if it was in fact their choice. As you've so eloquently put it, choosing those languages is essentially equivalent to choosing UB, so starting a new project with one of them is 100% blameworthy when the UB is inevitably found.

tialaramex 1mo ago

Volatile is a type system hack. They should have done a more principled fix, and certainly modern languages should not act as though "C did it" makes it a good idea.

The reason for the hack is that very early C compilers just always spill, so you can write MMIO driver code by setting a pointer to point at the MMIO hardware and it actually works because every time you change x the CPU instruction performs a memory write.

Once C compilers got some basic optimisations, that obvious "clever" trick stopped working because the compiler can see that we're just modifying x over and over and over, so it doesn't spill x from a register and the driver doesn't work properly. C's "volatile" keyword is a hack saying "OK compiler, forget that optimisation", which was presumably a few minutes' work to implement, whereas the correct fix, providing MMIO intrinsics in the associated library, was a lot of work.

Why should you want intrinsics here? Intrinsics let you actually spell out what's possible and what isn't. On some targets we can actually do a 1-byte, 2-byte and 4-byte write; those are distinct operations and the hardware knows. So, e.g., maybe some device expects a 4-byte RGBA write, and if you emit four 1-byte writes that's very confusing and maybe it doesn't work, so don't do that. On some targets bit-level writes are available: you can say OK, MMIO write to bit 4 of address 0x1234, and it will write a single bit. If you only have volatile there's no way to know what happens or what it means.
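A rough sketch of what width-explicit MMIO accessors could look like. The names `mmio_write8`, `mmio_write32` and `mmio_read32` are hypothetical, and this version still leans on volatile underneath; a real intrinsic would presumably map to a specific instruction per target. It only illustrates putting the access width in the call rather than in the variable's type.

```c
#include <stdint.h>

/* Hypothetical helpers: the access width is part of the function name,
 * so the caller states explicitly whether a 1-byte or a 4-byte store
 * is intended. */
static inline void mmio_write8(volatile void *addr, uint8_t value)
{
    *(volatile uint8_t *)addr = value;
}

static inline void mmio_write32(volatile void *addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;
}

static inline uint32_t mmio_read32(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}
```

For experimenting on a desktop, an ordinary `uint32_t` variable can stand in for the device register.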

HarHarVeryFunny 1mo ago

I agree that marking the read/write as special rather than the variable itself would be nice, although it would also be nice if C/C++ was more consistent in the way things like this are done. Maybe given std::atomic and std::mutex as template/library features, supported by compiler intrinsics, it would be nice to have "volatile" supported in a similar way.

As a nitpick, I don't think this is a correct use of "spill". Register spilling refers to when a compiler's code generator runs out of registers and needs to store variables in memory instead. In the MMIO case you are reading/writing via a pointer, so this is unrelated to registers and spilling behavior.

MobiusHorizons 1mo ago

By MMIO semantics do you mean explicit load and store instructions? I’ve never felt that pointer reads or writes were lacking descriptiveness here. I would argue the only surprising thing is that they might be optimized out (which is what volatile prevents).

Volatile on a non-pointer value is not for MMIO, though; that’s typically for concurrency, like with interrupts.

rcxdude 1mo ago

Yeah, it's also cleaner to be able to mark particular reads and writes as having side effects as opposed to having it be a property of the variable.

tardedmeme 1mo ago

The Linux kernel uses READ_ONCE and WRITE_ONCE, which look like actual function calls, which is very sensible.
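For readers who haven't seen the idea, here is a simplified sketch of per-access marking. These `_SKETCH` macros are stand-ins written for this example and are not the kernel's actual definitions; `__typeof__` is a GCC/Clang extension, and `shared_flag`, `poll_flag` and `set_flag` are hypothetical names.

```c
/* Simplified stand-ins, NOT the Linux kernel's real macros.
 * The variable itself stays non-volatile; only the marked access
 * goes through a volatile-qualified lvalue. */
#define READ_ONCE_SKETCH(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int shared_flag;   /* ordinary, non-volatile variable */

int poll_flag(void)
{
    /* This particular read is performed as a volatile access. */
    return READ_ONCE_SKETCH(shared_flag);
}

void set_flag(int v)
{
    /* This particular write is performed as a volatile access. */
    WRITE_ONCE_SKETCH(shared_flag, v);
}
```

The side effect is attached to the individual read or write, which is the property the comments above are asking for.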

saagarjha 1mo ago

> The reason for the hack is that very early C compilers just always spill, so you can write MMIO driver code by setting a pointer to point at the MMIO hardware and it actually works because every time you change x the CPU instruction performs a memory write.

Source?

pron 1mo ago

> In C, we can have a data race on a single thread and without any writes!

You need to distinguish between a UB and a race, and I think that's something that discussions of UB miss. Take any C program and compile it. Then disassemble it. You end up with an Assembly program that doesn't have any UB, because Assembly doesn't have UB.

UB is a property of a source program, not the executable. It means that the spec for the language in which the source is written doesn't assign it any meaning. But the executable that's the result of compiling the program does have a meaning assigned to it by the machine's spec, as machine code doesn't have UB.

A race is a property of the behaviour of a program. So it's true to say that your C program has UB, but the executable won't actually have a race. Of course, a C compiler can compile a program with UB in any way it likes, so it's possible it will introduce a race, but if it chooses to compile the program in a way that doesn't introduce another thread, then there won't be a race.

muvlon 1mo ago

I specifically said data race, which is a known term of art and a type of language-level UB. It is separate from the races you're thinking about. Just like signed integer overflow or use-after-free, the compiler is allowed to assume data races never happen.

pjmlp 1mo ago

The problem is that in the quest to win benchmark games, compilers started to take advantage of UB for all kinds of possible optimizations, which is almost as deterministic as LLM generated code, across compiler version updates.

simonask 1mo ago

I think the article's point is that you don't actually have to get weird at all to run into UB.

Lots of people mistakenly think that C and C++ are "really flexible" because they let you do "what you want". The truth of the matter is that almost every fancy, powerful thing you think you can do is an absolute minefield of UB.

kzrdude 1mo ago

My go-to example of "UB is everywhere" is this one:

    int increment(int x) {
        return x + 1;
    }

Which is UB for certain values of x.
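The value in question is `INT_MAX`, where `x + 1` is a signed overflow. A small sketch of a guarded variant, with `increment_checked` being a hypothetical name introduced here:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical variant of increment(): refuses the one input for which
 * x + 1 would overflow, instead of evaluating the addition. */
bool increment_checked(int x, int *out)
{
    if (x == INT_MAX)
        return false;   /* x + 1 would be signed overflow, i.e. UB */
    *out = x + 1;
    return true;
}
```

The check has to come before the addition; testing the result afterwards would already have executed the undefined operation.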

saghm 1mo ago

I've long said that the value a programming language offers is as much about what it doesn't allow as what it does allow. Efficiency aside, most useful programs could be written in most languages, but there are an infinite number of programs you could write that aren't particularly useful. Ruling out the programs you might accidentally write that resemble the one you intended is a pretty useful feature of a language, and it's a metric that C and C++ rate quite poorly on IMO.

jstimpfle 1mo ago

I would agree that C is "really flexible", but I would say it's primarily flexible because it lets you cast say from a void pointer to a typed pointer without requiring much boilerplate. It's also flexible because it lets you control memory layout and resource management patterns quite closely.

If you want to be standards correct, yes you have to know the standard well. True. And you can always slip, and learn another gotcha. Also true. But it's still extremely flexible.

3form 1mo ago

At which point it feels like some sort of high-level assembly-like language, which is simple enough to compile efficiently and stay crossplatform, with some primitives for calls, jumps, etc. could find a nice niche.

Maybe this already exists, even? A stripped down version of C? A more advanced LLVM IR? I feel like this is a problem that could use a resolution, just maybe not with enough of a scale for anyone to bother, vs. learning C, assembly of given architecture, or one of the new and fancy compiled languages.

HarHarVeryFunny 1mo ago

> In C, we can have a data race on a single thread and without any writes!

Well, sure, that's what volatile means - that the value may be changed by something else. If it's a global variable then the something else might be an interrupt or signal handler, not just another thread. If it's a pointer to something (i.e. read from a specific address) then that could be a hardware device register whose value is changing.

The concept of a volatile variable isn't the problem - any language that is going to support writing interrupt routines and memory mapped I/O needs to have some way of telling the compiler "don't optimize this out" since reading from the same hardware device register twice isn't like reading from the same memory location twice.

I think the problem here is more that not all of the interactions between language features and restrictions have been fully thought out. It's pretty stupid to be able to explicitly tell the language "this value can change at any time", and for it to still consider certain uses of that value as UB since it can change at any time! There should have been a carve out in the "unsequenced side effect" definitions for volatile variables.

vlovich123 1mo ago

> There should have been a carve out in the "unsequenced side effect" definitions for volatile variables.

As noted, there are almost 300 uses of the word undefined in the standard. Believing that it’s possible to correctly define all the necessary carve outs and have the compiler implement them successfully is about as logical as believing UB is humanly avoidable in written code.

mananaysiempre 1mo ago

And it makes sense as long as you allow the concept of unsequenced operations at all (admittedly it’s somewhat rare; e.g. in Scheme such things are defined to still occur in sequence, but which specific sequence is unspecified and potentially different each time). The “volatile” annotation marks your variable as being an MMIO register or something of that nature, something that could change at any point for reasons outside of the compiler’s control. Naturally, this means all of the hazards of concurrent modification are potentially there.

That said, your “common parlance” definition of “data race” is not the definition used by the C standard, so your last sentence is at best misleading in a discussion of standard C.

> The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

(Here “conflicting” and “happens before” are defined in the preceding text.)

tsimionescu 1mo ago

Your first paragraph makes it sound as if the compiler will actually generate two reads of the value of some register, which might lead to unexpected effects at runtime for certain special registers.

However, this is not at all what UB means in C (or C++). The compiler is free to optimize away the entire block of code where this printf() sequence occurs, by the logic that it would be UB if the program were to ever reach it.

For example, the following program:

  int y = rand();
  if (y != 8) {
    volatile int x;
    printf("%d: %d", x, x);
  } else {
    printf("y is 8");
  }

Can be optimized to always print "y is 8" by a perfectly standard compliant compiler.

rocketrascal 1mo ago

Are you sure?

>unsequenced side effects on the same scalar object are UB

>6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.

Read 5.1.2.4.3:

"If A is not sequenced before or after B, then A and B are unsequenced."

"Evaluations A and B are indeterminately sequenced when A is sequenced either before or after B, but it is unspecified which."

With a footnote saying this:

"9)The executions of unsequenced evaluations can interleave. Indeterminately sequenced evaluations cannot interleave, but can be executed in any order."

I.e., the standard makes a distinction between "unsequenced" and "indeterminately sequenced". And with no mention of side effects on "indeterminately sequenced" evaluations being UB, it leads me to conclude that your example is not UB.
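To illustrate the distinction the quoted footnote draws (not to settle whether the printf example itself is UB), here is a small sketch using two function calls in one expression, which the standard treats as indeterminately sequenced with respect to each other. `next_id` and `sum_two_ids` are hypothetical names.

```c
/* Counter bumped on every call, so each call has a visible side effect. */
static int calls;

static int next_id(void)
{
    return ++calls;
}

int sum_two_ids(void)
{
    /* The two calls are indeterminately sequenced: either may run first,
     * but they cannot interleave. Starting from calls == 0, one call
     * returns 1 and the other returns 2, so the sum is 3 in both orders. */
    return next_id() + next_id();
}

/* By contrast, two unsequenced modifications of one scalar, e.g.
 *     i = i++ + i++;
 * are undefined behavior and are deliberately not compiled here. */
```

Both snippets modify the same object twice in one expression; only the second form falls under the "unsequenced side effects" rule.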

berti 1mo ago

Reading a register from a microcontroller peripheral may well reset it as an example of a possible side-effect here, and that's exactly the kind of thing you use volatile for.

smsm42 1mo ago

I think the C standard doesn't do itself any favors by using "undefined behavior" to signify both "anything can happen, including erasing all your data and setting your data center on fire" and "one of a very small and well-defined set of things will happen, but we cannot commit to which one". The latter is not exactly great, but significantly less dangerous than the former.

zahlman 1mo ago

> Here's a way weirder example:

Well, yes; but when the C standard authors wrote like this, they surely had in mind "the reads could be in either order, therefore the output could display the polled values in either order". Not C++ nasal demons.

And yeah, being able to say "reading is a side effect" is important when for example you interact with certain memory-mapped devices.

sethev 1mo ago

Yes, there is a data race there. The value of a volatile can be changed by something outside the current thread. That’s what volatile means and why it exists.

Edit: thread=thread of execution. I’m not making a point about thread safety within a program.

mananaysiempre 1mo ago

Not from the standard’s point of view. The traditional (in some circles) use of volatile for atomic variables was not sanctioned by the C11/C++11 thread model; if you want an atomic, write atomic, not volatile, or be aware of your dependency on a compiler (like MSVC) that explicitly amends the language definition so as to allow cross-thread access to volatile variables.
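A minimal sketch of "write atomic, not volatile" using C11 `<stdatomic.h>`. The flag and counter names are hypothetical, and the sketch only shows the declarations and operations; it does not spawn threads.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cross-thread state. Static atomics start out zeroed. */
static atomic_bool ready;
static atomic_int  counter;

/* Stores and loads use the default sequentially consistent ordering. */
void publish(void)      { atomic_store(&ready, true); }
bool is_published(void) { return atomic_load(&ready); }

/* Atomic read-modify-write: returns the value after the increment. */
int bump(void)          { return atomic_fetch_add(&counter, 1) + 1; }
```

Unlike a plain or volatile `int`, concurrent calls to these functions from several threads do not constitute a data race in the C11 sense.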

trissylegs 1mo ago

It can also represent a register where reading has an effect. Reading a memory-mapped register can have side effects. For example, a memory-mapped I/O read on a UART will fetch the next byte to be read.

frollogaston 1mo ago

Was going to say the same thing until I saw this comment. volatile is defined the way I'd expect, plus it's a strange code example.

jstimpfle 1mo ago

Not sure why you're being downvoted. That's completely right. The example is silly. The code is obviously bad, doesn't matter if it's UB or not.

I'm also not convinced (yet) that the example really is UB: I agree reading a volatile is "a side effect" in some sense, and GP cited a paragraph that says just that. But GP doesn't clearly quote that it's a side effect on the object (or how a side effect on an object is defined). Reading an object doesn't mutate it after all.

But whatever the language lawyers think, the code is obviously broken, with an obvious fix, so I'm not so interested in what its semantics should be. Here is the fix:

    volatile int x;
    // ...
    int val = x;  // volatile read
    printf("%x %d\n", val, val);

rramadass 1mo ago

This has got nothing to do with data races etc. but everything to do with "Sequence Points and Single Update Rule" which is well described in C language specification.

See my comment here - https://news.ycombinator.com/item?id=48205760

RobotToaster 1mo ago

With volatile it could be changed by an interrupt service routine between reads, so it makes sense.

nomel 1mo ago

Or, it could be hardware that has a "clear flag on read" type behavior.

drysine 1mo ago

What's weird about it?

If you are using volatile you are reading from a device port mapped to that address.

Since C doesn't mandate in which order function arguments are evaluated, you don't know which argument will be read from port first.

How can that be anything but UB?

imtringued 1mo ago

Memory-mapped I/O sends a read request to a peripheral, which is allowed to have side effects in the background and to return different values on successive reads. You can think of it as a synchronous RPC request.

The lack of argument sequencing feels utterly petty however.

parasti 1mo ago · 25 in thread

I have never in my 20 years of writing C heard so much about undefined behavior as I have in the past 6 months on Hacker News. It has never entered the conversation. You write the code. If it doesn't work, you debug it and apply a fix or a workaround. Why does the idea of undefined behavior in C get to the front page so consistently?

summa_tech 1mo ago

Hacker News is still skewed towards people interested in programming languages (as opposed to actually programming). Probably some sort of Y-combinator Lisp heritage. There's also a persistent minority of CS grads who think that developing / using new programming languages is the most fascinating thing in the world, and some of them hold on to that thought.

It's reasonable that such people would also be interested in design aspects of languages, and UB in C is in that field. Though I would argue that a lot of it was originally accommodating old CPU architectures without compromising performance too badly, and about as much a "design choice" as wheels being round...

defgeneric 1mo ago

There was also a period around the mid-2010s where I had the strong impression that lots of younger ambitious devs were fanatically promoting rust against C's undefined behavior mostly because it gave them a way to differentiate themselves from older seniors within organizations. (And I say this not as an old C diehard, but as someone who watched more than one colleague position himself as the 'rust guy'.)

simonask 1mo ago

Excuse me, what? I was writing both C and C++ 20 years ago, and UB was a huge part of the conversation (and the curriculum) back then as well.

There were a few high-profile "scandals" around GCC 3.2 (IIRC) because the compiler finally started much more aggressively using UB in optimizations, which was a reason that lots of people stayed on GCC 2.95 for a very long time. GCC 3.2 came out in 2002.

parasti 1mo ago

Started in 2005. Never ever did anyone complain about UB in my years of writing C code and patching other people's C code. I knew it exists - as a spec quirk. (Admittedly, never wrote a compiler and never used anything except gcc and clang.)

keyle 1mo ago

Computers used to be cool; now they're dangerous.

Every company keeps harping on about safety and being exposed (being in the news), so the narrative against 'unsafe' is up the wazoo.

The new world is basically a bunch of city dwellers who haven't seen raw nature and you show them a lawn mower, they freak out. Blades that spin?!?!?! Madness!!

pjc50 1mo ago

If everything is going to be dependent on computers, it's probably important that they work and remain under their owner's control rather than whichever NK or Chinese hacker group gets to them first.

Can't talk about C without CVE.

Etheryte 1mo ago

Because the production environment might be a completely different architecture, these details matter a lot. Works on my machine is not useful if your actual target is a small embedded system on top of a cell tower in the middle of nowhere. Granted, most people don't work on stuff like that, I imagine the vast majority of devs here are web developers, but even still it's an interesting discussion even if you haven't run into it yourself. Maybe even more so in that case.

spacedcowboy 1mo ago

Um, as an embedded developer, you don't develop the code to run on your machine, you develop it to run on the same target as you expect to deploy to, sitting on your desk next to you.

I have lots of my code running day-in, day-out on literally hundreds of millions of machines. The approach to "getting it working" is exactly OP's.

I'll admit to being pretty defensive and anal in checking values and return-codes (more so than most, I suspect), and I'm a firm believer in KISS principles in software engineering ("solving hard problems with complicated code is easy, solving them with simple, understandable algorithms is the hard bit") but generally there's no real difference in approach to the code I write to work on my workstation, and the code I write to work in the field.

bregma 1mo ago

    There are more things in heaven and earth, Horatio,
    Than are dreamt of in your philosophy

You've probably been churning out possibly malformed code for years. Now you're becoming aware of your shortcomings. This is usually considered the transition from intermediate- to senior-level programmer.

rramadass 1mo ago

Because most of the people who post/write these articles do not actually know the C language specification nor understand its design.

Understanding three important concepts properly in C allows one to easily identify what can/cannot result in UB viz. 1) Expressions 2) Statements 3) Sequence Points and "Single Update Rule". It is not that hard at all.

I wrote about it here with links to further reading provided - https://news.ycombinator.com/item?id=48144734

Pannoniae 1mo ago

Exactly, you write for your target, not some imaginary spec. The spec is only useful insofar as it predicts roughly what your target does; it's not normative.

Compilers might have bugs where the spec is supposed to work but it doesn't, and many extensions without standard equivalents, or implementation-specific behaviour where undefined things in the standard do get assigned a meaningful outcome.

SomeoneOnTheWeb 1mo ago

I have the opposite experience: so many subtle bugs that bite you only in specific scenarios that I can't count them.

sethev 1mo ago

I wonder if it’s just the colorful metaphors and an opportunity to bring out examples of surprising behavior. Plus it’s a topic that can always stir up debates.

dminik 1mo ago

If only it was that easy: https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...

The real answer is that proponents of languages like C seem to completely disregard the dangers/difficulty of hitting/difficulty of fixing UB. Proponents of languages like Rust overstate it instead. Pointless wars/drama is fun to read and gets clicks.

AndriyKunitsyn 1mo ago

So, you never iterated past the end of an array, you never used memory after a free(), you never tried doing i = ++i + ++i; ?

aldanor 1mo ago

If there's no UBs then what will we programmers do, there won't be enough to debug and fix?

benj111 1mo ago

1. It's been talked about for much longer than that.

2. You don't really appreciate the issue. Signed integer overflow is undefined. If you check for that overflow after the fact, the compiler can pretend, and demonstrably has pretended, that the overflow can't happen and optimise away your overflow check.

You may not even come across that failure mode to know to 'fix' it. And good luck finding the issue unless you know about UB and what the compiler can and will do in such situations.
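A short sketch of the difference between checking after the fact and checking beforehand. `add_would_overflow` is a hypothetical helper; the broken pattern is shown only in a comment so that nothing undefined is ever executed.

```c
#include <limits.h>
#include <stdbool.h>

/* Broken pattern: the test relies on the overflow having already happened,
 * so a compiler may assume it is always false and delete it:
 *
 *     int sum = a + b;
 *     if (a > 0 && b > 0 && sum < 0) { handle_overflow(); }
 */

/* Checks the operands BEFORE adding, using only subtractions that
 * cannot themselves overflow. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* a + b would exceed INT_MAX */
    return a < INT_MIN - b;       /* a + b would fall below INT_MIN */
}
```

A caller would test `add_would_overflow(a, b)` first and perform `a + b` only when it returns false.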

jakobnissen 1mo ago

I would guess that the continued success of Rust has shown that we don’t have to live with the user-hostility of C in order to write system programs. Therefore, people are understandably growing less and less patient with C and its unending bullshit.

Although I haven’t noticed a spike the last 6 months, just a slowly increasing realization that C isn’t fit for humans and should go the way of asbestos: Don’t use it for anything new, and remove it where it already exists, unless doing so would be too expensive or disruptive.

benj111 1mo ago

I don't think C is hostile. C has UB for good reason. The problem is UB has been hijacked by the compiler writers for performance gains.

Personally I like C because you should have a good idea of what it's going to do. Other languages feel like a black box, and I start having to fight them far too often. But I say that as a hacker of low level stuff, not as someone who's paid and working on higher level stuff, so that is probably a niche view.

hedora 1mo ago

There was a similar rush of articles like this a few years ago.

tl;dr: C defines language semantics, and leaves some behavior undefined. Each system that C is ported to has the ability to define the behavior however it wants.

This blows the mind of PL folks every decade or so.

It’s cool that we have portable methods and formal language semantics for stuff like memory fences and atomics now, but that sort of thing worked fine in C back in 1970 (or else unix would not have worked). You just needed to read the target machine’s manual when porting stuff.

The modern version is arguably better, but also arguably worse. Does anyone else remember when the JVM got this stuff wrong, making safe multithreaded code impossible, and then later had to break compatibility with the language spec?

You could claim that we can’t trust hardware folks to get instruction semantics right (this is demonstrably true), but duplicating and slightly modifying the specs in your language spec doesn’t actually fix the underlying hardware bugs.

Yeah, getting old… I’ll go find a cloud to yell at.

account42 1mo ago

There are a lot of Rust/whatever hipsters here that have defined their whole identity around hating C and C++.

virtualritz 1mo ago

Like the author of the article, I have been writing C/C++ for 30 years. Mostly close-to-the-metal code around computer graphics. Actually: wrote.

After switching to Rust five years ago I agree with all the Rust hipsters as far as disliking those languages go.

I just don't talk about it a lot. If every Rust person I know that was a C/C++ developer before was as outspoken about what they think of the latter, you'd see that these people are a majority.

We're just old hands who like to use stuff that works. And most of us don't get attached to code or languages.

It's also difficult to admit to yourself that you were never in command of a language as far as UB/other footguns go, as much as you thought. Or ever, for your entire career. For me that self-realization about C/C++ (enabled by Rust) was a turning point.

Lately you can read about the dichotomy re. AI use.

I.e. developers who define themselves through what they build/ideas are embracing LLMs for what they can do.

I.e.: I am what I build.

Whereas developers for whom software engineering is a craft that defines them hate them openly.

I.e.: I am how I build.

Now this seems to suggest to me that maybe Rust developers who openly hate C/C++ squarely belong to the latter group whereas the silent ones belong to the former. It's builders vs programmers. Just different world views.

Also, you can dislike something and still not speak about it. Because you decided to not care.

pjmlp 1mo ago

As a C++ hipster since 1992, the problem is really C and any language that includes its semantics as subsets.

Just like TypeScript can't get rid of JavaScript WATs.

hnarn 1mo ago

Ironically, by stereotyping “Rust hipsters” you are painting yourself out as a stereotype as well. Knee-jerk comments like yours add nothing to the discussion. Rust exists for a reason, it solves real problems, but it’s not suitable for everything. These are indisputable facts and by discarding every mention of Rust as coming from “hipsters” with no understanding, you are doing the exact same thing that you would accuse them of. “Use Rust for everything” and “Rust is useless for everything” are equally vapid and meaningless statements designed for nothing but trolling and showing ignorance.

kzrdude 1mo ago

After the rise of Rust, it has gained more visibility? But some people were interested in C in this way long ago too, I used to hang out in some godforsaken irc channel where people competed in out-pedanticing each other over the C standard.

I trust your historical C usage was more productive than that..

quelsolaar 1mo ago · 17 in thread

The 5 stages of learning about UB in C:

- Denial: "I know what signed overflow does on my machine."

- Anger: "This compiler is trash! Why doesn't it just do what I say!?"

- Bargaining: "I'm submitting this proposal to WG14 to fix C..."

- Depression: "Can you rely on C code for anything?"

- Acceptance: "Just don't write UB."

matheusmoreira 1mo ago

What stage is the "just make the compiler define the undefined" stage?

Unaligned access? Packed structs. Compiler will magically generate the correct code, as if it had always known how to do it right all along! Because it has, in fact, always known how to do it right. It just didn't.

Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so. Alternatively, just disable it straight up: -fno-strict-aliasing. Enjoy reinterpreting memory as you see fit. You might hit some sharp edges here and there but they sure as hell aren't gonna be coming from the compiler.

Overflow? Just make it defined: -fwrapv. Replace +, -, * with __builtin_*_overflow while you're at it, and you even get explicit error checking for free. Nice functional interface. Generates efficient code too.
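As a sketch of the builtin route: `__builtin_add_overflow` is a GCC/Clang extension (not standard C) that returns true when the exact sum does not fit in the destination. The wrapper names `checked_add` and `saturating_add` are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true and stores the sum in *out if a + b fits in an int;
 * returns false if it would overflow. */
bool checked_add(int a, int b, int *out)
{
    return !__builtin_add_overflow(a, b, out);
}

/* Clamps to INT_MAX / INT_MIN instead of overflowing. */
int saturating_add(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        return b > 0 ? INT_MAX : INT_MIN;   /* overflow direction follows b's sign */
    return sum;
}
```

Neither function ever evaluates an overflowing `a + b` in C itself, so there is no UB for the optimizer to exploit.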

The "acceptance" stage is really "nobody sane actually cares about the C standard". The standard is garbage, only the compilers matter. And it turns out that compilers have plenty of extremely useful functions that let you sidestep most if not all of this. People just don't use them because they want to write "portable" "standard" C. The real acceptance is to break out of that mindset.

Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.

quelsolaar 1mo ago

A lot of the central UB cannot be defined, because it relies on detection. In order to have well-defined behaviour (by the standard or the compiler), the implementation needs to first detect that the behaviour is triggered, and this is often very tricky or expensive. It's easy to define that a program should halt if it writes outside an array, but detecting whether it does can be both slow and hard to implement. There are implementations that do, but they are rarely used outside of debugging.

A better way to think about UB is as a contract between developer and implementation, so that the implementations can more easily reason about the code. How would you optimize:

(x * 2) / 2

An optimizer can optimize this out for a signed integer, because it doesn't have to consider overflow, but with an unsigned integer it cannot. UB is a big reason why C is the most power efficient high level language.

gpderetta1mo ago

> Unaligned access? Packed structs.

Packed structs are dangerous. You can do unaligned accesses through a packed type, but once you take the address of your misaligned int field, then you are back into UB territory. Very annoying in C++ when you try to pass a misaligned field through what happens to be generic code that takes a const reference, as it will trigger a compiler warning. Unary operator+ is your friend.

Georgelemental1mo ago

> Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so.

It does say so, actually, since C99 TC3 (DR 283).

lelanthran1mo ago

> What stage is the "just make the compiler define the undefined" stage?

It can be left as implementation defined, which means that the compiler can't simply do arbitrary things, it needs to document what it would do.

Take, for example, signed-integer overflow: currently a compiler can simply refuse to emit the code in one spot while emitting it in another spot in the same compilation unit! Making it IB means that the compiler vendor will be forced to define what happens when a signed integer overflows, rather than just saying, as they do now, "you cannot do that, and if you do we can ignore it, correct it, replace it or simply travel back in time and corrupt your program".
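
A common illustration of this, sketched with hypothetical function names: an overflow test that itself overflows may be folded away, while a test done before the arithmetic stays defined.

```c
#include <limits.h>
#include <stdbool.h>

/* Overflow test written in terms of the overflow itself. x + 1 is UB when
   x == INT_MAX, so an optimizer may fold this comparison to false. */
static bool increment_overflows_ub(int x)
{
    return x + 1 < x;
}

/* Same question asked before doing the arithmetic: fully defined. */
static bool increment_overflows(int x)
{
    return x == INT_MAX;
}
```

Only the second form is guaranteed to return true for `INT_MAX` regardless of optimization level.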

> Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.

Same here; I built a few non-trivial things that passed the first attempt at tooling (valgrind, UBsan with tests, fuzzing, etc) with no UB issues found.

pjmlp1mo ago

Lost in the submission attempt to WG14.

thomashabets21mo ago

Author here.

> -Acceptance: "Just dont write UB."

The point of my article is that this is not possible. This cannot be our end state, as long as humans are the ones writing the code. No human can avoid writing UB in C/C++.

jart1mo ago

It's honestly not that difficult to be rigorous. The things you mentioned in the blog post are pretty obvious forms of degenerate practices once you get used to seeing them. The best way to make your argument would be to bring up pointer overflow being UB.

What's great about undefined behavior is that the C language doesn't require you to care. You can play fast and loose as much as you want. You can even use implicit types and yolo your app, writing C that more closely resembles JavaScript, just like how traditional K&R C devs did back in the day under an ILP32 model. Then you add the rigor later if you care about it.

For most stuff, like an experiment, we obviously don't care, but when I do, I can usually one-shot a file without any UB (which I check by reading the assembly output after building it with UBSan). There's just one thing that I usually can't eliminate, which is the compiler generating code that checks for pointer overflow, because that's just such a ridiculous concept on modern machines which have a 56 bit address space. Maybe it mattered when coding for platforms like i8086. I've seen almost no code that cares about this. I have to sometimes, in my C library. It's important that functions like memchr() for example don't say `for (char *p = data, *e = data + size; p<e; ...` and instead say `for (size_t i = 0; i < size; ++i) ...data[i]...`. But these are just the skills you get with mastery, which is what makes it fun.

Oh speaking of which, another fun thing everyone misses is the pitfalls of vectorization. You have to venture off into UB land in order to get better performance. But readahead can get you into trouble if you're trying to scan something like a string that's at the end of a memory page, where the subsequent page isn't mapped. My other favorite thing is designing code in such a way that the stack frame of any given function never exceeds 4096 bytes, and using alloca in a bounded way that pokes pages if it must be exceeded.

If you want to have a fun time experiencing why the trickiness of UB rules are the way they are, try writing your own malloc() function that uses shorts and having it be on the stack, so you can have dynamic memory in a signal handler.
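
A minimal sketch of the index-based scan described above. `find_byte` is a hypothetical stand-in for a memchr()-style function, not code from any particular C library:

```c
#include <stddef.h>

/* memchr-style scan that indexes into the buffer instead of computing an
   end pointer (data + size) up front. */
static void *find_byte(const void *data, int c, size_t size)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < size; ++i) {
        if (p[i] == (unsigned char)c)
            return (void *)(p + i);
    }
    return NULL;
}
```

The only pointer arithmetic happens on positions that have already been read, so no out-of-range pointer is ever formed.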

frollogaston1mo ago

"Just don't write UB" sounds like still part of the bargaining stage at best

im3w1l1mo ago

In C, acceptance is "I will write UB and it will eventually lead to something bad happening"

superxpro121mo ago

Just work on embedded devices like I did lol. It's so nice to write software targeting a specific cpu.

Ygg21mo ago

> -Acceptance: "Just dont write UB."

Just switch to a saner language.

And before I get attacked for being a Rust shill, I meant Java :P

The bar is so low it's floating near the center of the Earth.

dns_snek1mo ago

> And before I get attacked for being a Rust shill, I meant Java :P

If all you want is C but less insane then the obvious answer here is Zig.

p2detar1mo ago

> Just switch to a saner language.

And where's the fun in that?

ErroneousBosh1mo ago

Okay, so Java compiles to machine code now?

Because the last time I looked it appeared to need some godawful slow bytecode interpreter that took up thousands of kilobytes of RAM.

17186274401mo ago

> -Denial: "I know what signed overflow does on my machine."

Or you just don't skip the introductory pages that tell you what the language philosophy of C is, and why there is UB. Yes, UB can be a struggle, but the first four steps are entirely unnecessary. It means that you do not actually understand the core concepts of the very same language you are using, which is kinda stupid.

whizzter1mo ago

I think the issue has been that the line between de-jure and de-facto behaviours has shifted over the years as compiler optimizations suddenly began relying on de-jure interpretations of UB to increase performance while ignoring de-facto usage of the language.

When that started happening, people became alarmed (oMG UB iS TeH BAD!), and since some old UB machines still had industry support (of organisations that actually participated in ISO meetings instead of arguing online) there was never any movement on defining de-facto usage as de-jure, and the alarmist position became the default.

Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.

beeforpork1mo ago· 14 in thread

The UB in unaligned pointers is even worse: an unaligned pointer in itself is UB, not only an access to it. So even implicit casting a void*v to an int*i (like 'i=v' in C or 'f(v)' when f() accepts an int*) is UB if the cast pointer is not aligned to int.

It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
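
One way to stay on the defined side is to test alignment before converting, sketched below with hypothetical helper names. Note the assumption: converting a pointer to `uintptr_t` is implementation-defined, and the check relies on the result being the numeric address, as it is on common flat-address-space targets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when p may be converted to int* without violating int's alignment.
   Assumes (uintptr_t)p yields the numeric address. */
static bool is_int_aligned(const void *p)
{
    return (uintptr_t)p % _Alignof(int) == 0;
}

/* Convert only when the result would be correctly aligned; NULL otherwise. */
static const int *as_int_ptr(const void *p)
{
    return is_int_aligned(p) ? (const int *)p : NULL;
}
```

The check runs on the `void*` value, so the misaligned `int*` is never created.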

thomashabets21mo ago

Author here.

> an unaligned pointer in itself is UB

Yup. Per the "Actually, it was UB even before that" section in the post.

> UB is not on the HW, it has nothing to do with crashes or faults

Yeah. I tried to convey this too, but I'm also addressing the people who say "but it's demonstrably fine", by giving examples. Because it's not.

account421mo ago

Which is totally fine and expected for any decent programmer. Casting pointers is clearly here be dragons territory.

simonask1mo ago

Many, many programmers come to C (and C++) with a lower-level understanding that actually gets in the way here. They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.

Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.

array_key_first1mo ago

Yes but casting pointers is virtually required in any non-trivial C program, and frankly even a lot of the trivial ones, because there's no other way to do type erasure or generics. Well, there kind of are now, and there's always been macros, but void * has historically been the predominant way this is done at runtime.

2019841mo ago

>an unaligned pointer in itself is UB, not only an access to it.

Can someone point to where the standard states this?

imtringued1mo ago

The problem with C UB is that originally it meant the compiler has the freedom to map your code to the hardware in spite of machine instructions differing slightly between one another. The same C program may express different behaviour depending on which architecture it is running on.

This type of UB is fine and nobody really complains about hardware differences leading to bugs.

However, over time aggressive readings of UB evolved C into an implicit "Design by Contract" language where the constraints have become invisible. This creates a similar problem to RAII, where the implicit destructor calls are invisible.

When you dereference a pointer in C, the compiler adds an implicit non-nullable constraint to the function signature. When you pass in a possibly nullable pointer into the function, rather than seeing an error that there is no check or assertion, the compiler silently propagates the non-nullable constraint onto the pointer. When the compiler has proven the constraints to be invalid, it marks the function as unreachable. Calls to unreachable functions make the calling function unreachable as well.
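
A small sketch of the pattern, using hypothetical function names. In the first version the dereference comes before the check, so the compiler may treat the check as dead; in the second the order makes the NULL path defined.

```c
#include <stddef.h>

/* Dereference first, check second: the load lets the compiler assume
   p != NULL, so it may delete the check below as unreachable. */
static int first_or_minus_one_risky(const int *p)
{
    int v = *p;
    if (p == NULL)
        return -1;
    return v;
}

/* Check first: the constraint is visible and the NULL path is defined. */
static int first_or_minus_one(const int *p)
{
    if (p == NULL)
        return -1;
    return *p;
}
```

Calling the risky version with NULL is UB no matter what the later check says.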

jcranmer1mo ago

> The problem with C UB is that originally it meant the compiler has the freedom to map your code to the hardware in spite of machine instructions differing slightly between one another. The same C program may express different behaviour depending on which architecture it is running on.

You're conflating undefined behavior with implementation-defined behavior. If it was only to do with what we think of as normal variance between processors, then it would be easy to make it implementation-defined behavior instead.

The differentiating factor of undefined behavior is that there are no constraints on program behavior at that point, and it was introduced to handle cases where processor or compiler behavior cannot be meaningfully constrained. One key class is of course hardware traps: in the presence of compiler optimizations, it is effectively impossible to make any guarantees about program state at the time of a trap (Java tried, and most people agreed they failed); but even without optimizations, there are processors that cannot deliver a trap at a precise point of execution and thus will continue to execute instructions after a trapping instruction.

stilley21mo ago

Does that mean that if I have a struct with #pragma pack(push, 1) I can't use pointers to any members that don't happen to be aligned?

saagarjha1mo ago

This is a non-standard extension, so your compiler may provide stronger guarantees.

tovej1mo ago

But that seems obvious. You can't load an integer from an unaligned address.

It's not only C-level is it. There's no (guarantee across architectures for) machine code for that either.

codeflo1mo ago

> You can't load an integer from an unaligned address.

You can, and the results are machine specific, clearly defined and well-documented. Ancient ARM raises an exception, modern ARM and x86 can do it with a performance penalty. It's only the C or C++ layer that is allowed to translate the code into arbitrary garbage, not the CPU.

matheusmoreira1mo ago

Sure you can. In many architectures it works just fine. Works perfectly in x86_64, for example. It's just a little slower.

mbel1mo ago

Unless your code targets some exotic architecture, like idk x86.

pjc501mo ago

You missed the point: the pointer existing as a value of that type at all is UB, even if you never try to access anything through it and no corresponding machine code is ever emitted.

bestouff1mo ago· 13 in thread

The problem of UB is not really that it may crash in some architecture. The real problem is that the compiler expects UB code to NOT happen, so if you write UB code anyway the compiler (and especially the optimizer) is allowed to translate that to anything that's convenient for its happy path. And sometimes that "anything" can be really unexpected (like removing big chunks of code).

inkysigma1mo ago

One example along this path is that every function must either terminate or have a side effect. I don't think this one has bitten me yet, but I could completely see how you accidentally write some kind of infinite loop or recursion and the function gets deleted. Also, bonus points for tail recursion, so this bug might only show up at a higher optimization level if nothing hit the infinite loop during debug.

marcosdumay1mo ago

There is that famous example where when you write an infinite loop last thing in your main, a function that you never called runs instead.

account421mo ago

Infinite loop without side effects == program stuck and not responding on user input and not outputting anything. That's not something a useful program will ever want to do.

17186274401mo ago

That's only true in C++ though, not in C.

eru1mo ago

Yes, a crash is about the most benign UB: at least it's highly visible.

In worse scenarios, your programme will silently continue with garbage, or format your hard disk or give attackers the key to the kingdom.

17186274401mo ago

Yes, that is a problem, but this is also the most useful feature and reason for UB. People who suggest just defining it or making it unspecified miss that the compiler being able to remove whole parts of a program is the point. When I write code that is UB for certain inputs, it is because I do not intend the program to have any behaviour for these inputs. I do want the compiler to optimize those away or do anything that follows from the behaviour of the other defined cases. It is deeply satisfying to add some conditions triggering log strings and see that they do not occur in the binary, because they can only be reached via UB.

rando12341mo ago

The point in the article that 'It's not about optimisations' really got my attention. I've previously done some work where we wrote an analysis pass under the assumption that it executed last in the transformation pipeline and this was needed for correctness. The assumption was that since no further optimisations happened it was safe. Now I'm not so sure...

account421mo ago

That's a feature, not a problem.

anilakar1mo ago

Removing code paths that the programmer has explicitly laid out in the source code should be made a hard compile error unless the operation has been tagged with an attribute (anyone who wants to add the unsafe keyword to C? ).

Another commenter suggested using LLMs, but I disagree. Having clangd emit warning squiggles for unchecked operations (like signed addition) would be a good start.

flohofwoe1mo ago

> Removing code paths that the programmer has explicitly laid out in the source code should be made a hard compile error unless the operation has been tagged with an attribute (anyone who wants to add the unsafe keyword to C? ).

Dead code elimination is essential for performance, especially when using templates (this is basically what enables the fabled "zero cost abstraction" because complex template code may generate a lot of 'inactive' code which needs to be removed by the optimizer).

The actual issue is that the compiler is free to eliminate code paths after UB, but that's also not trivial to fix (and some optimizations are actually enabled by manually injecting UB, like `__builtin_unreachable()`, which can make a measurable difference in the right places).
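
As a rough example of injecting UB on purpose, assuming GCC or Clang (`__builtin_unreachable()` is their builtin; the enum and function names are hypothetical):

```c
enum shape { SHAPE_SQUARE, SHAPE_TRIANGLE };

/* Number of corners for a known shape. The caller promises s is one of the
   two enumerators; __builtin_unreachable() turns any other value into UB,
   which frees the compiler from emitting a fallback path. */
static int corner_count(enum shape s)
{
    switch (s) {
    case SHAPE_SQUARE:   return 4;
    case SHAPE_TRIANGLE: return 3;
    }
    __builtin_unreachable();
}
```

The trade-off is the one discussed above: passing any other value is now the caller's bug, with no diagnostic at run time.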

amoss1mo ago

Dead code elimination is run multiple times, including after other optimizations. So code that is not initially dead may become dead after propagating other information. Converting dead code into an error condition would make most generic code that is specialized for a particular context illegal.

4gotunameagain1mo ago

This is trickier than it initially seems. Using preprocessor directives to include or exclude swaths of code is a very common thing, and implementing a compiler error as you described would break the building of countless C codebases.

gpderetta1mo ago

Consider:

   enum op_t{ add, mul };
   int exec(op_t op, int a, int b) {
       if(op == add) { return a+b; }
       if(op == mul) { return a*b; }
   }

   c = exec(add, a,b);

Should the compiler be prevented from inlining exec and constant-propagating op and removing the mul branch? What about if a and b are constants and the addition itself is optimized away?
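
For anyone who wants to inspect the generated code, here is a compilable C rendering of the snippet. The enumerator names, the final `return 0`, and the wrapper `add_via_exec` are additions for illustration, not part of the original example:

```c
enum op_t { OP_ADD, OP_MUL };

/* With a constant op at the call site, inlining leaves only one branch. */
static int exec(enum op_t op, int a, int b)
{
    if (op == OP_ADD) { return a + b; }
    if (op == OP_MUL) { return a * b; }
    return 0;   /* added so every path returns a value */
}

static int add_via_exec(int a, int b)
{
    return exec(OP_ADD, a, b);   /* the mul branch is dead at this call */
}
```

Compiling with optimization enabled and reading the assembly for `add_via_exec` should show whether the multiply survives.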

greysphere1mo ago· 9 in thread

The examples aren't really undefined behavior. They are examples that could become UB based on input/circumstances. Which if you are going to be that generous, every function call is UB because it could exceed stack space. Which is basically true in any language (up to the equivalent def of UB in that language). I feel like C has enough actual rough edges that deserve attention that sensationalism like this muddies folks' attention (particularly novices) and can end up doing more harm than good.

guerby1mo ago

Ada 83 has no UB on call stack overflow, from the reference manual :

http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html

"STORAGE_ERROR This exception is raised in any of the following situations: (...) or during the execution of a subprogram call, if storage is not sufficient."

veltas1mo ago

So it's just as useful as when your stack area ends with a page that will segfault on access, or your CPU will raise an interrupt if stack pointer goes beyond a particular address?

It's not safe though because throwing an exception, panicking, etc, is still a denial of service. It's just more deterministic than silently overwriting the heap instead. If the program is critical then you need to be able to statically prove the full size of the stack, which you can do with C and C++ with the right tools and restrictions.

eru1mo ago

That's not true at all.

First, you can define what happens when stack space is exceeded. Second not all programs need an arbitrary amount of stack space, some only need a constant amount that can be calculated ahead of time. (And some languages don't use a stack at all in their implementations.)

Your language could also offer tools to probe how much stack space you have left, and make guarantees based on that. Or they could let you install some handlers for what to do when you run out of stack space.

stevenhuang1mo ago

The examples are unequivocally UB. Full stop.

How to think of this properly is that when you have UB, you are no longer under the auspices of a language standard. Things may work fine for a time, indefinitely even. But what happens instead is you unknowingly become subject to whimsies of your toolchain (swap/upgrade compilers), architecture, or runtime (libc version differences).

You end up building a foundation on quicksand. That's the danger of UB.

flohofwoe1mo ago

> The examples are unequivocally UB. Full stop.

Tbh, already the first example (unaligned pointer access) is bogus and the C standard should be fixed (in the end the list of UB in the C standard is entirely "made up" and should be adapted to modern hardware, a lot of UB was important 30 years ago to allow optimizations on ancient CPUs, but a lot of those hardware restrictions are long gone).

In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not. On most modern CPUs unaligned load/stores are no problem at all (not even a performance penalty unless you straddle a cache line). There's no point in restricting the entire C standard because of the behaviour of a few esoteric CPUs that are stuck in the past.

PS: we also need to stop with the "what if there is a CPU that..." discussions. The C standard should follow the current hardware, and not care about 40 year old CPUs or theoretical future CPU architectures. If esoteric CPUs need to be supported, compilers can do that with non-standard extensions.

greysphere1mo ago

The first example is dereferencing an integer pointer. That is a valid operation. Now if that pointer isn't valid (and being unaligned is one of many reasons it could be invalid) then calling the function with that invalid pointer will be UB.

An honest discussion would be something more like 'dereferencing pointers can lead to UB on invalid pointers. Here are N examples of that. Maybe avoid using pointers. Maybe consider how other languages avoid pointers. Maybe these shouldn't be UB and instead some other class of error.' And then even more honest discussion would present the upsides of having pointers and the upsides of having these errors be UB.

Instead, the article (and your comment) take this valid operation and presents it as invalid. Imagine you're a new programmer, you are just starting to wrap your head around pointers and you stumble across this article. You see the first example and it looks exactly what you would expect a dereference to look like. But the article claims it's wrong, and now you're confused. So you dig into the article more closely and are exposed to all these terms like UB, alignment, type coercion etc and come away more confused and scared and disinclined to understand pointers. This is classic FUD. This is a technique to manipulate, not educate.

Pointers have pros and cons. UB has pros and cons. Let's try to educate people about them.

pjc501mo ago

UB based on input can be an exploit vector.

layer81mo ago

Unvalidated input can always be an exploit vector.

account421mo ago

Yes, this article is pretty much the definition of FUD.

stackghost1mo ago· 9 in thread

Anyone who uses the construction "C/C++" doesn't write modern C++, and probably isn't very familiar with the recent revisions despite TFA's claims of writing it every day for decades.

Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.

No HN comment can possibly cover all the use cases of C++ but in general, unless you have a very good reason not to:

- eschewing boomer loops in favor of ranges

- using RAII with smart pointers

- move semantics

- using STL containers instead of raw arrays

- borrowing using spans and string views

These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)

veltas1mo ago

Although some people, like Bjarne Stroustrup, object to the term C/C++, it's a bit like Richard Stallman objecting to the term "Linux". The fact is it can mean "C or C++", and I wouldn't assume the author thinks they're the same, but they're talking about both of them together in the same sentence. This seems reasonable given this is about undefined behavior, and it's trivial to accidentally write UB-inducing code in C++ even with modern style (although I'd say you should catch most trivial cases with e.g. ubsan, and a lot of bad cases would be avoided with e.g. ranges, so I think the article is exaggerating the issue).

stackghost1mo ago

Well, the author explicitly refers to "C/C++" as one language:

>After all, C/C++ is not a memory safe language.

thomashabets21mo ago

Author here.

In the context of UB discussion, the arguments apply equally to C and C++.

How would you write that?

I entirely agree with all your points that C and C++ are completely different languages at this point. And yet I wanted to write this post about something that is true for both.

rectang1mo ago

> the upshot is you never need to deal with the Rust community

In the end, everything comes down to culture war.

stackghost1mo ago

Perhaps we should rewrite our culture in Rust.

jim334421mo ago

You can write C++ in a way that's similar to C if you want and run into some of the same UB. Normally I don't like the "C/C++" thing, but in this context it makes sense.

SpaceNugget1mo ago

I totally agree that modern C++ is pretty robust if you are both a well seasoned developer and only stick to a very blessed subset of its features and avoid the historical baggage.

However, that's obviously not the point? Ignoring the idea that people can/should just "git gud" and write perfect code in a language with lots of old traps, you can't control how everyone else writes their code, even on your own team once it gets big enough. And there will always be junior devs stumbling into the bear traps of c/c++ (even if the rest of the codebase is all modern c++). So no matter how many great new features get added to C++, until (never) they start taking away the bad ones, the danger inherent to writing in that language doesn't go away.

Also, safe != non-UB. TFA isn't so much about memory safety anyway.

m-schuetz1mo ago

C/C++ is a perfectly fine term for C or C-style C++. The languages can be very close, and personally I prefer C-style C++ miles over some of the half-baked modern nonsense. I mean, I do use C++23 since it has some great additions, but I'm ditching like 90% of the stuff that only adds complexity without much benefit.

flohofwoe1mo ago

"C/C++" is still a useful term for the common C/C++ subset :)

As far as stdlib usage is concerned: that's just your opinion. The stdlib has a lot of footguns and terrible design decisions too, e.g. std::vector pulling in 20k lines of code into each compilation unit is simply bizarre.

Also:

- eschewing boomer loops in favor of ranges

Those "boomer loops" compile infinitely faster than the new ranges stuff (and they are arguably more readable too): https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/

- borrowing using spans and string views

Those are just as unsafe as raw pointers. It's not really "borrowing" when the referenced data can disappear while the "borrow" is active.

dmitrygr1mo ago· 8 in thread

I stopped reading about here:

    > bool parse_packet(const uint8_t* bytes) {
    >   const int* magic_intp = (const int*)bytes;   // UB!

Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.

You might try to argue that uint8_t is not necessarily char. It is true that implementations of C can exist where CHAR_BIT > 8, but those do not define uint8_t (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not > 8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid: it is.)

raphlinus1mo ago

The issue is not type punning (itself a very common source of UB), but the fact that the `bytes` pointer might not be int-aligned. The spec is clear that the creation (not just the dereferencing) of an unaligned pointer is UB, see 6.3.2.3 paragraph 7 of the C11 (draft) spec.
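
A sketch of how the magic-number read could avoid creating the `int*` at all, by copying the bytes into an aligned local. The `PACKET_MAGIC` value and the comparison are hypothetical, since the rest of the article's function is not quoted here:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PACKET_MAGIC 0x12345678   /* hypothetical value, for illustration */

/* Copy the first sizeof(int) bytes into an aligned local instead of casting
   bytes to const int*. Assumes the caller guarantees at least sizeof(int)
   readable bytes. */
static bool parse_packet(const uint8_t *bytes)
{
    int magic;
    memcpy(&magic, bytes, sizeof magic);
    return magic == PACKET_MAGIC;
}
```

Compilers typically turn a fixed-size `memcpy` like this into a plain load on targets where unaligned loads are cheap.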

Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.

flohofwoe1mo ago

> Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.

A "world-class expert in low level programming" knows that unaligned memory accesses are no problem anymore on most modern CPUs, and that this particular UB in the C standard is bogus and needs to be fixed ;)

gritzko1mo ago

C of course is ancient. It remembers the Cambrian explosion of CPU architectures, twelve-bit bytes and everything like that. I wonder if it is possible to codify some pragmatic subset of it that works nicely on currently available CPUs. Cause the author of the piece goes back in time to prove his point (SPARCs and Alphas).

dmitrygr1mo ago

That cast is valid. Spec does not guarantee same bit sequence for resulting pointer and source pointer. But as the cast is explicitly allowed, it is not UB. Compiler is free to round the pointer down. Or up. Or even sideways. All ok. Dereferencing it — indeed not ok. But the cast is explicitly allowed and not UB.

Pointer casts changing pointer bit sequences is common on weird platforms (eg: some TI DSPs, PIC, and aarch64+PAC). And it is valid as per spec. Pointer assignment is not required to be the same as memcpy-ing the pointer unto a pointer to another type.

You misunderstood the spec. No promises are made that that cast copies the pointer bit for bit (and thus creates an invalid pointer). Therefore, your objection to invalid pointers is null and void. :)

thomashabets21mo ago

Author here.

> A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

C23 6.3.2.3p7.

stevenhuang1mo ago

Byte and int have different alignment requirements. It is UB the moment you make such a ptr.

Great way to demonstrate the point of the article.

gritzko1mo ago

That better be marked "historical". At least, Lemire says:

On recent Intel and 64-bit ARM processors, data alignment does not make processing a lot faster. It is a micro-optimization. Data alignment for speed is a myth. // https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

(while in the olden days, a program may crash on unaligned access, esp on RISC)

dmitrygr1mo ago

Without memcpy there is no guarantee that that line produces an invalid pointer

I don’t see what spec part would prohibit that cast from validly compiling to

   BIC r3, r0, #3

Spec only guaranteed round-trip through char* of properly aligned for type pointers. This doesn’t break that.

__0x011mo ago· 7 in thread

> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.

The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

LLM generated code will eventually contain UB.

EDIT: added "eventually"

flohofwoe1mo ago

It would already help a lot if the C and C++ standards started to clean up the list of Undefined Behaviour (e.g. there's a lot of nonsense UB currently in the C standard which could easily become Defined Behaviour - like the "file doesn't end in a new-line character" thing):

https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1

jcranmer1mo ago

The C committee is cleaning up a lot of UB (check https://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_lo... for paper titles like "slaying earthly demons").

But don't misunderstand the goal of that: C and C++ will never get rid of UB. The result of dereferencing an invalid pointer is UB, will always remain UB, and really cannot be anything other than UB.

layer81mo ago

The easy cases like you cite are also those that don’t cause problems in practice. I’m not sure that would help all that much, other than to slightly reduce internet criticism.

thomashabets21mo ago

Author here.

> The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

Yup. But the point of the article is that even expert humans cannot do this alone. And as I wrote, LLM+junior won't suffice either. We need LLM+senior experts.

And it's a problem that we have way more existing UB than expert capacity.

Now, will LLMs and experts both miss UB in some cases? Of course. There's no 100% solution. But LLMs, I claim, will find orders of magnitude more, with a low false-positive rate, than any expert. Even if these expert humans (like in the OpenBSD case for the two bugs I found, one of which was UB) are given more than three decades to do it.

I didn't even use the best model, a complex code target, or much time. I just wanted to choose a target that has a high chance of having very good experts already having audited it.

eru1mo ago

Our LLM-powered coding assistants are pretty good at doing lots of busywork that doesn't require all that much smarts. So they can supervise running our UB checks, like Valgrind, and making the linters happy.

lelanthran1mo ago

> LLM generated code will eventually contain UB.

Yes.

Even in languages other than C (i.e. you will get behaviour that nothing in the input specified).

When LLMs generate code, all languages have UB.

eru1mo ago

That's a bit silly.

UB means literally no restrictions. So if your standard says 'you have to crash with an error message' that's already no longer UB.

amiga3861mo ago· 7 in thread

Can anyone explain why this is undefined behaviour? UBSan calls it "indirect call of a function through a function pointer of the wrong type"

    struct foo {int i;};
    int func(struct foo *x) {return x->i;}
    int main() {
        int (*funcptr)(void*) = (int (*)(void*)) &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }

While this is all kosher per the language lawyers:

    struct foo {int i;};
    int func(void *x) {return ((struct foo *)x)->i;}
    int main() {
        int (*funcptr)(void*) = &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }

jcranmer1mo ago

C23 §6.5.2.2p7

> If the function is defined with a type that is not compatible with the type (of the expression) pointed to by the expression that denotes the called function, the behavior is undefined.

Determining compatible types requires piecing together text from several different paragraphs, but the general notion is "identical type, in a frontend sense", not "same ABI." This means that "const void *" and "void *" are not compatible types, much less "void *" and "struct foo *".

amiga3861mo ago

I get that it's defined that way, but I'd really like to know why.

I can see the value in saying that struct x* isn't compatible with struct y*, because they could have different alignment or packing rules. But struct x* and void*, which is already special-cased to allow assignment without a cast? Why aren't these considered compatible in function pointer parameter definitions?

Is there any work involved in casting void* to struct* (on any architecture) that a plain function pointer would miss out?

wavemode1mo ago

It's undefined behavior due to the "strict aliasing" rule. You're simply not allowed to cast one pointer type to another (ever!) except for the following exceptions:

- casting an object pointer to or from void*

- casting an object pointer to or from char*

You're not doing either of those things. A function pointer is not an object pointer (the standard does not guarantee that the two kinds of pointer even have the same size/representation, and in fact on some esoteric hardware they don't), and even if it were, you aren't casting to or from void* or char*. So it's UB for two separate reasons.

jcranmer1mo ago

Sorry, this explanation is plain wrong.

You can cast between pointer types freely so long as they can be representable in one another (some casts are undefined because the address would be unaligned in the target pointer type, and there's actually no guarantee that pointers to objects and pointers to functions have the same representation).

Strict aliasing rules don't kick in at pointer type casting, but rather kick in at lvalue access--when you dereference a pointer, in other words--and you've also given the list of strict aliasing rules completely incorrectly.
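
To make the distinction concrete, here is a small sketch, assuming `float` is a 32-bit IEEE 754 type (the `_Static_assert` only checks the size). `float_bits` is a hypothetical name. Reading through the cast pointer is where the aliasing rules bite, and `memcpy` avoids that access entirely:

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "sketch assumes a 32-bit float");

/* Hypothetical helper: return the object representation of a float. */
uint32_t float_bits(float f)
{
    /* UB variant:  return *(uint32_t *)&f;
       The lvalue access through uint32_t is what violates the
       aliasing rules, not the pointer cast on its own. */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* copies bytes, no aliasing access */
    return bits;
}
```

On an IEEE 754 target, `float_bits(1.0f)` gives `0x3F800000`.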

j16sdiz1mo ago

Whether two function pointer types are compatible (in practice) depends on the machine-specific calling convention.

I guess enumerating all the possibilities just .. doesn't look right? It would make the standard too long and complex?

tomp1mo ago

Casting to a pointer of incompatible type is UB. The exception is casting to char*.

amiga3861mo ago

Tell me why struct* is incompatible with void* when it's such a standard case in C that you don't need a cast:

    struct foo *x = malloc(sizeof(struct foo)); /* malloc returns void* */

Or rather, tell me why the C11 standards committee decided to declare that struct* is incompatible with a void*

veltas1mo ago· 6 in thread

From the ANSI C standard:

  3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements.  Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).

Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.

By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.

dataflow1mo ago

> but that the consequence of this should be somewhat bounded or as expected for the target machine.

Aren't "unpredictable results" and "no requirements" contrary to the idea that the behavior would be "somewhat bounded"?

veltas1mo ago

Notice though "ignoring the situation" thru "documented manner characteristic of the environment". Even though truly you can read this in an uncharitable way, you could also try and understand the intent of this paragraph, and I think reading it for its intents is always the best way to interpret a language standard when the wording is ambiguous or soft, especially if you're writing a compiler.

I don't think you could sincerely argue that this definition intends to allow the compiler to totally rewrite your code because of one guaranteed UB detected on line 5, just that it would be good to print a diagnostic if it can be detected, and if not to do what's "characteristic of the environment". Does that make sense?

thomashabets21mo ago

Author here.

I touched on this in the "it's not about optimizations" section. It's not that the compiler is out to get you. It's that you told it to do something it cannot express.

It's like if you slipped in a word in French, and not being programmed for French, it misheard the word as a false friend in English. The compiler had no way to represent the French word in its parse tree.

So no, it's not overly legalistic. Like if the compiler knows that this hardware can do unaligned memory access, but not atomic unaligned access, should it check for alignment in std::atomic<int> ptr but not in int ptr? Probably not, right?

veltas1mo ago

It's not that your article specifically discusses this aspect, but I think it's an important part of the conversation that's being overlooked by commentators, that we've twisted the original intent of UB and made unnecessary work for ourselves. There's been too much scaremongering about UB that's gone beyond the real concerns. If you only fear UB and don't understand it then you are worse off for trying to write safe C or C++.

17186274401mo ago

The behaviour is bounded by the capability of your machine. It is unlikely that your desktop computer launches a nuclear missile, unless you worked for it to be able to do that.

lelanthran1mo ago

> Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph?

I've (fruitlessly) had this discussion on HN before - super-aggressive optimisations for diminishing rewards are the norm in modern compilers.

In old C compilers, dereferencing NULL was reliable - the code that dereferenced NULL would always be emitted. Now, dereferencing NULL is not reliable, because the compiler may remove that and the program may fail in ways not anticipated (i.e., no access is attempted to memory location 0).

The compiler authors are on the standards committee, and they tend to push for more cases of UB being added rather than removing what UB there is right now (for example, by replacing it with Implementation Defined Behaviour).

debugnik1mo ago· 4 in thread

As much as I agree with the intro, these examples aren't good and the overall article is just a veil for pushing LLM coding.

gblargg1mo ago

Agreed. One after another these are standard things you avoid when writing portable code (or don't need, like accessing the object at address 0). They come across like from someone who wants to write whatever they want and have it work the same on everything. To make it into a language that allows this would remove its advantage of being able to write to the platform when you want to.

boxed1mo ago

Not good how? Are they TRUE? If so that's super bad.

IshKebab1mo ago

They are true but I agree it's not a great article. C has an unending list of UB and given the title I was expecting a more comprehensive survey, but they actually just picked a few that are both fairly well known and not very interesting.

HelloNurse1mo ago

Some of the examples are somewhat formally true in theory and bullshit in practice; some are quite hallucinatory.

  - Creating a potentially troublesome misaligned int pointer is a precisely localized and completely explicit user mistake, not something that just happens because it's C.
  - Passing signed char to character classification functions that expect an unsigned char (disguised as an int) is a very specific dumb user error. The C standard could specify that all negative inputs, including EOF and invalid signed char values, are classified as not belonging to the character class, but I doubt the current undefined behaviour in isxdigit() etc. implementations ever went beyond accepting invalid inputs.
  - Casting floating point values to integer values in general requires taking care of whether the FP values are small enough to be represented and what to do with NaN and Inf values: not the language's responsibility. C offers a toolbox of tests, not ready-made application specific error handling.
  - Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is. Where stuff in an executable is loaded in memory, in the rare cases when it matters, can surely be affected with platform specific extensions, possibly at the level of linker commands with nothing appearing in the C source code.
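
For reference, the defensive versions of the second and third bullets are short. A sketch with hypothetical helper names (`is_hex_digit`, `double_to_int`), assuming `int` is narrow enough (32 bits here) that `INT_MIN - 1` and `INT_MAX + 1` are exactly representable as `double`:

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical wrapper: the <ctype.h> functions want an unsigned char
   value (or EOF) passed as int, so convert a plain char first. */
bool is_hex_digit(char c)
{
    return isxdigit((unsigned char)c) != 0;
}

/* Hypothetical helper: convert only when the truncated value fits in int.
   NaN fails both comparisons, so it is rejected as well. */
bool double_to_int(double d, int *out)
{
    if (!(d > (double)INT_MIN - 1.0 && d < (double)INT_MAX + 1.0))
        return false;
    *out = (int)d;  /* truncates toward zero, result is in range */
    return true;
}
```

What to do on a `false` return (clamp, report, abort) stays with the caller, which is the application-specific part.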

weinzierl1mo ago· 4 in thread

A fun one that'd fit the list would be sequence point violations like

    i = i++

radiospiel1mo ago

Fun, sure, but also GCC and Clang will both warn with -Wall (-Wsequence-point / -Wunsequenced).

account421mo ago

This would also be a code smell even if it was well defined.

marcosdumay1mo ago

Yes, it should be an explicit error. Not undefined.

leni5361mo ago

Only in C, that one is defined in C++.

edit: I'm not sure it's even undefined in C.

akiarie1mo ago· 4 in thread

C is still, by far, the simplest language that we have.

Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.

People complain about C as though they know how to fix it.

simonask1mo ago

C is not a simple language in the sense that writing software in C is simple, and I think that's the only useful way to understand the word "simple" in this context.

Brainfuck is "simple" by any other definition as well, but that's not a useful quality.

spacedcowboy1mo ago

C is a far simpler language than, for example, Swift. Its cognitive load in order to actually write something is pretty small - even the authors state that their book about C is intentionally slim because the concepts to understand are not that many.

That doesn't mean that C is a safer language than Swift, or a less-capable language than Swift. But in terms of "easy to understand along the happy-path", it's a lot easier to get going in C.

Swift, for example, bakes a whole load of CS-degree-level ideas and concepts into the basic language with its optionals, unwrapping, type-inference, async/await, existential types, ... ... ... . C doesn't do any of that. There are (many!) more footguns in C, but the language is less complex as a result.

Brainfuck is not at all simple, from that point of view. This is a valid Brainfuck program:

    >+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.[-]>++++++++[<++++>-]<. >+++++++++++[<+++++>-]<.>++++++++[<+++>-]<.+++.------.--------.[-]>++++++++[<++++ >-]<+.[-]++++++++++.

This is the equivalent C program

    #include <stdio.h>
    int main() { printf("Hello world!\n"); }

One of these is far simpler than the other.

[edit: changed to make the examples do the same thing]

jeroenhd1mo ago

C sits right in the middle between assembly and BASIC in terms of simplicity. You can't do a simple popcnt, but you can implement jump tables.

It's slower than Fortran and, depending on the platform, cobol. It's a bigger minefield than any language that came after it barring C++.

The only real advantage I can ascribe to C is that it's actually still being used after all these years, and it mostly works similarly on most hardware, like a Java for people who enjoy the casino.

Fixing C without breaking existing C code is pretty much impossible. You can start by defining warnings for UB, but then you will break any of the more trivial examples in the article. You can also start by simply killing off weird platforms (force a specific amount of bits for instance, screw the weird 16 bit char chips). Making casts explicit would probably fix a lot of problems too, though you'd need better syntax for that.

There is no fixing C without changing what C really is.

dns_snek1mo ago

Can you elaborate what do you think C has in terms of simplicity that Zig doesn't, and which "same kinds of issues" do you think it has?

I'm not an expert in either language but my anecdotal experience disagrees with this - writing Zig has been far simpler and less error-prone than writing C.

elnatro1mo ago· 4 in thread

Is there a way to avoid undefined behavior in C then? Could we write a new C compiler that adds some checks and fixes (e.g. raise documented exceptions) to each undefined behavior?

u1hcw9nx1mo ago

That post is just a hyperbolic rhetorical piece, not even a good technical shade. There are plenty of tools that restrict C to a defined-behavior subset. HN is just not aware of them. NASA, aerospace and the car industry are big customers of such static analyzers and compilers.

Good open source ones:

Frama-C

IKOS (from NASA)

elnatro1mo ago

It’s been a while since I programmed in C. Thank you for these resources.

saagarjha1mo ago

Not all of them but there are many tools that can try to define behavior for this code to help shake them out of your codebase.

peterfirefly1mo ago

ubsan.

Doesn't catch all of it.
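
For the cases a sanitizer only reports at run time, the check can also be written by hand before the operation. A minimal sketch for signed addition; `checked_add` is a hypothetical helper, not a standard function:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: add two ints only if the result is representable.
   The range test itself never overflows, so no UB is involved. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return false;  /* would overflow */
    *out = a + b;
    return true;
}
```

C23 adds `ckd_add` in `<stdckdint.h>` for the same job, if your toolchain has it.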

maple31421mo ago· 3 in thread

Is this a correct understanding of UB in C? A program P has a set of inputs A that do not trigger UB, and a complementary set of inputs B that do trigger UB. A correct compiler compiles P into an executable P'. For all inputs in A, P' should behave the same as P. However, for any input in B, there are absolutely no requirements on the behavior of P'.

simonask1mo ago

Intuitively yes - the program will be compiled as if B-inputs are never passed to the program, and that can include eliminating code that tries to detect B-inputs.

mbrock1mo ago

This is a description of an imaginary compiler, evoked by the ANSI/ISO standards documents, which has never existed and will never exist. To understand what the program will do, you just have to understand the compiler behavior on your target platforms. A helpful intuition pump is: imagine the ANSI/ISO specifications simply do not exist; now what? Well, you just continue your engineering practice, the way you would for any of the myriad languages that never even had a post hoc standards document.

17186274401mo ago

Yes, that's a good summary.

raluk1mo ago· 3 in thread

In C / C++ there are two kinds of undefined behaviour. One is what the standard explicitly calls UB. The other is everything else that is not in the standard.

thaumasiotes1mo ago

Technically, that's only one kind, because it's written in the standard that anything not mentioned in the standard is undefined behavior.

cepepe1mo ago

One kind, but two different classes of undefined behaviour.

wiseowise1mo ago

https://en.wikipedia.org/wiki/There_are_unknown_unknowns

jraph1mo ago· 3 in thread

Yet another push to use LLMs after casting fear. Now it should be illegal not to use LLMs. A good start of the day.

(I hope casting fear is not UB)

raverbashing1mo ago

> (I hope casting fear is not UB)

I'm sure that's UB in C

In C++ just use <reinterpret_cast>

wg01mo ago

The irony is unmistakable.

stevenhuang1mo ago

There is nothing ironic in letting an llm have a pass at identifying potential UB and other correctness issues in C code.

I say this as an experienced C developer.

jb19911mo ago· 2 in thread

Some of the C++ code in this article has not been idiomatic in over a decade, and would be considered a code smell today. The language has evolved into quite a different language than when it was first created. As soon as I saw all of those raw pointers and direct pointer access, it was clear that at least part of this article should be taken with a grain of salt.

The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.

debugnik1mo ago

I was about to call out that the code is supposed to be C and not C++, but I double checked and I realised it actually says std::atomic<int>, not atomic_int!

jb19911mo ago

Exactly, this is very old C++ on display in this article. It’s certainly not as safe as a language like Rust, but quite a lot of undefined behavior and things that let you shoot yourself in the foot have been changed over the last 10 years.

Most C++ today will be immediately obvious and not accidentally mixed up with C.

pizlonator1mo ago· 2 in thread

The problem is incorrectly assuming that the spec is meaningful in some kind of rigorous way.

It’s not. All that matters is what C compilers actually do and what real C programs expect.

This is a good thing. It creates a culture where the two sides meet each other where they’re at

BearOso1mo ago

We also have a very limited number of compilers and a small number of prevalent architectures today. As long as you know the behavior of the target compiler and architecture, the behavior is defined, it's just not specified.

pizlonator1mo ago

This is true.

But what I’m saying has always been true. What has changed is that the effective portability of C and C++ code has increased due to the reduction in the number of compilers and arches.

mjs011mo ago· 2 in thread

Integer promotion seems to be the source of many signed integer overflow UB. Why does C have it? Does integer promotion ever have a good part?
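
The classic illustration, as a sketch assuming a 32-bit `int` (`mul_u16` is a made-up name): both operands are unsigned, yet a plain `a * b` is carried out in signed `int` after promotion.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two 16-bit values without signed overflow. */
uint32_t mul_u16(uint16_t a, uint16_t b)
{
    /* Writing a * b promotes both operands to int. With a 32-bit int,
       65535 * 65535 does not fit, and signed overflow is UB.
       Converting one operand first keeps the arithmetic unsigned. */
    return (uint32_t)a * b;
}
```

With the conversion in place, `mul_u16(65535, 65535)` is 4294836225, which fits in 32 unsigned bits.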

saagarjha1mo ago

Yes, it simplifies a lot of code that would otherwise be littered with casts.

peterfirefly1mo ago

Could be fixed by having a nicer casting syntax (like Rust) or by not having so damn many scalar types that are used in practice.

"Explicit casts only" worked fine in Modula-2, which doesn't have as many scalar types.

codeflo1mo ago· 2 in thread

> The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions.

The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.

thomashabets21mo ago

Author here

> The part about hardware is wrong BTW

Could you be more specific? I think by "wrong" you may mean "not actually relevant to UB", and you're right about that. If that's what you mean then that part is not for you. It's for the "but it's demonstrably fine" crowd.

> the hardware semantics are clearly defined

Yup. The article means to dive from the C abstract machine to illustrate how your defined intentions (in your head), written as UB C, get translated into defined hardware behavior that you did not intend.

I'm not saying the CPU has UB, and I wonder what part made you think I did.

That's what I mean by game of telephone. The UB parts get interpreted as real instructions by the hardware, and it will definitely do those things. But what are those things? It's not the things you intended, and any "common sense" reading of the C code is irrelevant, because the C representation of your intentions was UB.

codeflo1mo ago

It seems like I simply misunderstood the point of the "game of telephone" metaphor. To be honest, even with your added explanation, I don't fully get why you express it that way. But I think we're in agreement on the substance, and I shouldn't have worded my response so harshly.

fjfaase1mo ago· 2 in thread

Is comparing a signed integer with an unsigned integer UB? I recently wrote some code and compiled it with gcc to x86_64 (without optimization) that returned an incorrect answer.

Karliss1mo ago

No UB, but the integer promotions rules apply.

When comparing signed and unsigned integers of the same size, the signed one will be converted to unsigned. In a reasonably configured project the compiler will warn about it.

In case of integers smaller than int, promotion to int happens first.

In case of signed and unsigned integers of different size, the smaller one will be converted to bigger one.
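
A small sketch of the same-size case, with hypothetical function names. The naive comparison is fully defined, it just gives the "wrong" answer for negative values, which is probably what happened above:

```c
#include <stdbool.h>

/* a is converted to unsigned before the comparison, so -1 becomes
   UINT_MAX and compares greater than any small b. Defined, but surprising. */
bool naive_less(int a, unsigned b)
{
    return a < b;
}

/* Hypothetical fix: handle the negative case before converting. */
bool safe_less(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```

So `naive_less(-1, 1u)` is false while `safe_less(-1, 1u)` is true.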

benchloftbrunch1mo ago

It's not UB. Integer promotion applies, the signed int is implicitly coerced to unsigned (or the other way around - don't remember which.)

my-next-account1mo ago· 2 in thread

Hello, it's me. I'm not afraid of UB.

my-next-account1mo ago

To be honest, miscompilations because of UB are exceedingly rare, and we do a lot of weird shit in our code.

saagarjha1mo ago

You should be!

mbrock1mo ago· 2 in thread

most languages don't even HAVE a specification so in most languages literally EVERYTHING is undefined behavior

oersted1mo ago

UB doesn't mean that it is not specified (actually it is often very well specified), it means that compilers can and do assume that such code patterns will not be present. Those cases may not be considered and can lead to unexpected behaviour.

Additionally, some (most?) UB is intentionally UB so that optimisers are free to do fancy tricks assuming that certain cases will never happen. Indeed, this is required for high performance. If they do happen, again, it can lead to unexpected behaviour.

PS: Most languages that don't have a specification declare their primary implementation to be specification-as-code. Rust is an example of that, and it does still have UB: the cases that the compiler assumes will not happen.

mbrock1mo ago

undefined behavior is the behavior of code patterns "for which this International Standard imposes no requirements" and the behavior is in fact almost always predictable and agreed upon by compiler vendors and the users of the language, which is why you are able to use programs that rely on undefined behavior probably every single second you are using the computer

edit: for example I'm typing this into Safari which means probably every key press and event is going through JSC JIT compiled functions—which have, structurally and necessarily and intentionally, COMPLETELY undefined behavior according to the spec—and yet it miraculously works, perfectly, because the spec doesn't really matter

momo261mo ago· 2 in thread

Debugging in C is soooo hard. When I was writing Malloc Lab in a systems course, there were countless undefined-behavior and out-of-range errors :(

flohofwoe1mo ago

Yet, debugging memory corruption issues in C and C++ code with modern compiler toolchains and memory debugging tools is infinitely easier than 25 years ago.

(e.g. just compiling with address sanitizer and using static analyzers catches pretty much all of the 'trivial' memory corruption issues).

feelamee1mo ago

I think vice versa - C is so simple that debugging it is just a pleasant walk.

Especially compared with modern languages with lambdas/exceptions/virtual functions and so on.

The one thing I see can make it harder is function pointers.

nullpwr1mo ago· 2 in thread

Excellent post. But it's addressed to the wrong people.

The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.

Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.

HarHarVeryFunny1mo ago

I'd say the unaligned pointer one is the language's fault. The language should not let you create an invalid pointer, or at least warn you when you are doing so.

OTOH one could argue that creating truly portable programs is not possible since a programming language is a leaky abstraction - different machines have different endianness, different alignment requirements, different amounts of memory, etc. One could argue therefore that the language should not make any assumptions about the alignment restrictions, or lack of them, on the machine you are compiling for. Just document that "manually created" pointers may be unaligned and have machine-dependent behavior. A nice compiler could still generate a warning or error if you create a pointer that doesn't meet the alignment requirements of the target you are compiling for.

C/C++'s provision of type casts reflects that the language has made the design decision to not restrict the user, and let them step outside the bounds of any guarantees the language provides if they want to. Unions are also a form of type cast.

nullpwr1mo ago

> The language should not let you create an invalid pointer, or at least warn you when you are doing so

completely agree!

rom1v1mo ago· 1 in thread

A concrete example of undefined behavior caused by an unaligned pointer: https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on...

gblargg1mo ago

Specifically on x86 where it's assumed that won't cause problems.

rurban1mo ago· 1 in thread

Very bad advice. Of course good new LLMs know about UB, but you still need to use ubsan (i.e. -fsanitize=undefined), and not your LLM.

formerly_proven1mo ago

Coding agents write unsound Rust any day, too. unsafe impl Send … is much easier than fixing a bad design and it might even work momentarily.

lelanthran1mo ago· 1 in thread

I read through this in detail... Is it just me, or are these things that are invoked by intentionally bypassing the typing?

I mean, you have to go out of your way and use a cast to get the UB in the first example.

For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.

For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.

> For all you know the compiler has no internal way to even express your intention here.

I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?

> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.

I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.

I think only the final one is of note (the 24-bit shift assigned to a uint64_t).

account421mo ago

> I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.

Probably confusion with C++ where NULL is 0 which is a special case that can be implicitly cast to both integers and pointers, unlike non-zero constants. C doesn't need this because it doesn't require explicit casts from void pointers to others.

commandlinefan1mo ago· 1 in thread

A lot of this stems from trying to insist that char just means "small" and not "8 bits" and that int means "bigger than that" and not "32 bits". In fairness, K&R dealt with an era where 9 bit architectures existed, but char is 8 bits now. Everywhere.

jeroenhd1mo ago

In the world of microcontrollers, CHAR_BIT can be 16 or some other funky number. char is usually 8 bits in size, though.
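
If code does depend on 8-bit bytes, it can at least say so and fail at compile time elsewhere. A minimal sketch using C11 `_Static_assert`; `bits_in` is a hypothetical helper that is only there to show `CHAR_BIT` in use:

```c
#include <limits.h>
#include <stddef.h>

/* Refuse to compile on targets where a byte is not 8 bits wide. */
_Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Hypothetical helper: number of bits occupied by n bytes on this target. */
size_t bits_in(size_t n)
{
    return n * CHAR_BIT;
}
```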

danborn261mo ago· 1 in thread

The scariest part is how many production systems rely on undefined behavior without anyone knowing until a compiler update breaks everything.

coolThingsFirst1mo ago

Where does the scary part come from? Do they run on planes?

tomcam1mo ago· 1 in thread

I fear I will be downvoted into oblivion but I also want to learn from this.

First let me state the case for C. It’s meant to be used as a systems language that’s as close to assembly as possible while remaining portable (compared to assembly). As such it’s the first high-level language developed for any new processor.

Given the above predicate: Isn’t everything described in the article as it should be?

Add too much to the language and it becomes less possible to implement on new architectures, right? Because the undefined behavior lets implementors stand up new compilers fairly quickly.

For less undefined behavior isn’t it better to use languages that have that in their DNA? D, Zig, Go, Java, etc?

vladms1mo ago

> Given the above predicate: Isn’t everything described in the article as it should be?

I think the real trick question is "as it should be for whom?".

Reading the comments I think people underestimate the complex interaction between:

- engineers that design hardware (they don't care much about the compiler, except when it has to fix their mistakes)

- engineers that do the compiler (they have to struggle with all quirks of the new architecture and all of the complaints of the users)

- users of the new system (hardware + compiler) that just want to take their 100k lines of code (libraries) and just use it on the new system with better performance (as that's what the hardware people promised!)

- users working on one architecture all their lives

For the compiler people, yes, probably most of what is described is as it should be. For the users (that care about performance and not making porting efforts), probably no.

Now, even when I was doing compiler work we had a hard time explaining to our users why we couldn't do some things they wanted (while also improving performance and not changing code that was already written), so explaining that on the internet seems to me a lost battle.

I am sure there are things that can be improved, and standards evolve. But the problem is very complex given the sheer amount of code written and the strange architectures out there.

kajaktum1mo ago· 1 in thread

I want a language that is a group of bit (0,1) and the xor operator. Everything else is built on top of that.

benchloftbrunch25d ago

Xor isn't Turing complete sadly.

1 more reply

cracki1mo ago· 1 in thread

We know. This is not news.

boxed1mo ago

It seems to be news to many, many programmers who keep using C++.

NooneAtAll31mo ago· 1 in thread

feels like https://xkcd.com/1499/

the only people complaining about being able to do awful things are people that do awful things

gritzko1mo ago

- a metal bar always sinks

- unless you are trying to sink it in mercury. then it floats

- unless it is an uranium bar

- go sink uranium bars in mercury yourself

nokeya1mo ago· 1 in thread

Ok, and?

wg01mo ago

"Rewrite everything in Rust. OMG universe is written in Rust so memory safe with zero allocations"

JonChesterfield1mo ago

Well, you can't write malloc in conforming C, which hurts rather more than remembering to write bitcast as memcpy on char pointers.

Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (modulo bugs) much better behaved than the spec suggests.

Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.

The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance not a blocker.
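
The "bitcast as memcpy" idiom mentioned above can be sketched roughly like this. The helper name `float_bits` is made up for illustration, and the sketch assumes `float` is 32 bits wide, which the standard does not promise:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret the bytes of a float as a uint32_t. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    _Static_assert(sizeof(uint32_t) == sizeof(float),
                   "this sketch assumes a 32-bit float");
    /* memcpy copies the object representation byte by byte, so no
     * object is accessed through an lvalue of the wrong type. */
    memcpy(&u, &f, sizeof u);
    return u;
}
```

On an implementation that uses IEEE 754 binary32 for `float`, `float_bits(1.0f)` would come out as `0x3f800000`. Compilers commonly turn the `memcpy` into a plain register move, but that is an optimization, not a guarantee.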

sltr1mo ago

For a deep dive on UB with printf, see https://srs.fyi/see-conversions/

> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).

bkallus1mo ago

> the OpenBSD project has not been very receptive in the past for bug reports, my sense of “this is probably fine, in practice”, and that if OpenBSD wants to weed out UB from their code base, then that’s a major project that should be done in a better way than me just being the middle man between the LLM and them for a patch here and there.

Part of the reason for all the UB in OpenBSD is that UBSan doesn't run on that platform. When I ported OpenBSD's httpd to Linux, I found that UBSan tripped before the server even came up because the config flag parsing shifts into the MSB of a signed integer.

I tried to contribute back a patch (just make the flag bitfield unsigned), but it was ignored. I think if UBSan ran natively on OpenBSD, then there would be a lot more of these patches, and the maintainers would have to take an official stance on whether they think these bugs matter.
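
As a rough illustration of that kind of fix (the flag names below are hypothetical, not taken from httpd), keeping the flag word and the shifted constant unsigned avoids shifting into the sign bit:

```c
#include <stdint.h>

/* Hypothetical flag names for illustration only. */
#define FLAG_LOW   (UINT32_C(1) << 0)
#define FLAG_HIGH  (UINT32_C(1) << 31)  /* defined: the shifted operand is not a signed int holding 1 */
/* By contrast, (1 << 31) is undefined where int is 32 bits, because
 * the result is not representable in int. */

static uint32_t set_flag(uint32_t flags, uint32_t flag)
{
    return flags | flag;
}

static int has_flag(uint32_t flags, uint32_t flag)
{
    return (flags & flag) != 0;
}
```

UBSan's shift checks should stay quiet on this version, since every shift result fits in the type it is computed in.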

psim11mo ago

I like the ideas of this article but would not use SPARC as a main badguy in my examples. A naive and probably popular takeaway would be, "Thank goodness I am not writing for SPARC and don't need to worry about these SPARC architectural concerns!"

casey21mo ago

And that's a good thing. UB is another mechanism to speed up the development of compilers, many other languages fall trap to over defining while we lack the methods to solve such problems cleanly (believe me, the modern c++ people have tried). Usually this is the case because they believe strongly that their methods work despite evidence.

As for UB, the compiler has the final say. Nobody should write nontrivial c without understanding their compiler, the same as nobody should write c without understanding their text editor.

Code in other languages breaks between versions, in c there are projects with code from every version at once!

Looking at it another way, work put into a c compiler enables you to write nontrivial code.

hunterpayne1mo ago

What all these C programmers are pointing out is 2 fold:

- Making a Turing machine have deterministic and predictable results is hard.

- Modern hardware is complex and getting all hardware to behave the same way requires a strong mathematical abstraction.

C was never intended to be a fully defined mathematical abstraction. It was a language which was easy to write a compiler for. That's its original strength. Trying to make it something it isn't is the problem. Either choose a language which does have such abstractions or understand the drawbacks of the tool you are using.

Right tool for the right job.

keyle1mo ago

When talking UB, putting C and C++ in the same basket is basically like comparing drunk driving a car and riding a bicycle sober... Both means of transport, very different experience.

wyldfire1mo ago

Maybe we should criminalize writing articles about Undefined Behavior that have a "So what do we do now?" subheader but omit any mention of UBSan.

bvrmn1mo ago

I really like Zig's approach to UB. Especially that alignment is part of the type. And all those wordy builtins for conversions. Staring at them makes you think about what you're doing wrong with your data model when it now requires 3 lines of casting expressions.

1vuio0pswjnm71mo ago

"My point is that ALL nontrivial C and C++ code has UB."

Is "nontrivial" defined

How would one identify "nontrivial" C code

Is there an objective measure (defined)

Or is it a matter of personal opinion that could vary from person to person (undefined)

stackedinserter1mo ago

How can it be valid implementation of isxdigit?

    int isxdigit(int c) {
        if (c == EOF) {
            return false;
        }
        return some_array[c];
    }

If you write code like this, then everything in programming is UB.
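
For context, the standard requires the argument of the `<ctype.h>` functions to be either EOF or a value representable as `unsigned char`; anything else is undefined, which is what lets a table lookup like the one above be conforming. The usual caller-side habit can be sketched as follows (the wrapper name `is_hex_char` is made up for illustration):

```c
#include <ctype.h>

/* Hypothetical wrapper: safe to call with any char value, including
 * negative ones on platforms where plain char is signed. */
static int is_hex_char(char c)
{
    /* Convert to unsigned char first so a negative value never
     * reaches isxdigit(). */
    return isxdigit((unsigned char)c) != 0;
}
```

The cast is the whole point: `isxdigit(c)` with a negative `char` other than EOF is the case the standard leaves undefined.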

benj1111mo ago

The issue for me with posts like this is that it misses the issue.

Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.

Instead it's been hijacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."

    int c = abs(a) + abs(b);
    if (a > c) // overflow

Is UB because some system might do overflow differently. In practice every system wraps around.

That should be a valid check, instead it gets optimised away because it 'can't' happen.

C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
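
One way to keep such a check from being optimised away is to test before the addition rather than after it, so the overflow never happens. A minimal sketch, with a made-up helper name `add_would_overflow`:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: true if a + b would not fit in an int.
 * Only comparisons and subtractions that cannot overflow are used. */
static bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;  /* INT_MAX - b cannot overflow for b > 0 */
    if (b < 0)
        return a < INT_MIN - b;  /* INT_MIN - b cannot overflow for b < 0 */
    return false;
}
```

GCC and Clang also offer `__builtin_add_overflow` and the `-fwrapv` flag for this, but those are compiler extensions rather than standard C.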

QuiEgo1mo ago

C does not abstract differences in underlying hardware well. Systems programmers know if they have an architecture that can't handle unaligned accesses or that the address they are doing load/stores from is a mmio register. Systems programmers know the difference between a virtual address and a physical address and have debugged MPU faults or MMU table walks and page faults more times than they want to think about.

C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.

C is great for writing low-level system code where you need to optimize performance down to the last cycle. It not abstracting away the hardware is super important for some use cases. A classic example is all of the platform-specific flavors of memcpy in the Linux kernel that are C/assembly hybrids hand-optimized for the SIMD pipelines of some CPUs.

C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.

saltyoldman1mo ago

Probably not "everything", but the vast, vast, vast majority of everything you are looking at on your screen right now is written in C.

y421mo ago

shameless plug, it's part of the Nerd Encyclopedia: it's also called "nasal demons".

https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...

SanjayMehta1mo ago

I used to teach C programming and one time I got anonymous feedback: "when this instructor doesn't know the answer he says "it's compiler dependent.""

Shrug.

up2isomorphism1mo ago

U just need to read the title and 5 lines to know this must be a rust guy.

justmarc1mo ago

The art is actually making sure it all stays defined behavior

fithisux1mo ago

UB can also have an impact on the logical cohesion of a codebase.

alper1mo ago

Isn't the article mostly saying that SPARC sucks?

DostLeFan1mo ago

Very interesting article. I'm in love with C++, and I cannot say that I'm a good developer, but it's interesting to discover where UB can be. (Sorry, I'm not a good English speaker.)

0x20cowboy1mo ago

Life is undefined behaviour.

el_pollo_diablo1mo ago

> probably meaning on an address that’s a multiple of sizeof(int), but who knows

Sigh. s/sizeof(int)/_Alignof(int)/.

There are good reasons for an implementation to have sizeof(int) = _Alignof(int) and not a mere multiple of it, but if you are going to discuss subtle points and UB, just stick to the language guarantees.

> But let’s say you have a modern machine, where NULL is a pointer to address zero, and you actually have an object there.

You don't program in C on such a machine. Or maybe memory is virtualized, and it does not matter that your object lives at physical address zero, as long as you can map a non-zero virtual address to it.

> So how do you print an uid_t?

    if ((uid_t)-1 < (uid_t)0) {
        // uid_t is signed
        printf("%" PRIdMAX, (intmax_t)id);
    } else {
        // uid_t is unsigned
        printf("%" PRIuMAX, (uintmax_t)id);
    }

> It’s not rare for the denominator to come from untrusted input.

It's not rare for the array index to come from untrusted input.

It's not rare for the supposedly valid UTF-8 string to come from untrusted input.

...

Why single out division? This problem affects every partially defined operation. In the case of division at least, everyone learned in school that thou shalt not divide by zero. Adding two untrusted integers and forgetting that signed overflow is UB, not defined as a modulo? Your average programmer is much less likely to see that coming.

    > unsigned char a = 0xff;
    > unsigned char b = 1;
    > unsigned char zero = 0;
    > bool overflowed = (a + b) == zero;
    >
    > unsigned char a = 0x80;
    > uint64_t b = a << 24;

Please. Convert your operands to wide enough types before the operation. Convert your results back to narrow enough types to compensate for integer promotion to wider types than you would have liked. Do that consistently, and you're good.

Here:

    unsigned char a = 0xff;
    unsigned char b = 1;
    unsigned char zero = 0;
    bool overflowed = (unsigned char)(a + b) == zero;

    unsigned char a = 0x80;
    uint64_t b = (uint32_t)a << 24;
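
Wrapped into functions so the difference can be checked, the two fixes above might look like this. The function names are made up for illustration, and the sketch assumes 8-bit `char` with a wider `int`:

```c
#include <stdbool.h>
#include <stdint.h>

/* Without a cast, a + b is computed in int after integer promotion,
 * so 0xff + 1 is 256 and the comparison with 0 is false. */
static bool naive_wraps(unsigned char a, unsigned char b)
{
    return (a + b) == 0;
}

/* Converting the sum back to unsigned char restores the wraparound. */
static bool wraps_to_zero(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b) == 0;
}

/* Widening to an unsigned 32-bit type before shifting keeps the shift
 * out of signed int, where 0x80 << 24 would not be representable. */
static uint64_t shift_byte_24(unsigned char a)
{
    return (uint32_t)a << 24;
}
```

With these, `naive_wraps(0xff, 1)` is false, `wraps_to_zero(0xff, 1)` is true, and `shift_byte_24(0x80)` yields `0x80000000`.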

groby_b1mo ago

"not correctly aligned (probably meaning on an address that’s a multiple of sizeof(int), but who knows)"

I stopped reading there. If you have decades of experience in C/C++ and don't know what that means (and that it's arch specific), I'll assume those decades were mostly the same year over and over.

C/C++ are horrible languages, but they deserve better opponents than that.

pphysch1mo ago

It's also worth highlighting that C is perhaps the most officially standardized programming language in history.

What a contradiction. Strong evidence that standard-driven programming language development is much worse than implementation-driven development. Standards should be used for data types and external interfaces/protocols, not programming languages.

feelamee1mo ago

> We need some way of fixing UB at scale, without committing AI slop nor overwhelming human reviewers.

Write a compiler which will define all this behavior. Usually people forget that UB exists only in the standard. In practice it is always defined.

P.S. of course, while your hardware + firmware stay unchanged

P.S. not always defined in documentation - I mean defined in e.g. code

synergy201mo ago

If C is more UB-unsafe than it seems, what is the solution here?

EGreg1mo ago

a good case can be made that use of C++ is a SOX violation

So Linus was right? But for a second reason too:

C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much, much easier to generate total and utter crap with it. Quite frankly, even if the choice of C were to do _nothing_ but keep the C++ programmers out, that in itself would be a huge reason to use C.

That is, accepting C++ code from programmers who use C++ could be a SOX violation ;-)

VimEscapeArtist1mo ago

Wait until he discovers PowerShell ;D

Webhix1mo ago

maybe rewrite this in go?)

reinhash1mo ago

Rust.

grougnax1mo ago

Use Rust!

liamd19881mo ago

When using C, keep using char* and don't mess with int*.

ricardobeat1mo ago

I’ve been heavily invested in https://c3-lang.org/ the past couple months. How does it look from this perspective to someone with C experience?

bullen1mo ago

Everything in Java is defined behaviour, you need a VM with GC to remain sane.

Everything else is a waste of time!

logicchains1mo ago

The concept of undefined behaviour is also a very useful lens for understanding LLM-based coding. Anything you don't explicitly specify is undefined behavior, so if you don't want the LLM to potentially pick a ridiculous implementation for some aspect of an application, make sure to explicitly specify how it should be implemented.

muvlon1mo ago· 36 in thread

Yes there is tons of surprising and weird UB in C, but this article doesn't do a great job of showcasing it. It barely scratches the surface.

Here's a way weirder example:

  volatile int x = 5;
  printf("%d in hex is 0x%x.\n", x, x);

thomashabets21mo ago

Author here.

> It barely scratches the surface.

I agree. The point of the post is not to enumerate and explain the implications of all 283 uses of the word "undefined" in the standard. Nor enumerate all the things that are undefined by omission.

The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

FTA:

muvlon1mo ago

Fair enough!

> And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

And I 100% agree. UB is way overused by these standards for how dangerous it is, and as a consequence using C (and C++) for anything nontrivial amounts to navigating a minefield.

3 more replies

lelanthran1mo ago

> The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

What are you talking about? UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen". It was "If you do this, that will happen".

2 more replies

saghm1mo ago

> if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL

1 more reply

tialaramex1mo ago

Volatile is a type system hack. They should have done a more principled fix, and certainly modern languages should not act as though "C did it" makes it a good idea.

HarHarVeryFunny1mo ago

1 more reply

MobiusHorizons1mo ago

Volatile on a non pointer value is not for MMIO, though, that’s typically for concurrency like with interrupts.

1 more reply

rcxdude1mo ago

Yeah, it's also cleaner to be able to mark particular reads and writes as having side effects as opposed to having it be a property of the variable.

tardedmeme1mo ago

The Linux kernel uses READ_ONCE and WRITE_ONCE, which look like actual function calls, which is very sensible.
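
The general idea of marking the access rather than the variable can be sketched in plain C as below. These helpers are hypothetical and are not the kernel's macros; they only illustrate putting the `volatile` qualifier at the point of use:

```c
/* Hypothetical helpers, not the kernel's READ_ONCE/WRITE_ONCE: the
 * volatile qualifier is applied to a single access instead of to the
 * variable's declaration. */
static int read_once_int(const int *p)
{
    return *(const volatile int *)p;
}

static void write_once_int(int *p, int v)
{
    *(volatile int *)p = v;
}
```

A caller would then write `int val = read_once_int(&x);` and use `val` twice, which makes it obvious in the source that exactly one read of `x` is intended.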

saagarjha1mo ago

Source?

2 more replies

pron1mo ago

> In C, we can have a data race on a single thread and without any writes!

muvlon1mo ago

pjmlp1mo ago

1 more reply

simonask1mo ago

I think the article's point is that you don't actually have to get weird at all to run into UB.

kzrdude1mo ago

My go-to example of "UB is everywhere" is this one:

    int increment(int x) {
        return x + 1;
    }

Which is UB for certain values of x.
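
The value in question is `INT_MAX`, since `INT_MAX + 1` is not representable in `int`. A checked variant could be sketched like this (the name `checked_increment` is made up for illustration):

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical checked variant: reports failure instead of overflowing.
 * On failure, *out is left untouched. */
static bool checked_increment(int x, int *out)
{
    if (x == INT_MAX)
        return false;  /* x + 1 would be undefined here */
    *out = x + 1;
    return true;
}
```

The cost is that every caller now has a failure path to handle, which is arguably the point the thread keeps circling around.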

2 more replies

saghm1mo ago

jstimpfle1mo ago

If you want to be standards correct, yes you have to know the standard well. True. And you can always slip, and learn another gotcha. Also true. But it's still extremely flexible.

2 more replies

3form1mo ago

3 more replies

HarHarVeryFunny1mo ago

> In C, we can have a data race on a single thread and without any writes!

vlovich1231mo ago

> There should have been a carve out in the "unsequenced side effect" definitions for volatile variables.

mananaysiempre1mo ago

That said, your “common parlance” definition of “data race” is not the definition used by the C standard, so your last sentence is at best misleading in a discussion of standard C.

(Here “conflicting” and “happens before” are defined in the preceding text.)

tsimionescu1mo ago

Your first paragraph makes it sound as if the compiler will actually generate two reads of the value of some register, which might lead to unexpected effects at runtime for certain special registers.

For example, the following program:

  int y = rand();
  if (y != 8) {
    volatile int x;
    printf("%d: %d", x, x);
  } else {
    printf("y is 8");
  }

Can be optimized to always print "y is 8" by a perfectly standard compliant compiler.

2 more replies

rocketrascal1mo ago

Are you sure?

>unsequenced side effects on the same scalar object are UB

>6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.

Read 5.1.2.4.3:

"If A is not sequenced before or after B, then A and B are unsequenced."

"Evaluations A and B are indeterminately sequenced when A is sequenced either before or after B, but it is unspecified which."

With a footnote saying this:

"9)The executions of unsequenced evaluations can interleave. Indeterminately sequenced evaluations cannot interleave, but can be executed in any order."

berti1mo ago

Reading a register from a microcontroller peripheral may well reset it as an example of a possible side-effect here, and that's exactly the kind of thing you use volatile for.

smsm421mo ago

zahlman1mo ago

> Here's a way weirder example:

And yeah, being able to say "reading is a side effect" is important when for example you interact with certain memory-mapped devices.

sethev1mo ago

Yes, there is a data race there. The value of a volatile can be changed by something outside the current thread. That’s what volatile means and why it exists.

Edit: thread=thread of execution. I’m not making a point about thread safety within a program.

mananaysiempre1mo ago

1 more reply

trissylegs1mo ago

Can also represent a register that has an effect reading it. Reading a memory mapped register can have side effects. Like memory mapped io on a UART will fetch the next byte to be read.

frollogaston1mo ago

Was going to say the same thing until I saw this comment. volatile is defined the way I'd expect, plus it's a strange code example.

jstimpfle1mo ago

Not sure why you're being downvoted. That's completely right. The example is silly. The code is obviously bad, doesn't matter if it's UB or not.

But whatever language lawyers think, the code is obviously broken, with an obvious fix, so I'm not so interested in what its semantics should be. Here is the fix:

    volatile int x;
    // ...
    int val = x;  // volatile read
    printf("%x %d\n", val, val);

1 more reply

rramadass1mo ago

This has got nothing to do with data races etc. but everything to do with "Sequence Points and Single Update Rule" which is well described in C language specification.

See my comment here - https://news.ycombinator.com/item?id=48205760

RobotToaster1mo ago

With volatile it could be changed by an interrupt service routine between reads, so it makes sense.

nomel1mo ago

Or, it could be hardware that has a "clear flag on read" type behavior.

drysine1mo ago

What's weird about it?

If you are using volatile you are reading from a device port mapped to that address.

Since C doesn't mandate in which order function arguments are evaluated, you don't know which argument will be read from port first.

How can that be anything but UB?

imtringued1mo ago

The lack of argument sequencing feels utterly petty however.

parasti1mo ago· 25 in thread

summa_tech1mo ago

defgeneric1mo ago

simonask1mo ago

Excuse me, what? I was writing both C and C++ 20 years ago, and UB was a huge part of the conversation (and the curriculum) back then as well.

parasti1mo ago

keyle1mo ago

Computers used to be cool; now they're dangerous.

Every company keeps harping on about safety and being exposed (being in the news): so the narrative against 'unsafe' is up the wazoo.

The new world is basically a bunch of city dwellers who haven't seen raw nature and you show them a lawn mower, they freak out. Blades that spin?!?!?! Madness!!

pjc501mo ago

If everything is going to be dependent on computers, it's probably important that they work and remain under their owner's control rather than whichever NK or Chinese hacker group gets to them first.

Can't talk about C without CVE.

1 more reply

Etheryte1mo ago

spacedcowboy1mo ago

Um, as an embedded developer, you don't develop the code to run on your machine, you develop it to run on the same target as you expect to deploy to, sitting on your desk next to you.

I have lots of my code running day-in, day-out on literally hundreds of millions of machines. The approach to "getting it working" is exactly OP's.

1 more reply

bregma1mo ago

    There are more things in heaven and earth, Horatio,
    Than are dreamt of in your philosophy

rramadass1mo ago

Because most of the people who post/write these articles do not actually know the C language specification nor understand its design.

I wrote about it here with links to further reading provided - https://news.ycombinator.com/item?id=48144734

Pannoniae1mo ago

Exactly, you write for your target, not some imaginary spec. The spec is only as useful as to predict what your target roughly does, it's not normative.

SomeoneOnTheWeb1mo ago

I have the opposite experience, so many subtle bugs that bite you only on specific scenarios, so much that I can't count.

sethev1mo ago

I wonder if it’s just the colorful metaphors and an opportunity to bring out examples of surprising behavior. Plus it’s a topic that can always stir up debates.

dminik1mo ago

If only it was that easy: https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...

AndriyKunitsyn1mo ago

So, you never iterated past an array, you never used after a free(), you never tried doing i = ++i + ++i; ?

aldanor1mo ago

If there's no UBs then what will we programmers do, there won't be enough to debug and fix?

benj1111mo ago

1. It's been talked about for much longer than that.

You may not even come across that failure mode to know to 'fix' it. And good luck finding the issue unless you know about UB and what the compiler can and will do in such situations.

jakobnissen1mo ago

benj1111mo ago

I don't think C is hostile. C has UB for good reason. The problem is UB has been hijacked by the compiler writers for performance gains.

hedora1mo ago

There was a similar rush of articles like this a few years ago.

tl;dr: C defines language semantics, and leaves some behavior undefined. Each system that C is ported to has the ability to define the behavior however it wants.

This blows the mind of PL folks every decade or so.

Yeah, getting old… I’ll go find a cloud to yell at.

account421mo ago

There are a lot of Rust/whatever hipsters here that have defined their whole identity around hating C and C++.

virtualritz1mo ago

Like the author of the article, I have been writing C/C++ for 30 years. Mostly close-to-the-metal code around computer graphics. Actually: wrote.

After switching to Rust five years ago I agree with all the Rust hipsters as far as disliking those languages go.

I just don't talk about it a lot. If every Rust person I know that was a C/C++ developer before was as outspoken about what they think of the latter, you'd see that these people are a majority.

We're just old hands who like to use stuff that works. And most of us don't get attached to code or languages.

Lately you can read about the dichotomy re. AI use.

I.e. developers who define themselves through what they build/their ideas are embracing LLMs for what they can do.

I.e.: I am what I build.

Whereas developers for whom software engineering is a craft that defines them hate them openly.

I.e.: I am how I build.

Also you can not dislike something and still not speak about it. Because you decided to not care.

pjmlp1mo ago

As C++ hipster since 1992, the problem is really C and any language that includes its semantics as subsets.

Just like TypeScript can't get rid of JavaScript WATs.

hnarn1mo ago

1 more reply

kzrdude1mo ago

I trust your historical C usage was more productive than that..

quelsolaar1mo ago· 17 in thread

The 5 stages of learning about UB in C:

-Denial: "I know what signed overflow does on my machine."

-Anger: "This compiler is trash! why doesn't it just do what I say!?"

-Bargaining: "I'm submitting this proposal to wg14 to fix C..."

-Depression: "Can you rely on C code for anything?"

-Acceptance: "Just dont write UB."

matheusmoreira1mo ago

What stage is the "just make the compiler define the undefined" stage?

quelsolaar1mo ago

A better way to think about UB is as a contract between developer and implementation, so that the implementations can more easily reason about the code. How would you optimize:

(x * 2) / 2
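
To make the contrast concrete, here is a small sketch with made-up function names. For `unsigned`, multiplication wraps by definition, so the expression cannot in general be folded to `x`. For `int`, overflow is undefined, so a compiler is permitted to assume it never happens and reduce the expression to `x`:

```c
#include <limits.h>

/* Unsigned arithmetic wraps modulo UINT_MAX + 1, so for large x the
 * result differs from x (for x == UINT_MAX it is UINT_MAX / 2). */
static unsigned half_of_double_unsigned(unsigned x)
{
    return (x * 2u) / 2u;
}

/* Signed overflow is undefined, so a compiler may simplify this to
 * plain x. Callers must keep x * 2 within the range of int. */
static int half_of_double_signed(int x)
{
    return (x * 2) / 2;
}
```

Whether a given compiler actually performs the signed simplification depends on the compiler and its flags.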

1 more reply

gpderetta1mo ago

> Unaligned access? Packed structs.

1 more reply

Georgelemental1mo ago

> Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so.

It does say so, actually, since C99 TC3 (DR 283).

1 more reply

lelanthran1mo ago

> What stage is the "just make the compiler define the undefined" stage?

It can be left as implementation defined, which means that the compiler can't simply do arbitrary things, it needs to document what it would do.

Same here; I built a few non-trivial things that passed the first attempt at tooling (valgrind, UBsan with tests, fuzzing, etc) with no UB issues found.

1 more reply

pjmlp1mo ago

Lost in the submission attempt to WG14.

thomashabets21mo ago

Author here.

> -Acceptance: "Just dont write UB."

The point of my article is that this is not possible. This cannot be our end state, as long as humans are the ones writing the code. No human can avoid writing UB in C/C++.

jart1mo ago

2 more replies

frollogaston1mo ago

"Just don't write UB" sounds like still part of the bargaining stage at best

im3w1l1mo ago

In C, acceptance is "I will write UB and it will eventually lead to something bad happening"

superxpro121mo ago

Just work on embedded devices like I did lol. It's so nice to write software targeting a specific cpu.

Ygg21mo ago

> -Acceptance: "Just dont write UB."

Just switch to a saner language.

And before I get attacked for being a Rust shill, I meant Java :P

The bar is so low it's floating near the center of the Earth.

dns_snek1mo ago

> And before I get attacked for being a Rust shill, I meant Java :P

If all you want is C but less insane then the obvious answer here is Zig.

4 more replies

p2detar1mo ago

> Just switch to a saner language.

And where's the fun in that?

1 more reply

ErroneousBosh1mo ago

Okay, so Java compiles to machine code now?

Because the last time I looked it appeared to need some godawful slow bytecode interpreter that took up thousands of kilobytes of RAM.

4 more replies

17186274401mo ago

> -Denial: "I know what signed overflow does on my machine."

whizzter1mo ago

1 more reply

beeforpork1mo ago· 14 in thread

thomashabets21mo ago

Author here.

> an unaligned pointer in itself is UB

Yup. Per the "Actually, it was UB even before that" section in the post.

> UB is not on the HW, it has nothing to do with crashes or faults

Yeah. I tried to convey this too, but I'm also addressing the people who say "but it's demonstrably fine", by giving examples. Because it's not.

account421mo ago

Which is totally fine and expected for any decent programmer. Casting pointers is clearly here be dragons territory.

simonask1mo ago

It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.

Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.
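
The standard-conforming way to express "just load 4 bytes from this address" is usually a `memcpy` into a properly typed object. A minimal sketch, with a made-up helper name `load_u32`:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a uint32_t from any address, aligned or
 * not, in the machine's native byte order. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    /* A byte-wise copy places no alignment requirement on p. */
    memcpy(&v, p, sizeof v);
    return v;
}
```

On targets that allow unaligned loads, optimizing compilers typically compile this down to a single load instruction, though nothing in the standard requires that.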

6 more replies

array_key_first1mo ago

2019841mo ago

>an unaligned pointer in itself is UB, not only an access to it.

Can someone point to where the standard states this?

1 more reply

imtringued1mo ago

This type of UB is fine and nobody really complains about hardware differences leading to bugs.

jcranmer1mo ago

stilley21mo ago

Does that mean that if I have a struct with #pragma pack(push, 1) I can't use pointers to any members that don't happen to be aligned?

saagarjha1mo ago

This is a non-standard extension, so your compiler may provide stronger guarantees.

1 more reply

tovej1mo ago

But that seems obvious. You can't load an integer from an unaligned address.

It's not only C-level is it. There's no (guarantee across architectures for) machine code for that either.

codeflo1mo ago

> You can't load an integer from an unaligned address.

1 more reply

matheusmoreira1mo ago

Sure you can. In many architectures it works just fine. Works perfectly in x86_64, for example. It's just a little slower.

1 more reply

mbel1mo ago

Unless your code targets some exotic architecture, like idk x86.

1 more reply

pjc501mo ago

You missed the point: the pointer existing as a value of that type at all is UB, even if you never try to access anything through it and no corresponding machine code is ever emitted.

1 more reply

bestouff1mo ago· 13 in thread

inkysigma1mo ago

marcosdumay1mo ago

There is that famous example where when you write an infinite loop last thing in your main, a function that you never called runs instead.

account421mo ago

Infinite loop without side effects == program stuck and not responding on user input and not outputting anything. That's not something a useful program will ever want to do.

17186274401mo ago

That's only true in C++ though, not in C.

eru1mo ago

Yes, a crash is about the most benign UB: at least it's highly visible.

In worse scenarios, your programme will silently continue with garbage, or format your hard disk or give attackers the key to the kingdom.

17186274401mo ago

rando12341mo ago

account421mo ago

That's a feature, not a problem.

anilakar1mo ago

Another commenter suggested using LLMs, but I disagree. Having clangd emit warning squiggles for unchecked operations (like signed addition) would be a good start.

flohofwoe1mo ago

amoss1mo ago

4gotunameagain1mo ago

gpderetta1mo ago

Consider:

   enum op_t{ add, mul };
   int exec(op_t op, int a, int b) {
       if(op == add) { return a+b; }
       if(op == mul) { return a*b; }
   }

   c = exec(add, a,b);

Should the compiler be prevented from inlining exec, constant-propagating op, and removing the mul branch? What about if a and b are constants and the addition itself is optimized away?

greysphere1mo ago· 9 in thread

guerby1mo ago

Ada 83 has no UB on call stack overflow. From the reference manual:

http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html

"STORAGE_ERROR This exception is raised in any of the following situations: (...) or during the execution of a subprogram call, if storage is not sufficient."

veltas1mo ago

So it's just as useful as when your stack area ends with a page that will segfault on access, or your CPU will raise an interrupt if stack pointer goes beyond a particular address?

eru1mo ago

That's not true at all.

stevenhuang1mo ago

The examples are unequivocally UB. Full stop.

You end up building a foundation on quicksand. That's the danger of UB.

flohofwoe1mo ago

> The examples are unequivocally UB. Full stop.

greysphere1mo ago

Pointers have pros and cons. UB has pros and cons. Let's try to educate people about them.

pjc501mo ago

UB based on input can be an exploit vector.

layer81mo ago

Unvalidated input can always be an exploit vector.

account421mo ago

Yes, this article is pretty much the definition of FUD.

stackghost1mo ago· 9 in thread

Anyone who uses the construction "C/C++" doesn't write modern C++, and probably isn't very familiar with the recent revisions despite TFA's claims of writing it every day for decades.

Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.

No HN comment can possibly cover all the use cases of C++ but in general, unless you have a very good reason not to:

- eschewing boomer loops in favor of ranges

- using RAII with smart pointers

- move semantics

- using STL containers instead of raw arrays

- borrowing using spans and string views

veltas1mo ago

stackghost1mo ago

Well, the author explicitly refers to "C/C++" as one language:

>After all, C/C++ is not a memory safe language.

thomashabets21mo ago

Author here.

In the context of UB discussion, the arguments apply equally to C and C++.

How would you write that?

I entirely agree with all your points that C and C++ are completely different languages at this point. And yet I wanted to write this post about something that is true for both.

rectang1mo ago

> the upshot is you never need to deal with the Rust community

In the end, everything comes down to culture war.

stackghost1mo ago

Perhaps we should rewrite our culture in Rust.

jim334421mo ago

You can write C++ in a way that's similar to C if you want and run into some of the same UB. Normally I don't like the "C/C++" thing, but in this context it makes sense.

SpaceNugget1mo ago

I totally agree that modern C++ is pretty robust if you are both a well-seasoned developer and only stick to a very blessed subset of its features and avoid the historical baggage.

Also, safe != non-UB. TFA isn't so much about memory safety anyway.

m-schuetz1mo ago

flohofwoe1mo ago

"C/C++" is still a useful term for the common C/C++ subset :)

Also:

- eschewing boomer loops in favor of ranges

Those "boomer loops" compile infinitely faster than the new ranges stuff (and they are arguably more readable too): https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/

- borrowing using spans and string views

Those are just as unsafe as raw pointers. It's not really "borrowing" when the referenced data can disappear while the "borrow" is active.

dmitrygr1mo ago· 8 in thread

I stopped reading about here:

    > bool parse_packet(const uint8_t* bytes) {
    >   const int* magic_intp = (const int*)bytes;   // UB!

raphlinus1mo ago

Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.

flohofwoe1mo ago

> Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.

gritzko1mo ago

dmitrygr1mo ago

thomashabets21mo ago

Author here.

> A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

C23 6.3.2.3p7.
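For illustration, here is one way to read the magic number without ever forming a `const int*` from the byte buffer. It is a sketch, not the article's code. `has_magic`, `PACKET_MAGIC`, and the explicit length parameter are made up here, and the value is compared in native byte order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PACKET_MAGIC 0x12345678u  /* hypothetical magic value */

/* Copies the first four bytes into a properly aligned local instead of
   casting `bytes` to a wider pointer type, so alignment never matters. */
bool has_magic(const uint8_t *bytes, size_t len)
{
    uint32_t magic;
    if (len < sizeof magic)
        return false;            /* too short to contain the magic */
    memcpy(&magic, bytes, sizeof magic);
    return magic == PACKET_MAGIC;
}
```

Mainstream compilers typically turn the fixed-size `memcpy` into a single load on targets that allow unaligned access, so this usually costs nothing compared with the cast.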

stevenhuang1mo ago

Byte and int have different alignment requirements. It is UB the moment you make such a ptr.

Great way to demonstrate the point of the article.

gritzko1mo ago

That better be marked "historical". At least, Lemire says:

(while in the olden days, a program may crash on unaligned access, esp on RISC)

dmitrygr1mo ago

Without memcpy there is no guarantee that that line produces an invalid pointer

I don’t see what spec part would prohibit that cast from validly compiling to

   BIC r3, r0, #3

Spec only guaranteed round-trip through char* of properly aligned for type pointers. This doesn’t break that.

__0x011mo ago· 7 in thread

> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.

The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

LLM generated code will eventually contain UB.

EDIT: added "eventually"

flohofwoe1mo ago

https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1

jcranmer1mo ago

The C committee is cleaning up a lot of UB (check https://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_lo... for paper titles like "slaying earthly demons").

But don't misunderstand the goal of that: C and C++ will never get rid of UB. The result of dereferencing an invalid pointer is UB, will always remain UB, and really cannot be anything other than UB.

layer81mo ago

The easy cases like you cite are also those that don’t cause problems in practice. I’m not sure that would help all that much, other than to slightly reduce internet criticism.

thomashabets21mo ago

Author here.

> The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

Yup. But the point of the article is that even expert humans cannot do this alone. And as I wrote, LLM+junior won't suffice either. We need LLM+senior experts.

And it's a problem that we have way more existing UB than expert capacity.

I didn't even use the best model, complex code target, or time. I just wanted to choose a target that has a high chance of having very good experts already having audited it.

eru1mo ago

lelanthran1mo ago

> LLM generated code will eventually contain UB.

Yes.

Even in languages other than C (i.e. you will get behaviour that nothing in the input specified).

When LLMs generate code, all languages have UB.

eru1mo ago

That's a bit silly.

UB means literally no restrictions. So if your standard says 'you have to crash with an error message' that's already no longer UB.

amiga3861mo ago· 7 in thread

Can anyone explain why this is undefined behaviour? UBSan calls it "indirect call of a function through a function pointer of the wrong type"

    struct foo {int i;};
    int func(struct foo *x) {return x->i;}
    int main() {
        int (*funcptr)(void*) = (int (*)(void*)) &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }

While this is all kosher per the language lawyers:

    struct foo {int i;};
    int func(void *x) {return ((struct foo *)x)->i;}
    int main() {
        int (*funcptr)(void*) = &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }

jcranmer1mo ago

C23 §6.5.2.2p7

> If the function is defined with a type that is not compatible with the type (of the expression) pointed to by the expression that denotes the called function, the behavior is undefined.

amiga3861mo ago

I get that it's defined that way, but I'd really like to know why.

Is there any work involved in casting void* to struct* (on any architecture) that a plain function pointer would miss out?

wavemode1mo ago

It's undefined behavior due to the "strict aliasing" rule. You're simply not allowed to cast one pointer type to another (ever!) except for the following exceptions:

- casting an object pointer to or from void*

- casting an object pointer to or from char*

jcranmer1mo ago

Sorry, this explanation is plain wrong.

j16sdiz1mo ago

Whether two function pointer types are compatible in practice depends on the machine-specific calling convention.

I guess enumerating all the possibilities just doesn't look right? It would make the standard too long and complex?

tomp1mo ago

Casting to a pointer of incompatible type is UB. The exception is casting to char*.

amiga3861mo ago

Tell me why struct* is incompatible with void* when it's such a standard case in C that you don't need a cast:

    struct foo *x = malloc(sizeof(struct foo)); /* malloc returns void* */

Or rather, tell me why the C11 standards committee decided to declare that struct* is incompatible with a void*

veltas1mo ago· 6 in thread

From the ANSI C standard:

  3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements.  Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).

dataflow1mo ago

> but that the consequence of this should be somewhat bounded or as expected for the target machine.

Aren't "unpredictable results" and "no requirements" contrary to the idea that the behavior would be "somewhat bounded"?

veltas1mo ago

thomashabets21mo ago

Author here.

I touched on this in the "it's not about optimizations" section. It's not that the compiler is out to get you. It's that you told it to do something it cannot express.

veltas1mo ago

17186274401mo ago

The behaviour is bounded by the capability of your machine. It is unlikely that your desktop computer launches a nuclear missile, unless you worked for it to be able to do that.

lelanthran1mo ago

> Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph?

I've (fruitlessly) had this discussion on HN before - super-aggressive optimisations for diminishing rewards are the norm in modern compilers.

debugnik1mo ago· 4 in thread

As much as I agree with the intro, these examples aren't good and the overall article is just a veil for pushing LLM coding.

gblargg1mo ago

boxed1mo ago

Not good how? Are they TRUE? If so that's super bad.

IshKebab1mo ago

HelloNurse1mo ago

Some of the examples are somewhat formally true in theory and bullshit in practice; some are quite hallucinatory.

  - Creating a potentially troublesome misaligned int pointer is a precisely localized and completely explicit user mistake, not something that just happens because it's C.
  - Passing signed char to character classification functions that expect an unsigned char (disguised as an int) is a very specific dumb user error. The C standard could specify that all negative inputs, including EOF and invalid signed char values, are classified as not belonging to the character class, but I doubt the current undefined behaviour in isxdigit() etc. implementations ever went beyond accepting invalid inputs.
  - Casting floating point values to integer values in general requires taking care of whether the FP values are small enough to be represented and what to do with NaN and Inf values: not the language's responsibility. C offers a toolbox of tests, not ready-made application specific error handling.
  - Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is. Where stuff in an executable is loaded in memory, in the rare cases when it matters, can surely be affected with platform specific extensions, possibly at the level of linker commands with nothing appearing in the C source code.
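The character-classification bullet above comes down to one cast at the call site. A minimal sketch follows. `count_hex_digits` is a made-up helper and the default "C" locale is assumed.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts hexadecimal digits in a NUL-terminated string.
   The cast matters where plain char is signed: a byte such as 0xFF
   would otherwise reach isxdigit() as a negative int other than EOF. */
size_t count_hex_digits(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isxdigit((unsigned char)*s))
            n++;
    }
    return n;
}
```

With the cast, every argument is either representable as `unsigned char` or never passed at all, which is the range the `<ctype.h>` functions are specified for.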

weinzierl1mo ago· 4 in thread

A fun one that'd fit the list would be sequence point violations like

    i = i++

radiospiel1mo ago

Fun, sure, but also GCC and Clang will both warn with -Wall (-Wsequence-point / -Wunsequenced).

account421mo ago

This would also be a code smell even if it was well defined.

marcosdumay1mo ago

Yes, it should be an explicit error. Not undefined.

leni5361mo ago

Only in C, that one is defined in C++.

edit: I'm not sure it's even undefined in C.

akiarie1mo ago· 4 in thread

C is still, by far, the simplest language that we have.

People complain about C as though they know how to fix it.

simonask1mo ago

C is not a simple language in the sense that writing software in C is simple, and I think that's the only useful way to understand the word "simple" in this context.

Brainfuck is "simple" by any other definition as well, but that's not a useful quality.

spacedcowboy1mo ago

That doesn't mean that C is a safer language than Swift, or a less-capable language than Swift. But in terms of "easy to understand along the happy-path", it's a lot easier to get going in C.

Brainfuck is not at all simple, from that point of view. This is a valid Brainfuck program:

>+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.[-]>++++++++[<++++>-]<. >+++++++++++[<+++++>-]<.>++++++++[<+++>-]<.+++.------.--------.[-]>++++++++[<++++ >-]<+.[-]++++++++++.

This is the equivalent C program

    #include <stdio.h>
    int main() { printf("Hello world!\n"); }

One of these is far simpler than the other.

[edit: changed to make the examples do the same thing]

jeroenhd1mo ago

C sits right in the middle between assembly and BASIC in terms of simplicity. You can't do a simple popcnt, but you can implement jump tables.

It's slower than Fortran and, depending on the platform, COBOL. It's a bigger minefield than any language that came after it, barring C++.

The only real advantage I can ascribe to C is that it's actually still being used after all these years, and it mostly works similarly on most hardware, like a Java for people who enjoy the casino.

There is no fixing C without changing what C really is.

dns_snek1mo ago

Can you elaborate on what you think C has in terms of simplicity that Zig doesn't, and which "same kinds of issues" you think it has?

I'm not an expert in either language but my anecdotal experience disagrees with this - writing Zig has been far simpler and less error-prone than writing C.

elnatro1mo ago· 4 in thread

Is there a way to avoid undefined behavior in C then? Could we write a new C compiler that adds some checks and fixes (e.g. raise documented exceptions) to each undefined behavior?

u1hcw9nx1mo ago

Good open source ones:

Frama-C

IKOS (from NASA)

elnatro1mo ago

It’s been a while since I programmed in C. Thank you for these resources.

saagarjha1mo ago

Not all of them but there are many tools that can try to define behavior for this code to help shake them out of your codebase.

peterfirefly1mo ago

ubsan.

Doesn't catch all of it.

maple31421mo ago· 3 in thread

simonask1mo ago

Intuitively yes - the program will be compiled as if B-inputs are never passed to the program, and that can include eliminating code that tries to detect B-inputs.

mbrock1mo ago

17186274401mo ago

Yes, that's a good summary.

raluk1mo ago· 3 in thread

In C / C++ there are two kinds of undefined behaviour. One is where the standard explicitly says the behaviour is undefined. The other is everything else that the standard doesn't mention.

thaumasiotes1mo ago

Technically, that's only one kind, because it's written in the standard that anything not mentioned in the standard is undefined behavior.

cepepe1mo ago

One kind, but two different classes of undefined behaviour.

wiseowise1mo ago

https://en.wikipedia.org/wiki/There_are_unknown_unknowns

jraph1mo ago· 3 in thread

Yet another push to use LLMs after casting fear. Now it should be illegal not to use LLMs. A good start of the day.

(I hope casting fear is not UB)

raverbashing1mo ago

> (I hope casting fear is not UB)

I'm sure that's UB in C

In C++ just use <reinterpret_cast>

wg01mo ago

The irony is unmistakable.

stevenhuang1mo ago

There is nothing ironic in letting an llm have a pass at identifying potential UB and other correctness issues in C code.

I say this as an experienced C developer.

jb19911mo ago· 2 in thread

debugnik1mo ago

I was about to call out that the code is supposed to be C and not C++, but I double checked and I realised it actually says std::atomic<int>, not atomic_int!

jb19911mo ago

Most C++ today will be immediately obvious and not accidentally mixed up with C.

pizlonator1mo ago· 2 in thread

The problem is incorrectly assuming that the spec is meaningful in some kind of rigorous way.

It’s not. All that matters is what C compilers actually do and what real C programs expect.

This is a good thing. It creates a culture where the two sides meet each other where they’re at

BearOso1mo ago

pizlonator1mo ago

This is true.

But what I’m saying has always been true. What has changed is that the effective portability of C and C++ code has increased due to the reduction in the number of compilers and arches

mjs011mo ago· 2 in thread

Integer promotion seems to be the source of many signed integer overflow UB. Why does C have it? Does integer promotion ever have a good part?
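A small sketch of how promotion turns unsigned-looking arithmetic into signed overflow, assuming a 32-bit `int`. `mul_u16` is an illustrative name. In `a * b` both `uint16_t` operands are promoted to `int`, so `0xFFFF * 0xFFFF` overflows `int` and is UB. Converting one operand first keeps the multiplication unsigned.

```c
#include <stdint.h>

/* Multiplies two 16-bit values without going through signed int. */
uint32_t mul_u16(uint16_t a, uint16_t b)
{
    /* Writing `a * b` here would multiply two promoted ints. */
    return (uint32_t)a * b;
}
```

The result then wraps modulo 2^32 in the defined way unsigned arithmetic always does, although no product of two 16-bit values actually needs to wrap.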

saagarjha1mo ago

Yes, it simplifies a lot of code that would otherwise be littered with casts.

peterfirefly1mo ago

Could be fixed by having a nicer casting syntax (like Rust) or by not having so damn many scalar types that are used in practice.

"Explicit casts only" worked fine in Modula-2, which doesn't have as many scalar types.

codeflo1mo ago· 2 in thread

> The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions.

thomashabets21mo ago

Author here

> The part about hardware is wrong BTW

> the hardware semantics are clearly defined

I'm not saying the CPU has UB, and I wonder what part made you think I did.

codeflo1mo ago

fjfaase1mo ago· 2 in thread

Is comparing a signed integer with an unsigned integer UB? I recently wrote some code and compiled it with gcc to x86_64 (without optimization) that returned an incorrect answer.

Karliss1mo ago

No UB, but the integer promotions rules apply.

When comparing signed and unsigned integers of the same size, the signed one will be converted to unsigned. In a reasonably configured project the compiler will warn about it.

In case of integers smaller than int, promotion to int happens first.

In case of signed and unsigned integers of different size, the smaller one will be converted to bigger one.
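The same-size rule above is the one that usually produces the "incorrect answer". A hedged sketch with two made-up helpers: the first spells out the conversion that `i < u` performs implicitly, the second compares by value.

```c
#include <stdbool.h>

/* What `i < u` effectively does: i is converted to unsigned first,
   so -1 becomes UINT_MAX and compares as larger than any small u. */
bool converted_less(int i, unsigned u)
{
    return (unsigned)i < u;
}

/* Value-correct comparison: any negative i is less than every unsigned u. */
bool value_less(int i, unsigned u)
{
    return i < 0 || (unsigned)i < u;
}
```

No UB is involved in either version. The surprise is purely the defined conversion.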

benchloftbrunch1mo ago

It's not UB. Integer promotion applies, the signed int is implicitly coerced to unsigned (or the other way around - don't remember which.)

my-next-account1mo ago· 2 in thread

Hello, it's me. I'm not afraid of UB.

my-next-account1mo ago

To be honest, miscompilations because of UB is exceedingly rare, and we do a lot of weird shit in our code.

saagarjha1mo ago

You should be!

mbrock1mo ago· 2 in thread

most languages don't even HAVE a specification so in most languages literally EVERYTHING is undefined behavior

oersted1mo ago

mbrock1mo ago

momo261mo ago· 2 in thread

Debugging in C is soooo hard. When I was writing Malloc Lab in my systems course, there were countless undefined-behaviour and out-of-range errors :(

flohofwoe1mo ago

Yet, debugging memory corruption issues in C and C++ code with modern compiler toolchains and memory debugging tools is infinitely easier than 25 years ago.

(e.g. just compiling with address sanitizer and using static analyzers catch pretty much all of the 'trivial' memory corruption issues).

feelamee1mo ago

I think vice versa - C is so simple that debugging it is just a pleasant walk.

Especially compared with modern languages with lambdas/exceptions/virtual functions and so on.

The one thing I see can make it harder is function pointers.

nullpwr1mo ago· 2 in thread

Excellent post. But it's addressed to the wrong people.

The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.

Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.

HarHarVeryFunny1mo ago

I'd say the unaligned pointer one is the language's fault. The language should not let you create an invalid pointer, or at least warn you when you are doing so.

nullpwr1mo ago

> The language should not let you create an invalid pointer, or at least warn you when you are doing so

completely agree!

rom1v1mo ago· 1 in thread

A concrete example of undefined behavior caused by an unaligned pointer: https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on...

gblargg1mo ago

Specifically on x86 where it's assumed that won't cause problems.

rurban1mo ago· 1 in thread

Very bad advice. Of course good new LLMs know about UB, but you still need to use ubsan (i.e. -fsanitize=undefined), and not your LLM.
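As a rough illustration of what that flag is for, take a function that looks harmless in review. `next_id` is a made-up helper. Built with GCC or Clang using `-fsanitize=undefined`, a call with `INT_MAX` should be reported at run time as a signed integer overflow instead of silently doing whatever the optimizer decided.

```c
#include <limits.h>

/* Returns the next identifier. Well-defined for every id except INT_MAX,
   where id + 1 overflows a signed int, which is undefined behaviour. */
int next_id(int id)
{
    return id + 1;
}
```

The sanitizer only sees inputs that actually occur during a run, so it complements review and static analysis and does not replace them.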

formerly_proven1mo ago

Coding agents write unsound Rust any day, too. unsafe impl Send … is much easier than fixing a bad design and it might even work momentarily.

lelanthran1mo ago· 1 in thread

I read through this in detail... Is it just me, or are these things that are invoked by intentionally bypassing the typing?

I mean, you have to go out of your way and use a cast to get the UB in the first example.

For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.

> For all you know the compiler has no internal way to even express your intention here.

I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?

> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.

I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
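On the float -> int point, the rounding functions in math.h still return floating-point values, so the range question remains whichever rounding you pick. A sketch of a checked conversion, assuming a two's-complement `int` and IEEE-style `float`. `float_to_int` is a hypothetical helper.

```c
#include <limits.h>
#include <math.h>
#include <stdbool.h>

/* Truncates f toward zero into *out, or returns false when the result
   would not fit in an int (the case where a plain cast is UB). */
bool float_to_int(float f, int *out)
{
    if (isnan(f))
        return false;
    /* (float)INT_MIN is exact (a power of two); its negation is INT_MAX + 1.
       Infinities fail one of these two comparisons. */
    if (f < (float)INT_MIN || f >= -(float)INT_MIN)
        return false;
    *out = (int)f;
    return true;
}
```

Comparing against `(float)INT_MAX` directly would be subtly wrong, because that value rounds up to 2^31 when `int` is 32 bits.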

account421mo ago

commandlinefan1mo ago· 1 in thread

jeroenhd1mo ago

In the world of microcontrollers, CHAR_BIT can be 16 or some other funky number. char is usually 8 bits in size, though.

danborn261mo ago· 1 in thread

The scariest part is how many production systems rely on undefined behavior without anyone knowing until a compiler update breaks everything.

coolThingsFirst1mo ago

Where does scary part come from, they run on planes?

tomcam1mo ago· 1 in thread

I fear I will be downvoted into oblivion but I also want to learn from this.

Given the above predicate: Isn’t everything described in the article as it should be?

Add too much to the language and it becomes less possible to implement on new architectures, right? Because the undefined behavior lets implementors stand up new compilers fairly quickly.

For less undefined behavior isn’t it better to use languages that have that in their DNA? D, Zig, Go, Java, etc?

vladms1mo ago

> Given the above predicate: Isn’t everything described in the article as it should be?

I think the real trick question is "as it should be for whom?".

Reading the comments I think people underestimate the complex interaction between:

- engineers that design hardware (they don't care much about the compiler, except when it has to fix their mistakes)

- engineers that do the compiler (they have to struggle with all quirks of the new architecture and all of the complaints of the users)

- users working on one architecture all their lives

For the compiler people, yes, probably most what is described is as it should be. For the users (that care about performance and not making porting efforts), probably no.

I am sure there are things that can be improved, and standards evolve. But the problem is very complex given the sheer amount of code written and the strange architectures out there.

kajaktum1mo ago· 1 in thread

I want a language that is a group of bit (0,1) and the xor operator. Everything else is built on top of that.

benchloftbrunch25d ago

Xor isn't Turing complete sadly.

cracki1mo ago· 1 in thread

We know. This is not news.

boxed1mo ago

It seems to be to many many programmers who keep using C++

NooneAtAll31mo ago· 1 in thread

feels like https://xkcd.com/1499/

the only people complaining about being able to do awful things are people that do awful things

gritzko1mo ago

- a metal bar always sinks

- unless you are trying to sink it in mercury. then it floats

- unless it is a uranium bar

- go sink uranium bars in mercury yourself

nokeya1mo ago· 1 in thread

Ok, and?

wg01mo ago

"Rewrite everything in Rust. OMG universe is written in Rust so memory safe with zero allocations"

JonChesterfield1mo ago

Well, you can't write malloc in conforming C, which hurts rather more than remembering to write bitcast as memcpy on char pointers.

Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.
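"Bitcast as memcpy" from the comment above looks roughly like this. `float_bits` is an illustrative name, and the sketch assumes `float` is a 32-bit IEEE 754 type, which the static assertion only partly checks.

```c
#include <stdint.h>
#include <string.h>

/* Returns the object representation of f as a 32-bit integer.
   Copying bytes sidesteps the aliasing rules that *(uint32_t *)&f breaks. */
uint32_t float_bits(float f)
{
    uint32_t u;
    _Static_assert(sizeof u == sizeof f, "float assumed to be 32 bits wide");
    memcpy(&u, &f, sizeof u);
    return u;
}
```

On an IEEE 754 target `float_bits(1.0f)` is `0x3F800000`.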

sltr1mo ago

For a deep dive on UB with printf, see https://srs.fyi/see-conversions/

> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).

bkallus1mo ago

psim11mo ago

casey21mo ago

As for UB, the compiler has the final say. Nobody should write nontrivial c without understanding their compiler, the same as nobody should write c without understanding their text editor.

Code in other languages breaks between versions, in c there are projects with code from every version at once!

Looking at it another way, work put into a c compiler enables you to write nontrivial code.

hunterpayne1mo ago

What all these C programmers are pointing out is twofold:

- Making a Turing machine have deterministic and predictable results is hard.

- Modern hardware is complex and getting all hardware to behave the same way requires a strong mathematical abstraction.

Right tool for the right job.

keyle1mo ago

When talking UB, putting C and C++ in the same basket is basically like comparing drunk driving a car and riding a bicycle sober... Both means of transport, very different experience.

wyldfire1mo ago

Maybe we should criminalize writing articles about Undefined Behavior that have a "So what do we do now?" subheader but omit any mention of UBSan.

bvrmn1mo ago

1vuio0pswjnm71mo ago

"My point is that ALL nontrivial C and C++ code has UB."

Is "nontrivial" defined

How would one identify "nontrivial" C code

Is there an objective measure (defined)

Or is it a matter of personal opinion that could vary from person to person (undefined)

stackedinserter1mo ago

How can it be valid implementation of isxdigit?

```
int isxdigit(int c) {
    if (c == EOF) { return false; }
    return some_array[c];
}
```

If you write code like this, then everything in programming is UB.

benj1111mo ago

The issue for me with posts like this is that it misses the issue.

Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.

Instead it's been hijacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."

    int c = abs(a) + abs(b);
    if (a > c) // overflow

Is UB because some system might do overflow differently. In practice every system wraps around.

That should be a valid check, instead it gets optimised away because it 'can't' happen.

C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
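For what it's worth, the check can be written so that it runs before the addition and never overflows itself, which leaves the optimizer nothing to remove. A minimal sketch with a made-up helper name:

```c
#include <limits.h>
#include <stdbool.h>

/* Reports whether a + b would overflow int, without computing a + b. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow for b > 0 */
    if (b < 0)
        return a < INT_MIN - b;   /* INT_MIN - b cannot overflow for b < 0 */
    return false;
}
```

GCC and Clang also provide `__builtin_add_overflow`, and C23 adds `ckd_add` in `<stdckdint.h>`, both of which perform the addition and report overflow in one step.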

QuiEgo1mo ago

C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.

C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.

saltyoldman1mo ago

Probably not "everything", but the vast, vast, vast majority of everything you are looking at on your screen right now is written in C.

y421mo ago

shameless plug, it's part of the Nerd Encyclopedia: it's also called "nasal demons".

https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...

SanjayMehta1mo ago

I used to teach C programming and one time I got anonymous feedback: "when this instructor doesn't know the answer he says "it's compiler dependent.""

Shrug.

up2isomorphism1mo ago

U just need to read the title and 5 lines to know this must be a rust guy.

justmarc1mo ago

The art is actually making sure it all stays defined behavior

fithisux1mo ago

UB can also have impact in logical cohesion of codebase.

alper1mo ago

Isn't the article mostly saying that SPARC sucks?

DostLeFan1mo ago

Very interesting article. I'm in love with C++, and I cannot say that I'm a good developer, but interesting to discover where UB can be. (Sorry I'm not a good english speaker)

0x20cowboy1mo ago

Life is undefined behaviour.

el_pollo_diablo1mo ago

> probably meaning on an address that’s a multiple of sizeof(int), but who knows

Sigh. s/sizeof(int)/_Alignof(int)/.

> But let’s say you have a modern machine, where NULL is a pointer to address zero, and you actually have an object there.

> So how do you print an uid_t?

    if ((uid_t)-1 < (uid_t)0) {
        // uid_t is signed
        printf("%" PRIdMAX, (intmax_t)id);
    } else {
        // uid_t is unsigned
        printf("%" PRIuMAX, (uintmax_t)id);
    }

> It’s not rare for the denominator to come from untrusted input.

It's not rare for the array index to come from untrusted input.

It's not rare for the supposedly valid UTF-8 string to come from untrusted input.

...

    > unsigned char a = 0xff;
    > unsigned char b = 1;
    > unsigned char zero = 0;
    > bool overflowed = (a + b) == zero;
    >
    > unsigned char a = 0x80;
    > uint64_t b = a << 24;

Here:

    unsigned char a = 0xff;
    unsigned char b = 1;
    unsigned char zero = 0;
    bool overflowed = (unsigned char)(a + b) == zero;

    unsigned char a = 0x80;
    uint64_t b = (uint32_t)a << 24;

groby_b1mo ago

"not correctly aligned (probably meaning on an address that’s a multiple of sizeof(int), but who knows)"

I stopped reading there. If you have decades of experience in C/C++ and don't know what that means (and that it's arch specific), I'll assume those decades were mostly the same year over and over.

C/C++ are horrible languages, but they deserve better opponents than that.

pphysch1mo ago

It's also worth highlighting that C is perhaps the most officially standardized programming language in history.

feelamee1mo ago

> We need some way of fixing UB at scale, without committing AI slop nor overwhelming human reviewers.

Write compiler which will define all this behavior. Usually people forget that UB exists only in standard. In practice it is always defined.

P.S. of course, only while your hardware + firmware stay unchanged

P.S. not always defined in documentation - I mean defined in e.g. code

synergy201mo ago

If C is more UB-unsafe than it seems, what is the solution here?

EGreg1mo ago

a good case can be made that use of C++ is a SOX violation

So Linus was right? But for a second reason too:

That is, accepting C++ code from programmers who use C++ could be a SOX violation ;-)

VimEscapeArtist1mo ago

Wait until he discovers PowerShell ;D

Webhix1mo ago

maybe rewrite this in go?)

reinhash1mo ago

Rust.

grougnax1mo ago

Use Rust!

liamd19881mo ago

When using C, keep using char* and don't mess with int*

ricardobeat1mo ago

I’ve been heavily invested in https://c3-lang.org/ the past couple months. How does it look from this perspective to someone with C experience?

bullen1mo ago

Everything in Java is defined behaviour, you need a VM with GC to remain sane.

Everything else is a waste of time!

logicchains1mo ago
