I don't stick with C because I love it. If I'm writing something for my own purposes, I use Ruby. I've written some server code in Golang (non-production), and it's pretty nifty, even if the way it twists normal C syntax breaks my brain. I even dabble in the dark side (C++) personally and professionally from time to time. And in a previous life, I was reasonably proficient in C# (that's the CLR 2.0 timeframe; I'm completely useless at it in these crazy days of LINQ and the really nifty CLR 4 features...and there's probably even more stuff I haven't even become aware of).
But none of those languages would let me do what I need to do: zero-copy writes from the network driver through to the RAID backend. And even if they did, the pain of rewriting our entire operating system in Go or Rust or whatever would be way more than the alleviated pain of using a "nicer" language.
(We never use 'int', by the way. We use the C99 well-defined types in stdint.h. Could this value go greater than a uint32_t can represent? Make it a uint64_t. Does it need to be signed? No? Make sure it's unsigned. A lot of what he's complaining about is sloppy code. I don't care if your compiler isn't efficient when compiling sloppy code.)
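A minimal sketch of the "could this go past a uint32_t? make it a uint64_t" rule, assuming a hypothetical helper that converts a block count into a byte count (the function names are illustrative, not from any real driver):

```c
#include <stdint.h>

/* Hypothetical helper: bytes covered by `blocks` blocks of `block_size` bytes. */
uint64_t bytes_for_blocks(uint32_t blocks, uint32_t block_size)
{
    /* Widen one operand first so the multiplication is done in 64 bits. */
    return (uint64_t)blocks * block_size;
}

/* The same product kept in 32 bits: well-defined, but reduced modulo 2^32. */
uint32_t bytes_for_blocks_narrow(uint32_t blocks, uint32_t block_size)
{
    return blocks * block_size;
}
```

With 2^31 blocks of 512 bytes, the wide version returns 2^40 while the narrow one silently wraps to 0, which is the kind of slip the explicit-width habit is meant to make visible.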
There'd be no need to rewrite anything to work with Rust; the binaries emitted by the compiler should be just fine, assuming ABI compatibility. Maybe some changes to the way things are linked?
It'd definitely be tricky to use Rust for kernel drivers, but still so tempting!
Oberon, Modula-3 and D allow for it via their SYSTEM/Unsafe/@system modules, but the former two failed to make a dent in the OS market (for various reasons), and D still has some improvements to its memory model going on.
Also Ada and SPARK are usually the languages to reach for in life critical systems.
Also, let's not forget that before C became widespread outside UNIX, Modula-2 and Pascal dialects were saner alternatives.
Why not?
Do you turn off the TBAA in your compiler? In my experience most systems programmers either turn it off, or don't know the rules.
[edit] I forgot int is typically 32 bits on 64-bit targets. The same argument would still apply for uint16_t and smaller, though.
Only if int is > 32 bits on your platform, which is quite rare these days.
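One way to read the uint16_t remark is as a point about integer promotion: operands narrower than int are promoted to (signed) int before arithmetic, so even "unsigned" code can hit signed overflow. A small sketch, assuming a 32-bit int and a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical helper: full 32-bit product of two 16-bit values. */
uint32_t mul_u16(uint16_t a, uint16_t b)
{
    /*
     * Writing `a * b` directly would multiply two promoted ints; with a
     * 32-bit int, 65535 * 65535 does not fit and the overflow is undefined.
     * Converting to uint32_t first keeps the arithmetic unsigned.
     */
    return (uint32_t)a * (uint32_t)b;
}
```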
Does TBAA mean type-based alias analysis?
https://en.wikipedia.org/wiki/Alias_analysis#Type-based_alia...
(EDIT: I keep forgetting HN doesn't support markdown.)
Also, consider a language like Ada for driver work.
A lot of people seem to assume that Chris (the author) was talking about managed memory, which he never mentioned once. Managed memory is runtime safety, a type system is compile-time safety. He's complaining about the type system. As an example:
> address memory as a flat range of bytes [...] I can then typecast into byte-aligned structs
You should never have to do that. You shouldn't be able to do that. Your job should be far simpler. Look at unique_ptr: a whole class of bugs are eliminated by this ZERO cost abstraction. Possibly what Chris is advocating is being able to describe what an I/O port is to the compiler and then using that abstraction to write your SCSI driver. This intent should be compiled down to as-good machine code (if not better) than what your C compiler would have given you - in the same way that unique_ptr is compiled.
I don't think any existing language gets this right.
Someone, at some point, has to do this though. Custom memory allocators are more or less predicated on having a byte buffer you chop up and use like this.
I think the best we get -- especially in driver code -- is well-thought-out designs that have low-cost abstractions between the device details and the application logic. But that seems like a library detail more than a compiler or language one.
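For illustration, here is roughly what "a byte buffer you chop up" looks like as a bump allocator. This is a sketch under stated assumptions: `struct arena` and `arena_alloc` are hypothetical names, alignment must be a power of two, and nothing is ever freed individually.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena: a caller-supplied byte buffer handed out front to back. */
struct arena {
    uint8_t *base;
    size_t size;
    size_t used;
};

/* Returns `n` bytes aligned to `align`, or NULL if the arena is exhausted. */
void *arena_alloc(struct arena *a, size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t start = (uintptr_t)a->base + a->used;
    size_t pad = (size_t)(-start & (align - 1)); /* bytes to next boundary */
    size_t left = a->size - a->used;

    if (pad > left || n > left - pad)
        return NULL;

    a->used += pad + n;
    return a->base + (a->used - n);
}
```

The pointer arithmetic stays on `uint8_t *`, and the caller only ever sees suitably aligned `void *` results, which is one way to confine the byte-level view to a small library layer.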
Which is precisely the reason C is still used.
Until we have a language that produces at least as good results as C and is safer we're not going to see any change in this area.
Quite a lot of CPUs would just trap here. Assuming that unaligned access is allowed is a sin.
Or even worse -- for example, ARM CPUs usually round-down the misaligned address to the closest boundary when alignment checking is disabled. This means that attempting to access a 4-byte int at location 11 will silently let the CPU access it at location 8. This can manifest in some very nasty bugs.
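A common way to sidestep both the trap and the silent round-down is to never dereference a possibly misaligned pointer at all and copy the bytes instead. A minimal sketch, with hypothetical helper names (compilers typically turn the fixed-size `memcpy` into a plain load or store where the target allows it, though that is an optimization, not a guarantee):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any byte address. */
uint32_t load_u32_unaligned(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v); /* byte copy, no alignment requirement */
    return v;
}

/* Hypothetical helper: write a 32-bit value to any byte address. */
void store_u32_unaligned(uint8_t *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Reading the 4-byte value at offset 11 of a buffer this way touches bytes 11 through 14, never 8 through 11.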
Because this is very common practice in device or network code, in my experience.
Yes, it's "easier" to introduce some bugs in C than Ruby (or Go, or whatever), but that's because whoever wrote that code with the bug didn't know C well enough. Is that C's fault? Same can be said about any language, really.
If you don't know that String#match returns nil on unsuccessful matches and try to call MatchData#[], you'll get a NPE (something along the lines of "undefined method `[]' for NilClass"). This is very similar to dereferencing a NULL pointer in C[1].
[1]: I know dereferencing a NULL pointer in C is undefined behaviour, but your program will crash—if you're lucky enough—when you try to work with NULL pointers when you don't expect them.
So, yes, it is C's "fault" that it doesn't protect against classes of bugs that many other languages do. Sure, those languages have some of the same bugs that C does, but they're missing most of the very worst ones and that's really powerful. For example, a garbage collector protects against accessing dangling pointers: it's just not something the programmer has to worry about at all.
Rejecting criticisms of C's safety inadequacies with "just code better"/"just learn the language better" doesn't work in practice: there have been too many high-profile vulnerabilities in C software, many of which would've been much harder to trigger in other languages.
So we can make everyone a better programmer, or we can make better languages, or we can throw in the towel and say things are good enough. I think he's suggesting the correct path.
Examples of dangerous UB in C always use deliberately sloppy code for pedagogical reasons. For real-world examples of problems, look at the CVE database.
I thought the promise of computers is that we didn't need to have smart people working on repetitive, boring, error-prone jobs.
And yes, there's a line between languages like C, Ruby, Python, Objective-C, etc. on one hand that don't actively try to make bugs hard, and Ada, Rust, Haskell, Ur, etc. on the other. That line is not particularly lined up with something like interpreted vs. compiled or old vs. new, and if you look for the line there, you won't see it.
Much less those who can do the same with C++.
But when you have larger teams, it gets even harder. People just think so differently and misunderstand intentions without realizing.
I did think I could do that in my twenties. 15+ years later I have a lot more respect for C.
If you're really really lucky, your coworkers will only write sloppy code by accident. But unless you're only working on toy projects, statistics will catch up to you and sloppy code will happen. To err is human.
By NASA standards, I suspect most of your code has been written "sloppy". As has most of mine.
> but programming languages aren't safe, only code can be safe, and that depends entirely on the developer.
Languages can be safe in the sense that they can force code to be safe in specific ways, or at least warn you better with unsafe opt-ins or better static analysis.
We agree that the developer is to blame for the thousands of overflow CVEs out there.
One developer recognizes they're not an infallible robot, nor are their coworkers, nor is the new intern they're about to hire, and uses the tools at their disposal - static analysis, "safe" languages, etc. - to catch and fix some large percentage of certain mistakes they, and those they work with, make.
Another developer scoffs at the first for "blaming their tools" and tries to avoid mistakes with sheer willpower. By not setting up static analysis, maybe they save enough time to do an additional 10-20 code reviews over the course of the project.
All else being equal, who will end up with safer code?
> Yes, it's "easier" to introduce some bugs in C than Ruby (or Go, or whatever), but that's because whoever wrote that code with the bug didn't know C well enough.
If this is true, nobody knows C well enough. Find me a programmer who's written a sufficiently large C project without a single bug, and I will worship them as a living god.
> Is that C's fault? Same can be said about any language, really.
I don't care about fault, per se. But sure, let's blame C. And every other language. Let's not blind ourselves against their faults, and the possible ways we might improve them, and the possible ways we might adapt ourselves to them.
Let's not saddle ourselves with stone axes for the rest of our lives.
> If you don't know that String#match returns nil on unsuccessful matches and try to call MatchData#[], you'll get a NPE (something along the lines of "undefined method `[]' for NilClass"). This is very similar to dereferencing a NULL pointer in C[1].
Hence the point of talks such as "Null References: The Billion Dollar Mistake", and why some languages are designed to avoid letting you access potentially null/nil/nothing variables without checking that they aren't first.
What about C++ (which adds RAII, which I believe is indispensable for writing correct code, especially over C) or Rust, which adds much better memory correctness? I'm in agreement where the article says,
> C, and derivatives like C++, is a very dangerous language the write safety/correctness critical software in, and my personal opinion is that it is almost impossible to write security critical software in it
Though I believe it can be done in C++, with some discipline (but much less than C would require).
> As a systems programmer (I write SCSI driver code in C),
I think SCSI driver code counts as a niche application
> I can't overemphasize how important it is to be able to address memory as a flat range of bytes, regardless of how that memory was originally handed to me. I need to have uint8_t* pointers into the middle of buffers which I can then typecast into byte-aligned structs.
My understanding is that C doesn't generally allow this; that's what the strict aliasing rule is, and what's "wrong"[1] with several of the examples in the article. IIRC, you can get a [unsigned] char * into a struct (but why?[2]), but attempting to cast a char * to a struct foo * is forbidden.
(Of course, with amends to the thread's original purpose, which is asking what the common layman understands / uses / depends on. Type aliasing is not well understood in my opinion. I'm not entirely confident I've got it right in this post.)
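As a concrete instance of the aliasing point, the int-to-float style of reinterpretation can be written without casting between unrelated pointer types by copying the object representation. A sketch, assuming a 32-bit IEEE 754 `float` (the helper name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "sketch assumes a 32-bit float");

/* Hypothetical helper: the raw bit pattern of a float. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    /* *(uint32_t *)&f would read a float through a uint32_t lvalue,
     * which the strict aliasing rule forbids; memcpy does not. */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```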
> If your memory manager would not allow this or would move this memory around, that's a non-starter.
(same comments about Rust/C++)
> We never use 'int', by the way. We use the C99 well-defined types in stdint.h. Could this value go greater than a uint32_t can represent? Make it a uint64_t. Does it need to be signed? No? Make sure it's unsigned. A lot of what he's complaining about is sloppy code.
In my experience, this is a rare thing; especially while interviewing, I find the majority of candidates — claiming to be most comfortable in C (we allow language of choice, in the hopes that you choose your strongest!) — don't know what `size_t` is.
[1]: The array copy code is correct, but the author is lamenting optimizations that cannot be taken unless we assume the pointers don't alias; the int-to-float code is UB (hence why he writes "miscompile" in quotes; it's UB, so by definition there's no wrong output (though an error might be nice); this is also why "obvious" is in quotes: humans know what the programmer meant, but what the programmer wrote is UB; I think this is telling about C: human expectation and the language don't align, from a language-design standpoint, this is not good).
[2]: most of the time I see people reaching for a char-pointer-into-a-struct, or cast-char-pointer-to-struct, they're short-circuiting actually decoding some I/O byte-stream into an in-memory data structure. This is not portable, unless, maybe, you use "packed" structures (which is still not portable, I believe), but then you're sacrificing performance by potentially having unaligned members in the struct (which are harder for the processor to deal with, and might require multiple (e.g., MIPS) or unaligned (e.g., x86, amd64) loads/stores).
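What "actually decoding" might look like, as a sketch: the 6-byte big-endian header layout and every name below are made up for illustration and do not describe any real protocol.

```c
#include <stdint.h>

/* Hypothetical in-memory form of a 6-byte wire header. */
struct wire_header {
    uint16_t type;   /* wire bytes 0..1, big-endian */
    uint32_t length; /* wire bytes 2..5, big-endian */
};

uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode field by field; works at any address and on any host byte order. */
struct wire_header decode_header(const uint8_t *bytes)
{
    struct wire_header h;
    h.type = get_be16(bytes);
    h.length = get_be32(bytes + 2);
    return h;
}
```

The struct keeps its natural, aligned layout in memory, and the wire format is pinned down by the decoder rather than by compiler padding rules.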
But surely, we can design a language that has no undefined behavior, without substantial deviations from C's syntax, and without massive performance penalties. This language would be great for things that prize security over performance.
And the trick is, we don't need to rewrite all software in existence in a new language to get here! C can be this language; all we need is a special compilation flag that replaces undefined behavior with defined behavior. Functions called inside a function's arguments? Say they evaluate left-to-right. Shift right on signed types? Say it's arithmetic. Size of a byte? Say it's 8 bits. memset(0x00) on something going out of scope? If the developer said to do it, do it anyway. Underlying CPU doesn't support this? Emulate it. If it can't be emulated, then don't use code that requires the safe flag on said architecture. Yeah, screw the PDP-11. And yeah, it'll be slower in some cases. Yes, even twice as slow in some cases. But still far better than moving to a bytecode or VM language.
And when we have guaranteed behavior of C, we can write new DSLs that transcode to C, without carrying along all of C's undefined behavior with it.
You want to talk about writing in higher-level languages like Python and then having C for the underlying performance critical portions? Why not defined-behavior C for the security-critical and cold portions of code, and undefined-behavior C for the critical portions?
Maybe Google wouldn't accept the speed penalty; but I'd happily drop my personal VPS from ~8000 maximum simultaneous users to ~5000 if it greatly decreased the odds of being vulnerable to the next Heartbleed. But I'm not willing to completely abandon all C code, and drop down to ~200 simultaneous users, to write it in Ruby.
…Including undefined behavior around memory allocation, in particular use-after-free?
What to do about that is the big question, in my mind. Other forms of UB can mostly be patched up straightforwardly with a clean design (though there are some tough questions around bounds checks). But when it comes to UAF, there are basically three ways you can go about this and still remain a runtimeless systems language:
1. Compromise on "no UB" for use-after-free. UAF remains undefined behavior. Some variants of Ada with dynamic memory allocation have this, and I believe many Pascals did this. It's a popular approach in many new systems languages, like Jonathan Blow's Jai.
2. Disallow dynamic allocation. This is the approach taken by SPARK and other hardened variants of Ada.
3. Allow dynamic allocation, but statically check it with a region system. This is Rust's approach. Eliminating memory safety problems in this way while avoiding a GC is pretty much unique to that language, though it's obviously influenced by many other systems that came before it (C++, Cyclone).
All of the options have serious downsides. Option (1) opens you up to what has become, in 2015, a very common RCE vector. Option (2) is very limiting and pretty much restricts your language to embedded development. Option (3) has large complexity and expressiveness costs (though once you've paid the cost you can get data race freedom without any extra work, which is nice). Altogether it's a really difficult problem with tough tradeoffs all around.
There are obviously going to be limits to what can be done. If you access beyond memory, you get "bad data" if the address is mapped by the OS, or a crash if it's not. That's a clear bug, and we can't make C a language that is incapable of producing programs with bugs. I don't really think of this as "undefined" ... we define very clearly that one of two things happens, based on the OS' memory layout. That's very different from GCC's understanding, where undefined == "if I want to have the program upload a cat picture to Reddit instead of shift a signed integer right, then that's what I'll do." (facetious, but you get the idea. Many of GCC's 'optimizations' cause outright security vulnerabilities, and defy all logic, like deleting chunks of code entirely.)
We want the most logical thing to happen when a user does something, not a completely unexpected thing just because it happens to make some compiler benchmark test look a little better.
> Other forms of UB can mostly be patched up straightforwardly with a clean design
I'm betting there aren't any C programmers out there that know 100% of the behaviors that are undefined. I've been programming for 18 years, and I got bit the other day because I had "print(sqlRow.integer(), ", ", sqlRow.integer());" ... where the .integer() call incremented the internal read position. MinGW decided to evaluate the second call first, and then the first one, so my output ended up backward. You may think that one's obvious, just like I might think that a shift by more bits than the integer type holds being undefined is obvious, but there are people that would be surprised by both.
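In today's C, the order in which function arguments are evaluated is unspecified, so the usual workaround is to sequence the calls yourself. A sketch with a hypothetical `struct row` standing in for the sqlRow object above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical row cursor: each read advances the position. */
struct row {
    const int *cols;
    size_t count;
    size_t pos;
};

int row_next_int(struct row *r)
{
    assert(r->pos < r->count);
    return r->cols[r->pos++];
}

/* Writes "first, second" into out; returns what snprintf returns. */
int format_two(struct row *r, char *out, size_t out_size)
{
    /* Separate statements are sequenced, so `first` is read before `second`
     * on every compiler, unlike two calls inside one argument list. */
    int first = row_next_int(r);
    int second = row_next_int(r);
    return snprintf(out, out_size, "%d, %d", first, second);
}
```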
Stating that function arguments evaluate left-to-right, just like "operator," does in expressions, would be an infinitesimal speed hit on strange systems, and no speed hit at all on modern systems that can just as easily use an indirect move to set up the stack frame.
And if you have a processor that can't do arithmetic shift right, which would be extremely rare, then generate that processor's equivalent of "((x & m) ^ b) - b" after the shift.
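Spelled out for 32 bits, the "((x & m) ^ b) - b" trick might look like the sketch below, where m masks the bits left after a logical shift and b is where the sign bit landed. The function name is hypothetical, and everything is done in unsigned arithmetic so no step relies on how the compiler shifts signed values:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: arithmetic shift right built from a logical shift. */
int32_t asr32(int32_t x, unsigned s)
{
    assert(s < 32);

    uint32_t u = (uint32_t)x >> s;        /* logical shift, zero-filled */
    uint32_t m = UINT32_MAX >> s;         /* bits that survived the shift */
    uint32_t b = UINT32_C(1) << (31 - s); /* new position of the sign bit */
    uint32_t r = ((u & m) ^ b) - b;       /* sign-extend from bit (31 - s) */

    /* Map the two's-complement bit pattern back to int32_t without an
     * out-of-range conversion. */
    if (r <= (uint32_t)INT32_MAX)
        return (int32_t)r;
    return -(int32_t)(UINT32_MAX - r) - 1;
}
```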
If overflow is defined to wrap around then it's potentially an infinite loop (take N == MAXVALUE). With overflow defined as UB you can say the loop executes exactly N times (because you're not allowed to write code that overflows).
So UB is both bad and a source of power :)
But in the case of C, that is what it is about since unsigned integers have defined behavior, so you can only have UB and the optimizations when you use a signed integer.
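The wrapping case is easy to observe with a narrow unsigned counter. In this sketch `uint8_t` stands in for a full-width index, and the `cap` parameter is an artificial addition so that the non-terminating case can be seen without hanging:

```c
#include <stdint.h>

/* Counts iterations of `for (i = 0; i <= n; i++)`, giving up after `cap`. */
unsigned count_iterations(uint8_t n, unsigned cap)
{
    unsigned iterations = 0;

    for (uint8_t i = 0; i <= n; i++) {
        if (iterations == cap)
            break; /* without this, n == 255 would loop forever */
        iterations++;
    }
    return iterations;
}
```

For n = 10 the loop runs 11 times. For n = 255 the counter wraps from 255 back to 0, `i <= n` never becomes false, and only the cap stops it. With a signed index the compiler may assume that wrap never happens, which is where the "exactly N + 1 iterations" reasoning comes from.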
Nothing says "buggy port of Linux app to Windows" faster than "#pragma warning (disable: 4018)".
Clang is catching up. GCC doesn't seem to care as much. Given all the bitching about undefined behavior you'd think they'd up the warnings.
Compilers are programs too, y'know. We can define (and gasp, re-define) behavior.
The majority of that message is pretty well said, but this particular part leaves me cold. The problem isn't that 'int' is the default type, not 'long', nor is it that array indexing isn't done with iterators. (Ever merged two arrays? It's pretty clear using int indexes or pointers, but iterators can get verbose. C++ does a very good job, though, by making iterators look like pointers.) The problem is that, in C, the primitive types don't specifically describe their sizes. If you want a 32-bit variable, you should be able to ask for an unsigned or signed 32-bit variable. If you want whatever is best on this machine, you should be able to ask for whatever is word-sized. Unfortunately, C went with char <= short <= int <= long (, long long, etc.); in an ideal world, 'int' would be the machine's word size, but when all the world's a VAX, 'int' means 32 bits.
That is one of the major victories with Rust: most primitive types are sized, with an additional word-sized type.
-- Chris Lattner
Yes, a very long time. Modula-2 was born in 1978, but we can go back to Algol and Lisp even.