(~~Technically~~ Optionally, C11 has strcpy_s and strcat_s, which fail explicitly on truncation. So if C11 is acceptable for you, that might be a reasonable option, provided you always handle the failure case. Apparently, though, it is not usually implemented outside of the Microsoft CRT.)
edit: Updated notes regarding C11.
But strings in BASIC are so simple. They just work. I decided when designing D that it wouldn't be good unless string handling was as easy as in BASIC.
"Theoretically" is the word you're looking for: they're part of the optional Annex K so technically you can't rely on them being available in a portable program.
And they're basically not implemented by anyone but Microsoft (which created them and lobbied for their inclusion).
C is semantically so poor, I find it hard to understand why people use it for new projects today. C++ is overcomplicated, but at least you can find a good subset of it.
Strings in C are more like a lie. You get a pointer to a character and the hope there is a null somewhere before you hit a memory protection wall. Or a buffer for something completely unrelated to your string.
And that's with ASCII, where a character fits inside a byte. Don't even think about UTF-8 or any other variable-length character representation.
In fairness, the moment you realize ASCII strings are a tiny subset of what a string can be, you also understand why strings are actually very complicated.
One of the big problems with C programmers is they often neglect to check for and handle those failure cases. Did you know that printf() can fail, and has a return value that you can check for error? (Not you, personally, but the "HN reader" you) Do you check for this error in your code? Many of the string functions will return special values on error, but I frequently see code that never checks. Unfortunately, there isn't a great way to audit your code for ignored return values with the compiler, as far as I know. GCC has -Wunused-result, but it only outputs a warning if the offending function is attributed with "warn_unused_result".
I'm not a huge fan of using return values for error checking, but we have the C library that we have.
But also, "strings" and "time" are actually very complex concepts, and these functions operate on often outdated assumptions about those underlying abstractions.
C99 came so very very close with VLAs. You can declare a function like:
int main(int argc, char *argv[argc]) { ... }
But C99 requires the compiler to discard the type annotations and treat the declaration as equivalent to: int main(int argc, char **argv) { ... }
Imagine a world where the C string functions were declared as:
char *strndup(s, n)
size_t n;
const char s[n];
{
/* now we can do sizeof(s) and bounds checking! */
}
(You'd have to use K&R style declarations to get around the fact that the pointer argument comes before the length argument, alas.)
Edit: and then C11 made VLA support optional, since the feature didn't get used much, because the feature was only half-baked to begin with... sigh.
Even in safer languages such as Rust, there are often questions as to why certain string operations are either impossible, or need to be quite complicated for a rather simple operation, and these are then met with responses such as "Did you know that the length of a string can grow from a capitalization operation depending on locale settings or environment variables?"
P.S.: In fact, I would argue that strings are not necessarily all that complicated, but simply that many assume they are simpler than they are, and code that handles them is thus written on assumptions such as that the length of a string remains the same after capitalization, or that the result is not influenced by environment variables.
I remember thinking about setting the high bit to denote the end of string to save space.
Nowadays the binary for "hello world" might be as big as a whole operating system of the past.
(though honestly I can't recall the size of the OS on a boot floppy, but the original floppies were 160k)
Funny mind thing to forget to increment counters each year.
It has nothing to do with null termination.
And uninitialized memory is not self-describing in any way in the C language. The same is true in machine language.
This is a problem you have to bootstrap yourself somehow if you are to have any higher level language.
The machine just gives you a way to carve out blocks of memory that don't know their own type or size. C doesn't improve on that, but it is not the root cause of the situation. Without C, you still have to somehow go from that chaos to order.
Copying two null terminated strings into an existing null-terminated string can be perfectly safe without any size parameters.
void replace_str(char *dest_str, const char *src_left, const char *src_right);
If dest_str is a string of 17 characters, we know we have 18 bytes in which to catenate src_left and src_right.
This is not very useful though.
Now what might be a bit more useful would be if dest_str had two sizes: the length of string currently stored in it, and the size of the underlying storage. This particular operation would ignore the former, and use the latter. It could replace a string of three characters with a 27 character one.
"What? You mean I can type an arbitrary string and it works? I don't need to worry about terminators or the amount of memory I've allocated? You can concatenate two strings with +?!? What is this magic?"
I still love C, but I'd do my best not to have to write anything serious with it again.
Compared to the alternative (straight assembler) at the time as a systems programming language, C is a massive step up.
Also, the UNIX way was independent processes, so the APIs did not need to be thread safe, as there was no threading in the target architectures.
Now given the massive amount of existing C out there from the time of such architectures, you either have to move the API and language on to make it incompatible with existing code, or support the old baggage. The language has kept compatibility, and in this case, the github peeps have deprecated APIs using macros, so it's a reasonable approach.
An alternative approach would be to move the language on, but by its nature it won't be compatible with C, so you give it a new name. You call it things like Go, or Rust, or Swift. These are all C with the dangerous bits removed. It'll be interesting in 40 years' time to see if people are having the same conversation about these languages - 'OMG, how did people write stuff in Rust? It can't cope with [insert feature of distributed quantum computing]. It's really scary.'
I've been coding in JS on a daily basis for more than 10 years and today I learned there is a `with` statement in JS.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
Edit: well, seems like it's been deprecated/forbidden since ES5 (2009), so it makes sense I've never seen it.
Also, I just want to remind you that JS isn't just React. There are plenty of libraries written in C that introduce breaking changes over the course of 3 years. Nothing will stop people from finding ways to complain about JS though, I know. The hate-boner is very real.
I think stuff has kinda gotten better, but while Unicode had emoji to kinda save the day, dates never had this moment and we're still suffering through major messes on a daily basis because of it.
C's string manipulation functions are a regular source of the worst vulnerabilities in software.
Even if they're in the same category of legacy cruft, they're not even remotely in the same magnitude of consequences.
It's absolutely true that decades ago the C community was complacent, but it's not true now. Source: I taught secure coding in C/C++ in the 00s.
On BSDs and macOS you're always SOL because the syscall api isn't stable and only the C wrappers are.
It's easy to survive: just don't crash. :)
And, functions aside, it's trivial to write a C program that bombs out without calling any functions at all, safe or otherwise.
It's a language from a different era, for sure. Back then no one had the computing power to build Rust. And remember that before C, they were writing Unix in assembly language. So sprintf() was a big step up!
Why can't we just have some nice structures instead?
struct memory {
size_t size;
unsigned char *address;
};
enum text_encoding { TEXT_ENCODING_UTF8, /* ... */ };
struct text {
enum text_encoding encoding;
struct memory bytes;
};
All I/O functions should use structures like these. This alone would probably prevent an incredible amount of problems. Every high-level language implements strings like this under the hood. The only reason C can't do it is the enormous amount of legacy code already in existence...

1. https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...
Recall that during the rise of C, people were writing machine code on punch cards. Assembly -> Machine code has far more footbullets than C, it is a tradeoff between hand holding and tiny fast code.
Wow, this blew up.
To all the people popping off about how great other languages are, tell me: when will we see the Unreal Engine written in Python, or Pascal, or Algol, or Rust, or Go... the next big step is WebASM (or .cu), and that's way more footbullet-y than C. And what is the native language all of your sub-30 year old interpreted languages were written in? Thank you!
I know C/C#/Python/Rust/Javascript.
After a decade of using C I am still not totally sure I didn't dangle a pointer somewhere in precisely the wrong way to create havoc. And yeah, that means I have to get better, etc. But that is not the point. The point is that even with a lot of experience in the language, you can still easily shoot yourself in the foot and not even notice it.
Meanwhile, after a month of using Rust I felt confident that I hadn't shot myself in the foot, because I know what the compiler guarantees, e.g. around ownership. While in C shooting myself in the foot happened quite often, in Rust I would have to specifically find a way to shoot myself in the foot without the compiler yelling at me, and quite frankly I haven't found such a way yet.
Javascript is odd, because the type system has quite a few footguns in it. This is why things like Elm and TypeScript exist: to avoid these footguns.
I don't want to take away from the accomplishments of C, and I still like the language, but to claim that shooting yourself in the foot is equally likely in all languages is not true.
The “C competing with assembly” meme was very specific to microcomputer game and operating system development, not more general microcomputer application development, and not to minicomputer or mainframe development.
Or Fortran, Algol, Lisp, Cobol, Basic, Pascal, ...
He had some bug where in one place it returned to the start of the string, executed it, and kept going. The end result just happened to be a nop. Had been like that in production for a couple of years.
The reason why C won had little to do with its advantages as a language over the competitors. It just happened to be the systems language for Unix, which was the winner in the early OS wars on microcomputers (for unrelated reasons). Once it became so established, there was a positive feedback loop: you would write portable code in C, because you knew that it was the fastest language that most platforms out there would support. And then any new platform would offer a C compiler, because they wanted to be able to run all the existing C code out there. And so, here were are.
Those of us who have always known about less dangerous 'system' languages (Pascal probably being the most popular) lament the fact that so much code got written in C instead.
It wasn't inevitable. It was preventable! It just didn't happen that way for reasons which are largely historical.
I don't work for the Rust Evangelism Strike Force, my main project is written in (as little) C (as possible), but I beg anyone who has a choice: use something else! Rust is... fine, Zig is promising. Ada still works!
Writing out the set {Python, Pascal, Algol, Rust, Go} tempts me to say uncharitable things about your understanding of the profession, but I accept you were just being snarky so I'll just gesture in the direction of how $redacted that is.
Why would a huge C++ (not C, btw) codebase with roots going back to the 90s be rewritten in any other language?
And in fact how is the language Unreal Engine written in relevant to C having footguns?
Go (golang)
C#
But please, nothing about using unsafe.
strcpy() was replaced with a safer strncpy(), which in turn has been replaced with strlcpy().
The list is a ban of the less safe versions, where more modern alternatives exist.
http://the-flat-trantor-society.blogspot.com/2012/03/no-strn...
C has unsafe basic functions because the programs written then were much simpler, and this sufficed. There's decades of PL research resulting in new languages that give better guarantees than C, allowing you to worry less about wrestling with the language and more on your business logic.
> C programmers don’t trust anything, and they’re better programmers for it.
By that token, frontend JS programmers trust things even less than C programmers, and they're even better programmers for it. \s (In reality, FE JS devs mainly wish that browser environments were more consistent and predictable, and would disagree that they are better developers because of it.)
"../../../../../../../../../../../../../../../../../../../../etc/shadow" is not a file someone would ever reasonably want to access. But is there an easy way to look for nonsense paths without potentially limiting functionality, or writing more code than you wanted to? Nope.
The same footgun exists in all languages; C's design just has a hair trigger.
BTW, in macOS there are "secure bookmarks" (see NSURL docs) that are effectively capability tokens: when user drags a file, or selects it in an Open File dialog (which runs isolated from the app), the kernel creates an app-specific token that grants access to that file to the app, so it can access it beyond its sandbox.
Unfortunately, it is riddled with sharp knives that can cut you, open flames that can burn you, gas that can smother you, water than can drown you and food that can make you sick if you prepare it incorrectly.
Some react to this potential safety threat by banning the use of knives, stoves, sinks, and food from a kitchen.
Fortunately most attempts at safety just require having a microwave to prepare the frozen pizza or Uber Eats delivery.
I've seen the same question posted way too often by beginners to C. "I've created a char*. Why am I getting <random fault> when I try to write to it?"
This in turn stems from the dedicated CPU support for working with zero-terminated strings that traces all the way back to the PDP-11. So what C does here is exactly what it has always been doing - it provides a thin wrapper over the existing hardware functionality.
The variadic arguments are of the same nature - they basically allow for manual call stack parsing, again something that is a level down from the application code.
It's also easy to see how an API like sprintf and scanf came about - someone's just got tired of writing a bunch of boilerplate code to print a float with N decimals aligned to the left with a plus sign. So they threw together a function call "spec" (the format string), added a call stack parsing support (va_args) and - voila - a beautifully concise print/scan interface. It is a very clever construct, you've gotta give it that.
The flip side is that it required people to pay close attention to how they used it, which wasn't that bold a requirement back then. But as time went on, the average skill of C programmers went down, their use of the language did too, and more and more people started to step on the same rakes.
So, here we are. Zero-terminated strings are forbidden and va_args calls are nothing short of magic.
If they had used a {pointer, size} pair instead, it would have avoided all of these string problems, most buffer overflows, even the GTA Online loading problem that was on HN recently.
These days (ptr, size) is probably 16 bytes -- longer than almost all words in the English language (the Scrabble SOWPODS list maxes out at 15 letters). A pointer alone is 8 bytes. Back at the dawn of C in 1970, memory was 7 to 8 orders of magnitude more expensive than today (about 1 cent per bit in 1970 USD). (Today, cache memory can be almost as precious, but I agree that the benefits of bounded buffers probably outweigh their costs.)
8-byte pointers today are considered memory-costly enough "in the large" that even with machines commonly having dozens of GiB, the x32 ABI was introduced to go back to 32-bit addressing, aka 4-byte pointers. [1] There are obviously more pointers than just char* in most programs, but even so.
Anyway, trade offs are just something people should bear in mind when opining on the "how it should be"s and "What kind of wacky drugs were the designers of language XYZ on?!!?".
[1] https://stackoverflow.com/questions/9233306/32-bit-pointers-...
Pascal, to save one byte, limited strings to length 255. Bad decision.
I think the sentinel character was the best choice in hindsight and at the time in that regard.
But I wish the xxx_s versions and strdup would have made it into the standard like 30 years ago.
It is not that there is anything intrinsically wrong with these functions. You can technically use all of them and I have been using all of them, safely, for decades.
The issue is they are huge traps to the point that in a larger piece of software one can say "well, it's just not worth it".
You can go much, much, much further than that.
In a couple of embedded projects I worked on, some of the rules were:
* dynamic allocation after application has started is banned -- any heap buffers and data structures must be allocated at the start of the application and after that any allocation is a compile time error,
* any constructs that would prevent statically calculating stack usage were banned (for example any form of recursion except when exact recursion depth is ensured statically),
* any locks were banned,
* absolutely every data structure must have size ensured, in a simple way, beyond any reasonable doubt,
etc.
Except when you have these rules in Java, the ironic counter-point is "if you are doing this much memory control yourself, you should just use C or C++ or something".
I'll keep your comment in mind next time I see that rebuttal. Thank you.
There is a bunch of misconceptions about Java. Java is actually very performant and memory allocation is generally cheaper than in C (except for inability to have good use of stack in Java). What's slow about Java is all the shit that has been implemented on top of it, but that's another story for another time.
For example, allocation in Java is basically incrementing the pointer. And deallocation for most objects is basically forgetting the object exists.
No, you don't want to "limit the use of new", that's wrong approach.
What you want is to have objects that are either permanent or last very short amount of time.
The worst types of objects are ones that have a kind of intermediate lifetime, i.e. ones that are allowed to mature out of eden. These cost a lot to collect.
The objects that have very short lifetime are extremely cheap to collect.
So if your function takes arguments, creates couple of intermediate objects and then never returns them (for example they were just necessary for inner working of the function) and your function does not call a lot of other heavy stuff, then it is very likely the cost of those temporary objects will be very low. Also, they tend to be allocated very close to each other and so pretty well cached.
- the ternary operator ("?") was strictly forbidden. One had to use the full "if () {..} else {..}" syntax with comments inside each branch even if the branch was empty
- a dynamic array written in an abstract way, when used and implemented specifically for current project had to become a constant static one, with values precalculated and copy/pasted to current project source. This was a fun one to do maintenance work years later.
- magic numbers inside code were forbidden. All numbers had to be defined in a specific header, with an explanation of why that number had said value.
- no variadic parameters. All functions had to take a fixed set of parameters
- macros were to be used as little as possible. Code review sometimes wasted 50% of its time on the overuse of macros that were not already "classic" from the project's point of view
- operator overloading was strictly forbidden. Overloading functions was forbidden too.
The argument was to allocate memory freely and let it pool memory as necessary. Fair enough, it was simpler and fit the standard expectation of development.
The issue is that if you talk with the allocator team they complain of not being able to fix performance issues fast enough due to allocations firing off left and right in the middle of a request.
I never realized that my view of C programming is heavily influenced by MISRA until your comment.
I know game engine programming follows a similar, perhaps unspoken, convention.
Also doesn’t the OS lie? I thought the memory wasn’t really physically assigned until first use.
In both cases the project size is small enough, or the scrutiny is high enough that the ad-hoc allocator doesn't develop. The environment is also simple enough that the memory cheats you're thinking of don't exist (or you can squash them by touching all allocated memory up front).
The goal of these rules is to improve reliability and timeliness of your application. If you intend on working around those rules to do what the rules explicitly forbid then either you or the rules are wrong.
You could maybe call file-scope buffers with a size counter a dynamic memory allocation, e.g. for storing RS232 or CAN messages, since they shrink and grow.
The important thing is that you want to know that flooding one buffer won't flood another, which malloc could result in if it was used for unrelated buffers.
That depends on the OS. Linux lies (overcommits), Windows doesn't. In embedded it's more typical to have a special OS like VxWorks or FreeRTOS that don't lie to you, or to have no OS at all (like basically every arduino project)
How do you ensure that?
(It would have to be in the .c files, not the headers, might not be so clean)
> The ctime_r() and asctime_r() functions are reentrant, but have no check that the buffer we pass in is long enough (the manpage says it "should have room for at least 26 bytes"). Since this is such an easy-to-get-wrong interface, and since we have the much safer strftime() as well as its more convenient strbuf_addftime() wrapper, let's ban both of those.
(https://github.com/git/git/commit/91aef030152d121f6b4bc3b933...)
> The traditional gmtime(), localtime(), ctime(), and asctime() functions return pointers to shared storage. This means they're not thread-safe, and they also run the risk of somebody holding onto the result across multiple calls (where each call invalidates the previous result). All callers should be using their reentrant counterparts.
(https://github.com/git/git/commit/1fbfdf556f2abc708183caca53...)
https://github.com/git/git/commit/c8af66ab8ad7cd78557f0f9f5e...
It actually gives examples and a lengthy explanation and reasoning behind the ban.
If someone wants some fun, try this:
1. Slurp up all the FOSS projects that extend back to 90s or early 2000s.
2. Filter by starting at the earliest snapshot and finding occurrences of strcpy and friends that don't have the "n" in the middle.
3. For those occurrences, see which ones were "fixed" by changing them to strncpy and friends in a later commit somewhere.
4. See if you can isolate the part of the code that has the strncpy/etc. and run GCC on it. GCC -- for certain cases (string literals, I think) -- can report a warning if "n" has been set to a value that could cause an overflow.
I'm going to speculate that there was a period where C programmers were furiously committing a large number of errors to their codebases because the "n" stands for "safety."
If you are doing something like `sprintf(buffer, "%f, %f", a, b)`, yes it is tricky to choose the size of buffer frugally, but if you replace that by `ftoa` and constructing the string by hand, you are likely to introduce more bugs.
Edit: as pointed out in another post, you can do git blame to see the rationale for each ban, quite interesing.
A fun exercise you can do is put a "%s" in the format string, omit the string argument and see what happens to the stack.
I'd say the usual trap is rather the size of the target buffer, because that requires bigger static analysis guns. (I'm ignoring things like "%n", because then you're playing with fire already.)
char buf[2];
sprintf(buf, "%d", n);
This will happily write to buf[2] and beyond if n is negative or greater than 9.
If you're thinking about using it, consider instead:
- strlcpy() if you really just need a truncated but
NUL-terminated string (we provide a compat version, so
it's always available)
- xsnprintf() if you're sure that what you're copying
should fit
- strbuf or xstrfmt() if you need to handle
arbitrary-length heap-allocated strings

snprintf or nul-plus-strncat do what you want, but snprintf has portability problems on overflow. Most projects I've been on rely on strlcpy (with a polyfill implementation where not available).
It may actually be a bug that I got the warning, because the range of each input was checked, and I think the compiler is supposed to be smart enough to remember that.
std::string needs some tweaks, but it can mostly be treated as a built in and it wipes out a huge set of C string issues.
However, I look at old books on C, and then I look at this list, and I wonder if it would not have been helpful to, after mentioning that a function was banned, suggest what the replacement is, even as a comment.
It's likely that the authors of this list didn't think the comments would be worthwhile for the audience (git developers).
- strlcpy() if you really just need a truncated but
NUL-terminated string (we provide a compat version, so
it's always available)
- xsnprintf() if you're sure that what you're copying
should fit
- strbuf or xstrfmt() if you need to handle
arbitrary-length heap-allocated strings
> we provide a compat version, so it's always available
Furthermore, imagine "src" is 1 MB of characters but we only want to copy the first 3. The git implementation would traverse the entire 1 MB to find the length first, but a proper implementation only needs to look at the first 3 chars. So they banned strncpy and provided a worse solution.
[1]: https://github.com/git/git/blob/master/compat/strlcpy.c
(strcpy is just banned because there's no bounds check, and they want to force use of strlcpy instead).
See https://developers.redhat.com/blog/2019/08/12/efficient-stri...
#pragma GCC poison printf sprintf fprintf
Turns out you just can't use them when you contribute code to the Git project. That makes sense, and seems reasonable.
Edit: wait, I can't use strcpy?! Screw that, then I'm not open sourcing my AGI!
https://github.com/git/git/blob/master/object-file.c#L1293
And currently used here (at least):
While I think such rules are a good idea, they only make sense if applied consistently, and their value depends on how religiously the tooling (duct tape and "process") enforces them (even so, you're still only one `#ifdef` away from undoing that "safety"). Having GCC[1] now support static analysis is a killer feature for this type of problem.
On the other end of the spectrum we have Huawei which instead of linting their code is finding creative ways to trick auditing tools and hide such warnings from auditors:
[0] https://news.ycombinator.com/item?id=22712338
[1] https://developers.redhat.com/blog/2021/01/28/static-analysi...
The strncpy() function is less horrible than strcpy(), but
is still pretty easy to misuse because of its funny
termination semantics. Namely, that if it truncates it omits
the NUL terminator, and you must remember to add it
yourself. Even if you use it correctly, it's sometimes hard
for a reader to verify this without hunting through the
code. If you're thinking about using it, consider instead:
- strlcpy() if you really just need a truncated but
NUL-terminated string (we provide a compat version, so
it's always available)
- xsnprintf() if you're sure that what you're copying
should fit
- strbuf or xstrfmt() if you need to handle
arbitrary-length heap-allocated strings
I just did a search on the keywords 'banned' and 'strncpy' [2]

[0] https://lore.kernel.org/git/20180724092828.GD3288@sigill.int...
[1] https://lore.kernel.org/git/20190103044941.GA20047@sigill.in...
[2] https://lore.kernel.org/git/20190102093846.6664-1-e@80x24.or...
https://github.com/git/git/commits/master/banned.h
(Git development is done by emailing patches. Those patches include the git commit message, which we can see just by looking at the history of the file. Sometimes there's additional discussion on the ML, but the most important details are in the commit message because the git development team is very disciplined about that.)
https://github.com/git/git/commit/1fbfdf556f2abc708183caca53...
https://github.com/git/git/commit/91aef030152d121f6b4bc3b933...
It would be good to know what the commonly-accepted alternatives are.
For example: https://lgtm.com/rules/2154840805/
Much like with all other forms of effective censorship, I see this as a quick short-term "fix" with hidden long-term costs[1]. IMHO this sort of anti-thinking just leads to even worse, more dogmatic and cargo-cult, programmers who know less and less about the basics and then go on to make even more subtle errors.
Somehow the collective software industry has managed to propagate the notion that people are incapable of doing even basic arithmetic. Yet they think people are capable of creating complex systems with even more subtle behaviour? The justification would normally be because it's not directly affecting security. WTF. It's beyond stupid.
The only C function I think should be truly banned is gets(), because it is actually impossible to calculate what size of buffer it needs. That is not true of any of the others on this list.
[1] By short and long, I mean decades vs centuries.
Static analysis would probably be more robust, but way more involved.
/joke
It should be strncpy(a,b,(size_t)-1)!
- strcpy: no bounds check
- strcat: no bounds check
- strncpy: does not nul-terminate on overflow
- strncat: no major issues, probably to force usage of strlcat
- sprintf: no bounds check
- vsprintf: no bounds check
- gmtime: returns static memory
- localtime: returns static memory
- ctime: returns static memory
- ctime_r: no bounds check
- asctime: returns static memory
- asctime_r: no bounds check
The str functions all have safer alternatives. The time functions have reentrant alternatives, and/or alternatives that provide a bounds check.