So, if we went down that path, what would we have? All the fun of having "legacy" APIs that seem to work but internally only accept strings up to 64 KB long and mysteriously chop off excess bytes when you least expect it. It's the Y2K problem all over again.
And just when you finally think you're done with it, memory gets cheaper again, size_t is 64 bits, and someone invariably wants to store a binary blob larger than 4 GB as a string. Fun times again.
Have we forgotten how much trouble we went through in the 90s to handle memory on the x86 "640k is enough for everybody" architecture?
Use the first 7 bits of each length byte as significant digits and the last bit as a boolean flag (whether the next byte is string payload or another length byte).
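A minimal sketch of that kind of variable-length prefix, assuming the low 7 bits of each byte carry the value (least significant group first) and the top bit is the "another length byte follows" flag. The names varlen_encode and varlen_decode are made up for illustration, and the bit assignment is only one possible reading of the scheme described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes len as a variable-length prefix into out, which must have room
   for up to 10 bytes (enough for any 64-bit value).
   Returns the number of prefix bytes written. */
size_t varlen_encode(uint64_t len, unsigned char *out)
{
    assert(out != NULL);
    size_t n = 0;
    do {
        unsigned char b = (unsigned char)(len & 0x7F); /* 7 value bits */
        len >>= 7;
        if (len != 0)
            b |= 0x80;          /* flag: another length byte follows */
        out[n++] = b;
    } while (len != 0);
    return n;
}

/* Reads a prefix written by varlen_encode from at most avail bytes.
   Stores the length in *len and returns the number of prefix bytes
   consumed, or 0 if the prefix is truncated or longer than 10 bytes.
   Overflow in the tenth byte is not detected in this sketch. */
size_t varlen_decode(const unsigned char *in, size_t avail, uint64_t *len)
{
    assert(in != NULL && len != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < avail && i < 10; i++) {
        value |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {  /* flag clear: payload starts next */
            *len = value;
            return i + 1;
        }
    }
    return 0;
}
```

Lengths up to 127 cost one byte and there is no fixed upper limit, but every access to the payload has to decode the prefix first.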
For instance, if you copy a string you also have to update the end pointer instead of just copying the size attribute in bulk. And you get the same disadvantages of non-portable strings, different representations depending on the architecture/endianness etc...
I completely agree with the OP, there's no perfect solution. If addr + len was truly superior I'm sure we'd see
struct string { long len; char s[]; };
or for your version struct string { char *endptr; char s[]; };
everywhere. And the C standard library would have evolved along with it.
Off the top of my head, the only thing that makes '\0'-terminated strings special in C is that it's the way string literals are represented. It would be trivial to recode all of string.h using addr + len instead of NUL-terminated strings.
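As a rough illustration of what such a recoding might look like, here is a sketch built on the struct from the comment above. The function names (string_new, string_len, string_chr, string_equal) are hypothetical, not part of any standard library, and only a few string.h counterparts are shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct string { long len; char s[]; };  /* layout quoted in the comment */

/* Allocates a counted string holding a copy of len bytes.
   Returns NULL on allocation failure; release with free(). */
struct string *string_new(const char *bytes, long len)
{
    assert(len >= 0);
    struct string *str = malloc(sizeof *str + (size_t)len);
    if (str == NULL)
        return NULL;
    str->len = len;
    if (len > 0)
        memcpy(str->s, bytes, (size_t)len);
    return str;
}

/* Counterpart of strlen: constant time, no scan for a terminator. */
long string_len(const struct string *str)
{
    return str->len;
}

/* Counterpart of strchr: index of the first c, or -1 if absent.
   Embedded '\0' bytes are treated as ordinary data. */
long string_chr(const struct string *str, char c)
{
    const char *hit = memchr(str->s, c, (size_t)str->len);
    return hit ? (long)(hit - str->s) : -1;
}

/* Counterpart of strcmp(a, b) == 0. */
int string_equal(const struct string *a, const struct string *b)
{
    return a->len == b->len &&
           memcmp(a->s, b->s, (size_t)a->len) == 0;
}
```

Note that string_new("ab\0cd", 5) keeps all five bytes, which a NUL-terminated copy would not, but a string literal still has to be converted before any of these functions can use it.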
Like what Churchill said about democracy: NUL-terminated is a terrible solution, but it beats whatever's in second place.
Consider using a length field. How big should that field be? If it's fixed size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.
Next, what endianness should that field have? NUL-terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.
Also, consider C-style string parsing, e.g. strtok/strsep. These could not be implemented with length-field strings.
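For context, a sketch of the in-place splitting that strtok/strsep-style parsing relies on. next_field is a hypothetical helper written for this example, not the library strsep. Presumably the point is that each token becomes a valid string just by overwriting one delimiter byte, with no copying and no room needed for a header in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the next field of the string at *cursor, split at delim.
   The delimiter is overwritten with '\0' so the field is itself a valid
   NUL-terminated string. *cursor is advanced past the delimiter, or set
   to NULL once the last field has been returned. */
char *next_field(char **cursor, char delim)
{
    assert(cursor != NULL && delim != '\0');
    char *start = *cursor;
    if (start == NULL)
        return NULL;            /* nothing left to parse */
    char *end = strchr(start, delim);
    if (end != NULL) {
        *end = '\0';            /* terminate this field in place */
        *cursor = end + 1;
    } else {
        *cursor = NULL;         /* this was the last field */
    }
    return start;
}
```

Called repeatedly on a writable buffer holding "a,bc,,d" with ',' as the delimiter, this yields "a", "bc", "" and "d", each pointing into the original buffer.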
Explicit length is better when you have an enforced abstraction, like std::string, but at that point you’re not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.
So what was the “one-byte mistake”? The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that “mistake,” we would be unable to make a string longer than 65k - a far more serious limitation than anything NUL termination imposes!
K&R got it right.
The C approach takes a whole different philosophy. You want a "string"? NUL-terminated. Simple. You want a buffer? Do it yourself.
I agree that NUL-terminated strings were the right decision at the time (and aren't that bad now), but there are sane ways to do counted strings.
(By secondary costs I mean things like the myriad bugs caused by null-terminated strings, the severe performance penalties involved in copying and manipulating them, the unfortunate implications they have for file formats and network protocols, etc.)
I'd also treat a character + combining characters as a single character.
If I got to choose in this day and age, I would define a C string/array as a pointer to a long int, and after the long int we have our data.
[1] http://webcache.googleusercontent.com/search?q=cache:http://...
I for one do not see how to implement it.
In the week that str+len was abused left and right, someone surfaces an article to the front page about how str+NUL is wrong and everyone should use str+len.
${lang} is the language of the future
This looks like a macro for substitution, but maybe it's some hip new term I've never encountered. An actual language or just a placeholder for a language that hasn't been chosen yet?
What we've failed to do is ever revisit those decisions and change them where we've identified problems. Yes, you can probably compile (with warnings) files from UNIX v7, but we pay for that compatibility. But there's no question designing, building and maintaining a libc alternative is a colossal undertaking and not likely to happen on a whim. So here we are.
I wonder if you could do this compatibly in the compiler by adding another primitive type (counted string) which had the length in the bytes before the start of the null-terminated string. You'd need a new type because various routines in the standard library would have to invisibly have two versions for counted and non-counted strings (since if you incremented a string pointer, or used a function like strchr, you'd have to treat the result as a regular char *). "Safe" code would use a different call (say, cstrchr) that returned an index instead of a char *. The compiler could optionally warn on unsafe "legacy" calls, as it can with strcpy instead of strncpy.
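A library-level sketch of that layout, without any compiler support: a size_t length sits in the bytes just before the NUL-terminated data, and callers hold a plain char * to the data. Only the name cstrchr comes from the comment; cstr_new, cstr_len and cstr_free are hypothetical, and the choice of size_t for the header is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Memory layout: [size_t length][bytes ...]['\0'].
   The returned pointer addresses the bytes, so it can also be passed to
   ordinary string.h functions. Returns NULL on allocation failure. */
char *cstr_new(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    size_t *hdr = malloc(sizeof *hdr + len + 1);
    if (hdr == NULL)
        return NULL;
    *hdr = len;
    char *s = (char *)(hdr + 1);
    memcpy(s, src, len + 1);    /* copy the bytes and the terminator */
    return s;
}

/* Constant-time length, read from the header before the data.
   Only valid for pointers returned by cstr_new. */
size_t cstr_len(const char *s)
{
    return ((const size_t *)s)[-1];
}

/* "Safe" strchr: returns an index (or -1) rather than an interior
   pointer, which would no longer have a header in front of it. */
ptrdiff_t cstrchr(const char *s, int c)
{
    const char *hit = memchr(s, c, cstr_len(s));
    return hit ? hit - s : -1;
}

void cstr_free(char *s)
{
    if (s != NULL)
        free((size_t *)s - 1);
}
```

The weakness the comment points out shows up directly: s + 1 is still a valid NUL-terminated string, but calling cstr_len on it would read garbage, which is why a distinct type would be needed to keep the two apart.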
200,"STR"
We know where that got us...
Programming 101, rules 1 & 2:
1 - never trust your inputs
2 - always check your invariants.