The Most Expensive One-byte Mistake (2011)

(queue.acm.org)

60 points · fleaflicker · 12y ago · 53 comments

37 comments · 12 top-level

yongjik · 12y ago · 7 in thread

Yeah, if we only used strings marked with 2-byte integers, everybody would have been happy, because a 64 KB string is enough for everyone. (And let's be realistic, nobody sane would have chosen a 4-byte string length back in the early 70s.)

So, if we had gone down that path, what would we have? All the fun of having "legacy" APIs that seem to work but internally only accept strings up to 64 KB long and mysteriously chop off excess bytes when you least expect it. It's the Y2K problem all over again.

And just when you finally think you're done with it, memory is cheaper again, size_t is 64-bit, and someone invariably wants to store a binary blob >4 GB as a string. Fun times again.

Have we forgotten how much trouble we went through in the 90s to handle memory on the x86 "640k is enough for everybody" architecture?
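
A minimal sketch of the "mysteriously chop off excess bytes" failure mode, assuming a legacy API that keeps the length in a 16-bit field. The names `legacy_len` and `checked_len` are made up for illustration and do not come from any real library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical legacy helper: stores a length in a 16-bit field, as a
 * 1970s counted-string design might have. The conversion wraps modulo
 * 65536, so anything past 65535 bytes is silently lost. */
static uint16_t legacy_len(size_t real_len)
{
    return (uint16_t)real_len;
}

/* Hypothetical safer variant: reports the overflow instead of hiding it.
 * Returns 0 on success, -1 if the length does not fit in 16 bits. */
static int checked_len(size_t real_len, uint16_t *out)
{
    assert(out != NULL);
    if (real_len > UINT16_MAX)
        return -1;
    *out = (uint16_t)real_len;
    return 0;
}
```

With these definitions, a 70000-byte string handed to `legacy_len` is recorded as 4464 bytes long (70000 - 65536), while `checked_len` refuses it.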

dmethvin · 12y ago

This is similar to the "kill Hitler" time travel joke [1]: it's easy to say things would be better, but all we know for sure is that they would be different. Instead of `char *` we'd have strings that were `struct`, and we'd STILL have a ton of different string formats because of lengths and character formats. (Bonus problem: are the lengths in bytes, or in chars?)

[1] http://www.tor.com/stories/2011/08/wikihistory

acchow · 12y ago

You can have a multi-byte chained marker.

Use 7 bits of each byte as significant digits and the remaining bit as a boolean saying whether the next byte is string payload or another length byte.

wolf550e · 12y ago

https://developers.google.com/protocol-buffers/docs/encoding...

TheSoftwareGuy · 12y ago

Just store a pointer to the first character and a pointer to the last one. To get the length, you just subtract the two.

simias · 12y ago

That's exactly equivalent to having a "size" parameter with the same size as the pointer, except you have to use a subtract instruction when you want to get the length of the string, so I'd say it's inferior to just storing the length of the string.

For instance, if you copy a string you also have to update the end pointer instead of just copying the size attribute in bulk. And you get the same disadvantages of non-portable strings: different representations depending on the architecture, endianness, etc.

I completely agree with the OP, there's no perfect solution. If addr + len was truly superior I'm sure we'd see

    struct string { long len; char s[]; };

or for your version

    struct string { char *endptr; char s[]; };

everywhere. And the C standard library would have evolved along with it.

Off the top of my head, the only thing that makes '\0'-terminated strings special in C is that it's the way string literals are represented. It would be trivial to recode all of string.h using addr + len instead of NUL termination.
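
To make the addr + len idea concrete, here is a rough sketch built on the flexible-array struct quoted above, using `size_t` instead of `long` for the length. The type `cstr` and the helpers `cstr_new`, `cstr_len` and `cstr_cat` are hypothetical names, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Counted string with a C99 flexible array member. */
struct cstr {
    size_t len;
    char   s[];
};

/* Allocate a counted string holding a copy of n bytes from src.
 * Returns NULL on allocation failure or size overflow. Caller frees. */
static struct cstr *cstr_new(const char *src, size_t n)
{
    assert(src != NULL || n == 0);
    if (n > SIZE_MAX - sizeof(struct cstr))
        return NULL;
    struct cstr *p = malloc(sizeof *p + n);
    if (p == NULL)
        return NULL;
    p->len = n;
    if (n > 0)
        memcpy(p->s, src, n);
    return p;
}

/* Length is a field read, not a scan for a terminator. */
static size_t cstr_len(const struct cstr *p)
{
    assert(p != NULL);
    return p->len;
}

/* Concatenate two counted strings without searching for a terminator.
 * Embedded zero bytes are copied like any other byte. Caller frees. */
static struct cstr *cstr_cat(const struct cstr *a, const struct cstr *b)
{
    assert(a != NULL && b != NULL);
    if (a->len > SIZE_MAX - sizeof(struct cstr) - b->len)
        return NULL;
    struct cstr *p = malloc(sizeof *p + a->len + b->len);
    if (p == NULL)
        return NULL;
    p->len = a->len + b->len;
    if (a->len > 0)
        memcpy(p->s, a->s, a->len);
    if (b->len > 0)
        memcpy(p->s + a->len, b->s, b->len);
    return p;
}
```

Note that nothing here depends on string literals, which is where the NUL convention is baked into the language itself.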

acchow · 12y ago

I love this solution, but it probably would have made the move from 32-bit to 64-bit harder.

GnarfGnarf · 12y ago

That's exactly what I was thinking: a design based on a 16-bit length field would have been even more of a nightmare to migrate to 32, then 64 bits.

Like what Churchill said about democracy: NUL-terminated is a terrible solution, but it beats whatever's in second place.

millstone · 12y ago · 6 in thread

NUL terminated strings were the right decision for C. They’re certainly much simpler than length fields.

Consider using a length field. How big should that field be? If it's fixed size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.

Next, what endianness should that field have? NUL terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.

Also, consider C-style string parsing, e.g. strtok/strsep. These could not be implemented with length-field strings.

Explicit length is better when you have an enforced abstraction, like std::string, but at that point you’re not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.

So what was the “one-byte mistake?” The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that “mistake,” we would be unable to make a string longer than 65k - a far more serious limitation than anything NUL termination imposes!

K&R got it right.

astrodust · 12y ago

UTF-8 got it right with having a variable-length byte representation of numerical values. Seven-bit values unaffected. Longer values use more bytes as necessary.

The C approach takes a whole different philosophy. You want a "string"? NUL-terminated. Simple. You want a buffer? Do it yourself.

ScottBurson · 12y ago

No one doubts that there were advantages to NUL-terminated strings, but against them you have to weigh the many thousands of security holes that were thereby created.

TheSoftwareGuy · 12y ago

Instead of a length field, use a pointer to the last character. The length is the difference of these two pointers. The maximum string length is the size of your address space. Problem solved.

thedufer · 12y ago

The problem is hardly solved. Your string length computation is already wrong - the length is the difference between those pointers plus one.
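
The off-by-one is easy to see in a small sketch comparing the two usual pointer-pair conventions. Both function names are hypothetical; the half-open form (a pointer one past the final character) is offered here only as a common alternative, not as what the parent comment proposed.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive convention: `last` points at the final character, so the
 * length needs +1, and an empty string has no valid `last` to point at. */
static size_t len_inclusive(const char *first, const char *last)
{
    assert(first != NULL && last >= first);
    return (size_t)(last - first) + 1;
}

/* Half-open convention: `end` points one past the final character, so the
 * length is a plain difference and first == end means the empty string. */
static size_t len_half_open(const char *first, const char *end)
{
    assert(first != NULL && end >= first);
    return (size_t)(end - first);
}
```

For the five-character string "hello" starting at `s`, the inclusive pair is `(s, s + 4)` and the half-open pair is `(s, s + 5)`; both functions report 5.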

_kst_ · 12y ago

Hypothetically, C could use the first `sizeof (size_t)` bytes to store the length. Endianness shouldn't be much of an issue; just use the endianness of the current machine. You don't typically write the NUL terminator when writing a string to a file; similarly, you wouldn't typically write the length field when writing a counted string to a file.

I agree that NUL-terminated strings were the right decision at the time (and aren't that bad now), but there are sane ways to do counted strings.

millstone · 12y ago

You piqued my interest, and it looks like NUL-terminated strings in files are actually common. tar, ELF, JPEG, gzip all use them. When the serialization format matches the in-memory format, you can mmap directly into a struct, which is fast and convenient.

rw_grim · 12y ago · 5 in thread

So to be "safe" and "secure" we can only have strings 255 characters long (the most a one-byte length can count), or we need to waste a few bytes repeatedly for short strings. Sounds like the UTF-8 vs UTF-16/32 debate.

kevingadd · 12y ago

The reality is that null-terminated strings are dramatically more expensive than strings with a length counter in every regard other than memory usage, and the memory usage overhead from storing a length value is utterly minuscule compared to the actual size of the string. Even if you ignore all the secondary costs that result from the decision to use null-terminated strings, they're just poor engineering. There are far better ways to save a few bytes.

(By secondary costs I mean things like the myriad bugs caused by null-terminated strings, the severe performance penalties involved in copying and manipulating them, the unfortunate implications they have for file formats and network protocols, etc.)

TheLoneWolfling · 12y ago

My "ideal" string would be stored as a UTF-8 rope, with the additional restriction (which doesn't change the interface) that all characters within a rope node have the same encoded length. You could use overlong encodings internally where it makes sense, e.g. for one single-byte character among a bunch of longer ones; that's a micro-optimization that will save a few bytes in some cases.

I'd also treat a character + combining characters as a single character.

jevinskie · 12y ago

A length field really is insignificant. On a 64-bit machine, a 4GB max length field is half the size of the pointer to the string itself! C++ STL strings already use a length specifier and I don't think anyone is complaining about performance because of it.

arnehormann · 12y ago

Length could be encoded differently, as a varint. As long as the highest bit is set, the next byte is also part of the length: left-shift the result so far by seven and add the lower 7 bits. As soon as the highest bit is 0, we have the final length. The processing overhead is low, and any length up to 127 bytes costs only 1 byte. Not such a big issue.
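
A sketch of the scheme exactly as described above: most significant 7-bit group first, high bit set on every byte except the last. The names `varlen_decode` and `varlen_encode` are hypothetical. This byte order differs from Protocol Buffers varints, which put the least significant group first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a length from buf (at most avail bytes). Returns the number of
 * bytes consumed, or 0 if the input is truncated or exceeds 64 bits. */
static size_t varlen_decode(const unsigned char *buf, size_t avail,
                            uint64_t *out)
{
    assert(buf != NULL && out != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < avail; i++) {
        if (value >> 57)                 /* shifting by 7 would lose bits */
            return 0;
        value = (value << 7) | (uint64_t)(buf[i] & 0x7F);
        if ((buf[i] & 0x80) == 0) {      /* high bit clear: final byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;                            /* ran out of bytes mid-length */
}

/* Encode value into buf (capacity cap). Returns the number of bytes
 * written, or 0 if cap is too small. */
static size_t varlen_encode(uint64_t value, unsigned char *buf, size_t cap)
{
    assert(buf != NULL);
    unsigned char groups[10];            /* ceil(64 / 7) = 10 groups max */
    size_t n = 0;
    do {                                 /* collect groups, low one first */
        groups[n++] = (unsigned char)(value & 0x7F);
        value >>= 7;
    } while (value != 0);
    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++) {     /* emit them high group first */
        unsigned char b = groups[n - 1 - i];
        if (i + 1 < n)
            b |= 0x80;                   /* continuation bit */
        buf[i] = b;
    }
    return n;
}
```

Under this encoding a length of 127 is the single byte 0x7F, and 128 becomes the two bytes 0x81 0x00.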

callesgg · 12y ago

The speedup would be well worth it: you could tell the mmc to move 100 bytes from position 20000 to position 40000, instead of asking for 8 bytes from position 20000, checking every one of those bytes, then asking the mmc again for 8 bytes from position 20008, and so on.

If I got to choose in this day and age, I would define a C string/array as a pointer to a long int, with our data following the long int.

crashandburn4 · 12y ago · 3 in thread

This page won't load for me and neither will Google's web-cached version [1]. Does anyone have a version of this that I can see?

[1] http://webcache.googleusercontent.com/search?q=cache:http://...

HillRat · 12y ago

This (http://cacm.acm.org/magazines/2011/9/122797-the-most-expensi...) should work.

crashandburn4 · 12y ago

Thanks

ojbyrne · 12y ago

The text-only google cache version worked for me: http://webcache.googleusercontent.com/search?q=cache:http://...

radiospiel · 12y ago · 2 in thread

Well, strings without an explicit length field allow for things like strstr(3) or prefix parsing without performance penalties due to reallocating memory.

kevingadd · 12y ago

Blatantly incorrect. Try thinking over how you might implement those two operations on a string with a length field and see if you can figure out why your statement is wrong.

RogerL · 12y ago

That came across as very condescending to me. Could you tell us how you would implement this?

I for one do not see how to implement it.
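
One possible answer, sketched under an assumption the thread does not settle: the length is kept next to the pointer (a non-owning "view") rather than stored in front of the bytes. With that layout, a suffix or prefix is just a new pointer/length pair, so nothing is reallocated. With an in-band length prefix, taking a suffix in place would not work without a separate view type like this. All names here (`strview`, `sv_from_cstr`, `sv_skip`, `sv_has_prefix`, `sv_find`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Non-owning view: pointer and length held side by side. */
struct strview {
    const char *ptr;
    size_t      len;
};

/* Build a view over an ordinary C string (one strlen, no copy). */
static struct strview sv_from_cstr(const char *z)
{
    assert(z != NULL);
    return (struct strview){ z, strlen(z) };
}

/* Drop the first n bytes. No copy: the view simply starts later. */
static struct strview sv_skip(struct strview s, size_t n)
{
    assert(n <= s.len);
    return (struct strview){ s.ptr + n, s.len - n };
}

/* Prefix test by comparing exactly prefix.len bytes. */
static int sv_has_prefix(struct strview s, struct strview prefix)
{
    if (prefix.len > s.len)
        return 0;
    return prefix.len == 0 || memcmp(s.ptr, prefix.ptr, prefix.len) == 0;
}

/* Naive substring search, analogous to strstr(3). Returns the suffix of
 * hay that starts at the first match, or a view with ptr == NULL. */
static struct strview sv_find(struct strview hay, struct strview needle)
{
    if (needle.len == 0)
        return hay;
    for (size_t i = 0; i + needle.len <= hay.len; i++) {
        if (memcmp(hay.ptr + i, needle.ptr, needle.len) == 0)
            return sv_skip(hay, i);
    }
    return (struct strview){ NULL, 0 };
}
```

For example, searching the view over "Content-Length: 42" for ": " yields a view starting at offset 14 with length 4, and skipping 2 more bytes leaves "42", all pointing into the original buffer.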

gcb0 · 12y ago · 1 in thread

Oh the irony of history.

In the week that str+len was abused left and right, someone surfaces an article on the front page about how str+NUL is wrong and everyone should use str+len.

quotient · 12y ago

Now that you point it out, that's actually rather amusing.

orvado · 12y ago · 1 in thread

Does anyone understand what the author meant by the following statement:

${lang} is the language of the future

This looks like a macro for substitution, but maybe it's some hip new term I've never encountered. An actual language or just a placeholder for a language that hasn't been chosen yet?

noobiemcfoob · 12y ago

I think he means to imply that whoever is making the statement would substitute ${lang} with their language of choice as the successor of C.

TomMasz · 12y ago

A lot of programming decisions were made to save a byte here and there. It's easy to point at them today and say they're "bad", but at the time they were the absolutely correct thing to do. It's hard to imagine now but not saving that byte could mean your program wouldn't fit into RAM. Try telling your management in the 1960s that your program won't load because it's "properly coded" and see how far you get.

What we've failed to do is ever revisit those decisions and change them where we've identified problems. Yes, you can probably compile (with warnings) files from UNIX v7, but we pay for that compatibility. But there's no question designing, building and maintaining a libc alternative is a colossal undertaking and not likely to happen on a whim. So here we are.

gumby · 12y ago

When I was at PARC the Mesa guys (who had counted strings) did some analysis and (at least in those days) the counted strings ended up being, in aggregate, faster. I suspect the advantage would be even greater these days since memory allocation was a bigger deal back then.

I wonder if you could do this compatibly in the compiler by adding another primitive type (counted string) which had the length in the bytes before the start of the null-terminated string. You'd need a new type because various routines in the standard library would have to invisibly have two versions for counted and non-counted strings (since if you incremented a string pointer, or used a function like strchr, you'd have to treat it as a regular char *). "Safe" code would use a different call (say, cstrchr) that returned an index instead of a char *. The compiler could optionally warn on unsafe "legacy" calls as it can with strcpy instead of strncpy.
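
A library-level sketch of that layout, without any compiler support: the length sits in a header just before the bytes, and callers hold an ordinary `char *` to NUL-terminated data. The names `lenhdr`, `lstr_new`, `lstr_len` and `lstr_free` are hypothetical, and `lstr_len` is only valid on the exact pointer returned by `lstr_new`, which is the pointer-increment problem the comment mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Memory layout: [size_t len][bytes ...]['\0'] */
struct lenhdr {
    size_t len;
    char   bytes[];
};

/* Create a counted, NUL-terminated copy of n bytes from src. The result
 * can be passed to any function expecting a C string. */
static char *lstr_new(const char *src, size_t n)
{
    assert(src != NULL || n == 0);
    if (n > SIZE_MAX - sizeof(struct lenhdr) - 1)
        return NULL;
    struct lenhdr *h = malloc(sizeof *h + n + 1);
    if (h == NULL)
        return NULL;
    h->len = n;
    if (n > 0)
        memcpy(h->bytes, src, n);
    h->bytes[n] = '\0';
    return h->bytes;
}

/* O(1) length: step back from the bytes to the hidden header.
 * Only valid for the unmodified pointer returned by lstr_new. */
static size_t lstr_len(const char *s)
{
    assert(s != NULL);
    const struct lenhdr *h =
        (const void *)(s - offsetof(struct lenhdr, bytes));
    return h->len;
}

/* Release a string created by lstr_new. */
static void lstr_free(char *s)
{
    if (s != NULL)
        free(s - offsetof(struct lenhdr, bytes));
}
```

A string created from the three bytes 'a', '\0', 'b' shows the difference between the two views of the same memory: `lstr_len` reports 3 while `strlen` stops at 1.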

cliveowen · 12y ago

It's all true, but then again, everything would be better if we started from scratch today. Compromises made to tip-toe around technology limitations are what add complexity to most of today's software, but even tomorrow's software will be influenced by today's limitations. It's best not to dwell on the past.

bananas · 12y ago

Yeah because strings with a length prefix/field are just as secure!

   200,"STR"

We know where that got us...

Programming 101, rules 1&2:

1 - never trust your inputs

2 - always check your invariants.
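
Applied to the `200,"STR"` case, the two rules amount to checking the claimed length against what actually arrived before copying anything. A minimal sketch, with a hypothetical function name and record layout:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a length-prefixed payload out of a received record.
 *   claimed : length field taken from the untrusted input
 *   avail   : payload bytes that were actually received
 *   dst_cap : capacity of the destination buffer
 * Returns 0 on success, -1 if the claim is inconsistent. */
static int copy_payload(char *dst, size_t dst_cap,
                        const char *payload, size_t avail, size_t claimed)
{
    assert(dst != NULL && payload != NULL);
    if (claimed > avail)       /* sender claims more than it sent */
        return -1;
    if (claimed > dst_cap)     /* would overflow the destination */
        return -1;
    memcpy(dst, payload, claimed);
    return 0;
}
```

A record claiming 200 bytes but carrying only the 3 bytes of "STR" is rejected instead of reading 197 bytes of whatever follows in memory.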

ithinkso · 12y ago

With NUL-terminated strings, serialization was also simpler. If str+len were the standard, we would now have 13 more serialization standards.
