It was a limitation, because they chose a byte length (to save space). So strings up to 255 characters only. It was decades before folks were comfortable with 32-bit length fields. And that still limited you to 4GB strings. In the bad old days, memory usage was king.
Such a system would effectively remove that feature. Yes, you could disable range checks when indexing into a string, but you still would have to figure out how many length bytes there are. That would only be a little bit faster than a full range check.
Because of that, I don’t see how that would have been useful at the time.
In hindsight, I think the complexity is worth the safety, but I could see why it felt more elegant to use null-terminated strings at the time.
Human concepts are inherently messy. "Elegant" solutions just shove the mess down the road.
The problem is the null termination, which is not general to arrays (though it is sometimes used with arrays of pointers).
Sure, 16 exabytes sounds like a lot today, but so did 4 billion IP addresses. Differently bad is not better.
The NUL terminator is always at least 1 byte, so at best you save sizeof(size_t) - 1 bytes per string — ignoring clever encodings like an LEB128 varint length.
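For reference, a varint scheme like LEB128 spends bytes proportional to the magnitude of the length rather than a fixed size_t; here is a minimal sketch in C (the function name is mine, not from any library):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode value as unsigned LEB128: 7 payload bits per byte, with the
   high bit set on every byte except the last. Returns bytes written. */
size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value)
            byte |= 0x80;   /* more bytes follow */
        out[n++] = byte;
    } while (value);
    return n;
}
```

Lengths under 128 cost a single byte — the same overhead as a NUL terminator — while still allowing arbitrarily long strings.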
This is a classic case of "simple is actually complex". How many billions of dollars have null-terminated strings cost? Hope the 3 bytes of overhead saved per string was worth it.
No matter how you slice it, null termination was a mistake.
void cat_pascal_strings(pascalstr *uninited_memory,
                        pascalstr *left,
                        pascalstr *right);
How big is uninited_memory? Can left and right fit into it? You need to design language constructs around Pascal strings to make them actually safe. Such as, oh, make it impossible to have an uninitialized such object. The object has to know both its allocation size and the actual size of the string stored in it.
What is unsafe is constructing new objects in an anonymous block of memory that knows nothing about its size.
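A minimal sketch of such a counted-string object in C — reusing the thread's pascalstr name, though the layout and helpers here are my own invention: the capacity travels with the object, so concatenation can refuse to overflow instead of trusting the caller.

```c
#include <stdlib.h>
#include <string.h>

/* A string that knows both its allocation size and its current length. */
typedef struct {
    size_t cap;    /* bytes available in data[] */
    size_t len;    /* bytes currently in use */
    char   data[]; /* flexible array member */
} pascalstr;

/* The only way to get one: never uninitialized, cap always correct. */
pascalstr *pstr_alloc(size_t cap)
{
    pascalstr *s = malloc(sizeof *s + cap);
    if (s) {
        s->cap = cap;
        s->len = 0;
    }
    return s;
}

/* Concatenate left and right into dest; fails instead of overflowing. */
int pstr_cat(pascalstr *dest, const pascalstr *left, const pascalstr *right)
{
    if (right->len > dest->cap || left->len > dest->cap - right->len)
        return -1;
    memcpy(dest->data, left->data, left->len);
    memcpy(dest->data + left->len, right->data, right->len);
    dest->len = left->len + right->len;
    return 0;
}
```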
C programs run aground there not just with strings!
    struct foo *ptr = malloc(sizeof ptr); // should be sizeof *ptr!!
    if (ptr) {
        ptr->name = name;
        ptr->frobosity = fr;
        ...
    }

Oops! The wrong size: that allocated only the size of a pointer — 4 or 8 bytes, typically, nowadays — but the structure is 48 bytes wide. "struct foo" itself isn't inferior to a Pascal RECORD; the problem comes from the wild and loose allocation side of things.
Working with strings in Pascal is relatively safe, but painfully limiting. It's a dead end. You can't build anything on top of it. Can you imagine trying to make a runtime for a high-level language in Pascal? You need to be in the driver's seat regarding how strings work.
You mean like the strings in Delphi? Yeah, I can, since I use them daily. Strings in Delphi nowadays are actually more like classes in Java than old Pascal strings. Then, depending on your intent, they end up as either arrays or old-style strings after the linker goes over your code. Best of both worlds — and on top of that, if you really want, you can definitely shoot yourself in the foot with unsafe operations. So in the end it's the best of both worlds and the worst of a third, though for that third one you really have to go out of your way to make it as bad as C strings are.
I doubt string representation is really the blocker here, since C strings are now pretty much only used by some, but not all, C programmers. QString, GString, C++ std::string, Rust strings, Go strings, Java strings, and so on are not null-terminated.
Better yet, how about Modula-2? I can't help but think that the programming language landscape would be much better if that language occupied the niche that C does today.
This is why whenever I use sizeof, I pass a type, not a variable.
Like I get why it happened. It is just crazy how long it has stuck around.
Strings as implemented in e.g. Borland Pascal were better. But then, the length-prefixed implementation had its own downsides. For example, it had to decide how many bits to use for the length. 16-bit Pascal would generally use a single byte, and in BP at least, you could even access it as a character via S[0]. Thus, strings were limited to 255 characters (256 bytes including the length byte) — and because this was baked into the ABI, it wasn't something that could be easily changed later.
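The BP layout described above can be modeled in C. This sketch is my own modeling, not actual Borland headers, but it shows why the limit is baked in: the count lives in a single byte at index 0.

```c
#include <string.h>

/* Borland-style ShortString: byte 0 is the length, bytes 1..255 hold
   the characters, so 255 characters is a hard ceiling of the layout. */
typedef unsigned char ShortString[256];

static void ss_assign(ShortString s, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        n = 255;                 /* truncate: the count byte can't go higher */
    s[0] = (unsigned char)n;     /* the S[0] access the comment mentions */
    memcpy(s + 1, src, n);
}
```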
Hence when Delphi decided to fix it, they basically had to introduce a whole new string type, leaving the old one as is. And then they added a bunch of compiler switches so that "string" could be an alias for the new type or the old, as needed in that particular code file.
> None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.
[…]
> C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well.
* https://www.bell-labs.com/usr/dmr/www/chist.html
He mentions Algol 68 and Pascal [Jensen 74].
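The cost Ritchie concedes — "search for the end of a string" — is visible in the classic strlen loop: linear scan to find a length that a prefixed representation would return in constant time. A minimal version:

```c
#include <stddef.h>

/* Finding the length of a terminated string means scanning every
   byte until the NUL marker: O(n) in the string's length. */
size_t terminated_strlen(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```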
I personally don't think that the qualitative pros/cons of the chosen approach or alternatives that we're discussing today, 30-ish years later, would be all that new to the designers of C in 1993. The difference is that we've had 30-ish years to watch those decisions play out over millions of lines of code in software running at scales and levels of complexity that programmers in 1993 could only dream of.
Also, software security was barely an issue in 1993. Today, it's a massive issue.
That was him reflecting on things in 1993, but the C team designed things in ~1970. That was basically the Stone or Iron Age of computing.
(OK, it's hard to compare; Code Complete and other much later stuff might be just as good. Too many decades between when I read them to say for sure.)