There are then questions about the length prefix, with a simple solution: make this a platform-specific detail and use the machine word. 16-bit platforms get strings of length ~2^16, 32 b platforms get 2^32 (which is a 4 GB-long string, which is more than 1000× as long as the entire Lord of the Rings trilogy), 64 b platforms get 2^64 (which is ~10^19).
Edit: I think a lot of commenters are focusing on the 'Pascalness' of Pascal strings, which I was using as an umbrella terminology for length-prefixed strings.
This is a special-case optimisation that I'm happy to lose in favour of the massive performance and security benefits otherwise.
Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.
I think what was unsaid in your second point is that we really need to type-differentiate constant strings, dynamic strings, and string 'views', which Rust does in-language, and C++ does with the standard library. I prefer Rust's approach.
C++ strings had no choice but to copy to underlying string because of this unknown ownership and then added more ownership issues by letting you call the naked pointer within to pass it to C functions. In fact, that's an issue with pretty much every C++ container, including the smart pointers: you can just call get() an break out of the lifecycle management in unpredictable ways.
string_view came much later onto the scene and doesn't have ownership so you avoid a sometimes unnecessary copy but honestly it just makes things more complex.
I honestly think that as long as we continue to use C/C++ for crucial software and operating systems, we'll be dealing with buffer overflow CVEs until the end of time.
Plus, C-style strings allow a lot of optimizations - if you have a mutable buffer with data, you can make a string out of them with zero copy and zero allocations. strtok(3) is an example of such approach, but I've implemented plenty of similar parsers back in the day. INI, CSV, JSON, XML - query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.
Compared to this, Pascal strings would be incredibly painful to use... So you query file size, allocate, read it, and then what? 1-byte length is too short, and for 2+ byte length, you need a secondary buffer to copy string to. And how big should this buffer be? Are you going to be dynamically resizing it or wasting some space?
And sure, _today_ I no longer write code like that, I don't mind dropping std::string into my code, it'd just a meg or so of libraries and 3x overhead for short strings - but that's nothing those days. But back when those conventions were established, it was really really important.
We're just going to ignore Amigas, and any Unix workstations?
I have also done this, but I would argue that, even at the time, the design was very poor. A much much better solution would have been wise pointers — pass around the length of the string separately from the pointer, much like string_view or Rust’s &str. Then you could skip the NULL-writing part.
Maybe C strings made sense on even older machines which had severely limited registers —- if you have an accumulator and one resister usable as a pointer, you want to minimize the number of variables involved in a computation.
This is a red herring, because when you actually read the strings out, you still need to iterate through the length for each string—zero copy, zero allocation, but linear complexity.
> query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.
I write parsers in a very different way—I keep the file buffer around as read-only until the end of the pipeline, prepare string views into the buffer, and pipe those along to the next step.
From strtok man page... "The first time that strtok() is called, str should be specified; subsequent calls, wishing to obtain further tokens from the same string, should pass a null pointer instead."
Really?? a null pointer.. This is valid code:
char str[] = "C is fucking weird, ok? I said it, sue me.";
char *result = strtok(str, ",");
char *res = strtok(NULL, ",");
Why is that ok?[1]: https://learn.microsoft.com/en-gb/windows/apps/design/global...
It's possible to define your own string_view workalike that has a c_str() and binds to whatever is stringlike can has a c_str. It's a few hundred lines of code. You don't have to live with the double indirection.
Today Bjarne will insist that the correct understanding of ownership and lifetimes was always inherent in his language and if you point out that it basically only warrants a brief mention in his early books about the language he'll say that the newer books have much more about this, as if that's not a confession...
Because &str (the non-owning reference to a slice of UTF-8 encoded text) in Rust is so ubiquitous, it's completely reasonable in Rust to use anything from the simple standard library owning String type, which is literally a growable array, Vec<u8> plus the promise about UTF-8 encoding, through to text which lives only on the stack as a re-interpreted array, or at the other extreme a raw pointer-sized short-string optimisation https://crates.io/crates/cold-string -- where unless it fits inline the string is length-prefixed on the heap like Pascal. All these approaches fit different niches and since a mere reference to the string is the same for all of them they're compatible. In C++ you will run into too many problems and regret trying.
Edited: Correct that String lives in the stdlib -- it isn't baked in to the core language because your tiny embedded system may not have an allocator to go around creating fancy growable arrays, but it still might want the non-owning string slice, &str which truly is in the core language.