Maybe, but you'll do so in an understandable way that you can catch in testing. You won't suddenly start fetching different wrong data when someone builds your program with a newer version of the compiler, which is a very real risk with C. To say nothing of the risk of arbitrary code execution if your program is operating on attacker-supplied data.
> The irony, of couse, is that the characters that can break your terminal session are perfectly valid UTF-8.
Terminals that can't handle the full UTF-8 range are a problem with those terminals IMO. And terminals implemented in Rust probably don't have that problem :).
No, it isn't, and yes, they would. The problem is that the terminal accepts certain valid UTF-8 characters (typically from the ASCII subset) as output control characters. This is how you get things like programs that can output colored text.
This is a part of the fundamental design of how a terminal device is supposed to work: its input is defined to be a single stream of characters, and certain characters (or sequences of characters) represent control sequences that change the way other characters are output. The problem here is with the design of POSIX in general and Linux in particular - the fact that, despite knowing most interaction will be done through a terminal device with no separate control and data channels, they chose to allow control characters as part of file names.
As a result of this, it is, by design, impossible to write a program that can print out any legal file name to a terminal without risking to put the terminal in a display state that the user doesn't expect. Best you could do is recognize terminal control sequences in file names, recognize if your output device is a terminal, and in those cases print out escaped versions of those character sequences.
The terminal should not allow such a sequence to break it. Yes, being able to output colour is desirable, but it shouldn't come at the cost of breaking, and doesn't need to. (Whereas it is much less unreasonable for a terminal to break when it's sent invalid UTF-8).
> This is a part of the fundamental design of how a terminal device is supposed to work: its input is defined to be a single stream of characters, and certain characters (or sequences of characters) represent control sequences that change the way other characters are output.
"Design" and "supposed to" are overstating things. It's a behaviour that some terminals and some programs have accreted.
> it is, by design, impossible to write a program that can print out any legal file name to a terminal without risking to put the terminal in a display state that the user doesn't expect
I would not say by design, and I maintain that the terminal should handle it.
The only problem is that it's not what the user wanted to happen. For a simple example, if a file name contains the control sequence for starting a block of red text, and you print that file name as is in a terminal, you'll, (1), see a truncated file name (that is, copying the text from terminal will not give you the actual file name, since the control characters will be entirely missing), and (2) all future text will be red.
The terminal has done nothing wrong in this case: it used its normal logic for turning text red. The file name is not in any way wrong - it's a completely valid Linux and ext4 file name. The program is not necessarily doing anything wrong - perhaps it was never designed to print to a terminal. But the overall interaction produces the wrong results.
I'm aware of the details, but I think sometimes that knowledge leads people to miss the forest for the trees. If the user perceives the terminal as having "broken", that's a case of poor UX design at a minimum. Given that users can readily distinguish between legitimate coloured output etc. and terminals getting into a poor state, it really shouldn't be too hard for the terminal itself to do so. (E.g. it's pretty normal for today's terminals to display some kind of visible warning (complete with resume button) when you press Ctrl-S, rather than simply silently stopping). And while this is a much fuzzier and more contentious claim, I think the Rust community's mentality (as seen in e.g. their approach to compiler errors) nudges people towards such approaches.
There's also the question of what happens if the data structures on disk become corrupted. The filesystem driver might or might not validate the "string" it reads back before returning it to you.
If Linux had chosen to standardize file names at the syscall level to a safe subset of UF-8 (or even ASCII), FS writers would never have even seen file names that contain invalid sequences, and we would have been spared from a whole set terminal issues. Of course, UTF-8 was nowhere close to as universally adopted as it is today at the time Linux was developed, so it's just as likely they might have standardized to some subset of UTF-16 or US-ASCII and we might have had other kinds of problems, so it's arguable they took the right decision for that time.
The control characters are themselves valid (but unprintable) UTF-8. They are also, against all common sense and reason, permitted within filenames by many filesystems. Rust won't save you here. https://www.compart.com/en/unicode/category/Cc
What guarantees it? Literally nothing. You can catch errors in testing in C as well. Yeah, in C you get "different" data on a different version of the compiler, but you get garbage data in all versions of the compiler and Valgrind flags that in testing.
And, of course, you can get arbitrary code execution if your program is operating on data coming from multiple users of different privilege levels in Rust if you use vectors like that.
Sure, Rust fixes a lot of easy to commit bugs in C and C++, absolutely. But there are no absolutes.
The behaviour of looking up index x in array y in rust is well-defined, and consistent between compiler versions. Maybe you use the wrong index and get the wrong customer's data or something, but you'll still get the data at that index (or a panic if the index is invalid).
> You can catch errors in testing in C as well. Yeah, in C you get "different" data on a different version of the compiler, but you get garbage data in all versions of the compiler and Valgrind flags that in testing.
Not always. It's very common for code to be broken according to the standard but do the right thing in some compilers, and then in a different compiler it does something completely bizarre. The code might not do the same thing in testing as it does in release. And while Valgrind does a lot of good for the minority of C programmers who use it, it's far from 100% reliable.
> you can get arbitrary code execution if your program is operating on data coming from multiple users of different privilege levels in Rust if you use vectors like that.
You might in some cases, but it's a lot tricker. You can get the program to operate on different parts of its data than it was intended to, but to go from there to running arbitrary code will still be a significant leap and require an exploitation technique specific to that particular program. Whereas the techniques for going from common classes of C vulnerabilities (e.g. buffer overflow or use after free) to arbitrary code execution are practically textbook at this point.