> Most Unix syscalls use C-style strings, which are a string of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they often also reserved a byte value of zero for the same purpose
That's completely wrong. If a syscall (or a function) expects text in encoding A, you should not send it text in encoding B: it will be interpreted incorrectly or, even worse, become a security vulnerability.
For every function, the expected encoding should be specified the same way the argument types, constraints, and ownership rules are. Sadly, many open source libraries don't do this. How are you supposed to call a function when you don't know what encoding it expects?
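As a sketch of what such a contract could look like in a header (the function name is hypothetical, the point is that the encoding is stated alongside ownership):

```c
/*
 * set_window_title(title)
 *   title: zero-terminated, valid UTF-8; the function copies the
 *          bytes, so the caller retains ownership of the buffer.
 *   Returns 0 on success, -1 if `title` is not valid UTF-8.
 */
int set_window_title(const char *title);
```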
Also, it is better to pass a pointer and a length rather than scan an unbounded region of memory for a zero byte.
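A minimal sketch of the difference (the struct name is illustrative):

```c
#include <stddef.h>

/* A pointer plus an explicit length: no terminator scan needed,
 * and embedded zero bytes are allowed. */
struct str_slice {
    const char *data;  /* not necessarily zero-terminated */
    size_t      len;   /* number of bytes, known up front */
};

/* O(1): the caller already knows the length. */
size_t slice_len(struct str_slice s) { return s.len; }

/* O(n): must walk memory until a zero byte is found, and reads
 * out of bounds if the terminator is missing. */
size_t cstr_len(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}
```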
> and the result is that if you want UTF-16 support in your existing C-string-based syscalls
There is no need to support multiple encodings; it only complicates things. The simplest solution would be to standardize on UTF-8 for all kernel facilities.
For example, it would be better if the open() syscall required a valid UTF-8 string for the file name. That would leave no possibility of file names being displayed as question marks.
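A minimal user-space sketch of such a contract (open_utf8 is hypothetical, and the validator is simplified: a real one must also reject overlong encodings and surrogate code points):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Simplified structural UTF-8 check: verifies lead bytes and
 * continuation bytes only. */
static bool is_valid_utf8(const char *s) {
    const uint8_t *p = (const uint8_t *)s;
    while (*p) {
        int extra;
        if (*p < 0x80)                extra = 0;  /* ASCII */
        else if ((*p & 0xE0) == 0xC0) extra = 1;  /* 2-byte sequence */
        else if ((*p & 0xF0) == 0xE0) extra = 2;  /* 3-byte sequence */
        else if ((*p & 0xF8) == 0xF0) extra = 3;  /* 4-byte sequence */
        else return false;                        /* invalid lead byte */
        p++;
        for (int i = 0; i < extra; i++, p++)
            if ((*p & 0xC0) != 0x80)              /* bad or missing */
                return false;                     /* continuation byte */
    }
    return true;
}

/* Reject non-UTF-8 file names before handing them to the real open(). */
int open_utf8(const char *path, int flags, mode_t mode) {
    if (!is_valid_utf8(path)) {
        errno = EILSEQ;  /* "illegal byte sequence" */
        return -1;
    }
    return open(path, flags, mode);
}
```

With a rule like this enforced at the boundary, every file name the rest of the system sees is displayable by construction.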