
dataflow · 6y ago

Take out "utf8" and I'll agree ;)

The fundamental problem as I see it is that "string" is a grossly leaky and misunderstood abstraction. The string type is not the same thing as a "text" type. It's being used in all the wrong places for that purpose. People treat "string" like it means "text", but in so many places where we deal with them, they just aren't (and should never be) text. Everything from stdio to argv to file paths to environment variables to "text" files to basically any interface with the outside world needs to be dealt with in bytes rather than text if you care about actually producing correct code that doesn't lose, corrupt, or otherwise choke on data.

C++ understood this and got it right, preferring to focus on optimizing rather than constraining the string type. Many other languages did pretty well by avoiding enforcing encodings on strings, too. And Python 2 defaulted to bytes as well, and only really cared about encoding/decoding at I/O boundaries where it thought it could assume it was dealing with text (though it sometimes didn't behave well there, and yes it got painful as a result). Then Python 3 came along and just made everyone start treating most data as if it were inherently (Unicode) text by default, when it really had no such constraints to begin with.

It boggles my mind that Python 3 folks like to beat the drum on how Python 3 got the bytes/unicode right without taking a single moment to even notice that most strings people deal with aren't (and never were!) actually guaranteed to be in a specific, known textual encoding a priori. They were just arrays of code units with few restrictions on them, and if you want to write correct code, you're going to have to deal with bytes by default (or something else with similar flexibility) instead of text. It would've been totally fine to introduce a text type, but it fundamentally can't take the place of a blob type, which is the language of the outside world.
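To make the "bytes are the language of the outside world" point concrete, here is a minimal Python 3 sketch of a file name that is not valid UTF-8. The name `b"caf\xe9.txt"` and the helper names `to_display` and `to_raw` are made up for illustration. The `surrogateescape` error handler is the mechanism Python 3 itself uses to smuggle such bytes through `str`; on POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler with the filesystem encoding.

```python
def to_display(raw_name: bytes) -> str:
    """Decode a byte name to str, keeping undecodable bytes as lone surrogates."""
    return raw_name.decode("utf-8", errors="surrogateescape")


def to_raw(name: str) -> bytes:
    """Invert to_display: turn the escaped surrogates back into the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


# Hypothetical Latin-1 encoded file name; 0xE9 is not a valid UTF-8 sequence here.
raw_name = b"caf\xe9.txt"

# A strict decode refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The escaped form survives a round trip without losing the original byte.
name = to_display(raw_name)
round_tripped = to_raw(name)
```

The escaped `str` is not really "text": it contains a lone surrogate and cannot be encoded as strict UTF-8 (printing or logging it may fail), which is arguably the leak in the abstraction being described.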


int_19h · 6y ago

"The outside world", by and large, also speaks Unicode.

Java uses UTF-16 throughout, including file paths. So does .NET. All Apple platforms are UTF-16. C++ - if you just look at stdlib, sure, it's byte-centric; but then look at popular frameworks such as Qt.

In practice, this means that, yeah, you can have that odd filename that is technically not Unicode. But the vast majority of code running on the most popular desktop and mobile platforms is going to handle it in a way that expects it to be Unicode. Why should Python go against the trend, and make life more complicated for developers using it in the process?

mehrdadn · 6y ago

File names? I listed so much more for you than file names.

That HTML you just fetched? How do you know it's Unicode?

That .txt file the user just asked to load? How do you know that's Unicode?

For heaven's sake, when can you actually guarantee that even sys.stdin.read() is going to read Unicode? You can only do that when you're the one piping your own stdin... which is not the common case.

What do you do when your fundamentally invalid assumptions break? Do you just not care and simply present a stack trace to the user and tell them to get lost?

I've gotten tired of these debates though, so just a heads up I may not have the energy to reply if you continue...

gvjddbnvdrbv · 6y ago

In the real world, Python 2 gave stack traces by default when presented with common strings. Python 3 doesn't.

imtringued · 6y ago

>That HTML you just fetched? How do you know it's Unicode?

Headers contain information about the charset. If the charset isn't specified then only god knows which encoding was used. This applies to all encodings. If they aren't specified you can't interpret them.
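A rough sketch of that rule in Python 3, using only the standard library: decode the body when the `Content-Type` header names a charset, and otherwise leave it as bytes. The function names `charset_from_content_type` and `decode_body` are invented for this example; `email.message.Message.get_content_charset()` is the stdlib call doing the header parsing. It assumes the declared charset is one Python knows (an unknown name would raise `LookupError`) and that the header is telling the truth.

```python
from email.message import Message
from typing import Optional, Union


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str) -> Union[str, bytes]:
    """Decode only when the header declares a charset; otherwise keep raw bytes."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        return body  # no declared encoding: do not guess
    return body.decode(charset)


declared = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
undeclared = decode_body(b"caf\xe9", "text/html")
```

Returning bytes in the undeclared case is one possible design choice; real HTML handling also has to consider `<meta>` tags and byte order marks, which this sketch ignores.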

>That .txt file the user just asked to load? How do you know that's Unicode?

If you don't know which encoding was used then you simply cannot interpret the file as a string. If the encoding isn't specified you can't interpret the file.

>For heaven's sake, when can you actually guarantee that even sys.stdin.read() is going to read Unicode?

Again, if the encoding isn't specified then all bets are off. This is an inherent problem with Unix pipes. Text isn't any different than, say, a protobuf message. You have to know how to interpret it; otherwise it's just a raw byte array without any meaning.
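For the stdin case, Python 3 exposes both views: `sys.stdin` decodes with whatever encoding the environment implies, while `sys.stdin.buffer` hands back the raw bytes. The sketch below is only an illustration; `read_raw` and `read_text` are hypothetical helpers, and `io.BytesIO` stands in for a real pipe so the example can run anywhere.

```python
import io
import sys
from typing import BinaryIO, Optional


def read_raw(stream: Optional[BinaryIO] = None) -> bytes:
    """Read everything from a binary stream without decoding it."""
    if stream is None:
        stream = sys.stdin.buffer  # byte-level view of stdin
    return stream.read()


def read_text(stream: BinaryIO, encoding: str) -> str:
    """Decode a binary stream with an encoding the caller explicitly supplies."""
    wrapper = io.TextIOWrapper(stream, encoding=encoding, errors="strict")
    return wrapper.read()


# Latin-1 bytes that are not valid UTF-8, standing in for piped input.
payload = b"caf\xe9\n"

raw = read_raw(io.BytesIO(payload))
latin1_text = read_text(io.BytesIO(payload), "latin-1")

# Assuming UTF-8 for the same bytes fails loudly instead of corrupting data.
try:
    read_text(io.BytesIO(payload), "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

In a real program, calling `read_raw()` with no argument reads from `sys.stdin.buffer`, and the caller decides later whether and how to decode.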

>What do you do when your fundamentally invalid assumptions break? Do you just not care and simply present a stack trace to the user and tell them to get lost?

I don't understand you at all. Just load it as a byte array if you don't care about the encoding. If you do care about the encoding then tough luck. You're never going to understand the meaning of that text unless it is an agreed upon encoding like UTF-8 and in that case the assumptions of always choosing UTF-8 are part of the value proposition.

Let me tell you why reading a text file as a byte array and pretending that character encodings don't exist is a bad idea. There are lots of Asian character encodings that don't even contain the Latin alphabet. Now imagine you are running source.replace("Donut", "Bagel"). What meaning does running this function have on a byte array? It doesn't have any.

That operation simply cannot be implemented at all if you don't know the encoding. So if you were to choose the Python 2 way then you would have to either remove all string operations from the language or force the user to specify the encoding on every operation.

A string literal like "Donut" isn't just a string literal. It has a representation and you first have to convert the logical string into a byte array that matches the representation of the source string. Let's say your Python program is loading UTF-16 text. Instead of simply specifying the encoding you just load the text without any encoding. If you wanted to run the replace operation then it would have to look something like this: source.replace("Donut".getBytes("UTF-16"), "Bagel".getBytes("UTF-16")). This is because you need to convert all string literals to match the encoding of the text that you want to replace.
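The `getBytes` call above is Java-style; in Python 3 the equivalent is `str.encode`. A small sketch of both routes follows, assuming the source text is UTF-16 little-endian without a byte order mark (the sample sentence is made up). Note that Python's plain `"utf-16"` codec prepends a BOM when encoding, so the literals are encoded with `"utf-16-le"` to match the data.

```python
# Sample input, assumed to be UTF-16-LE without a BOM.
source_bytes = "One Donut, please".encode("utf-16-le")

# Naive byte-level replace with ASCII literals finds nothing, because each
# character in the UTF-16 data is followed by a zero byte.
naive = source_bytes.replace(b"Donut", b"Bagel")

# Byte-level replace works only once the literals are encoded the same way.
# (A byte search could in principle match at an odd offset; ignored here.)
old = "Donut".encode("utf-16-le")
new = "Bagel".encode("utf-16-le")
replaced_bytes = source_bytes.replace(old, new)

# The text route: decode once at the boundary, then use plain str operations.
replaced_text = source_bytes.decode("utf-16-le").replace("Donut", "Bagel")

# The generic "utf-16" codec adds a BOM, so its output would not match
# in the middle of the data.
with_bom = "Donut".encode("utf-16")
```

Both routes give the same result here; the difference is whether the encoding is named once when decoding or on every literal.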

Well, doesn't this cause a pretty huge problem? You now need to have a special type just for string literals because the runtime string type can use any encoding and therefore isn't guaranteed to be able to represent the logical value of a literal. Isn't that extremely weird?


scoot_718 · 6y ago

Many of us deal in bytes that simply aren't UTF8 and never could be. Because they're just bytes.

How many things are stored as binary files?

> All Apple platforms are UTF-16.

I'm glad all their executable files are apparently text files. How amazing.

> Why should Python go against the trend, and make life more complicated for developers using it in the process?

You tell me why Python3 did that.
