
imtringued 6y ago

>That HTML you just fetched? How do you know it's Unicode?

Headers contain information about the charset. If the charset isn't specified, then only God knows which encoding was used. This applies to all encodings: if they aren't specified, you can't interpret them.
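To make the header case concrete, here is a minimal sketch of pulling the charset out of a Content-Type value with the standard library. The function name `charset_from_content_type` is made up for illustration, and the sample header values are assumptions, not anything from a real response.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the declared charset (lowercased), or None if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() parses the "charset" parameter of Content-Type.
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")
```

When the result is None, you are back to guessing, which is the point above.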

>That .txt file the user just asked to load? How do you know that's Unicode?

If you don't know which encoding was used, then you simply cannot interpret the file as a string. If the encoding isn't specified, you can't interpret the file.

>For heaven's sake, when can you actually guarantee that even sys.stdin.read() is going to read Unicode?

Again, if the encoding isn't specified, then all bets are off. This is an inherent problem with Unix pipes. Text isn't any different from, say, a protobuf message. You have to know how to interpret it; otherwise it's just a raw byte array without any meaning.

>What do you do when your fundamentally invalid assumptions break? Do you just not care and simply present a stack trace to the user and tell them to get lost?

I don't understand you at all. Just load it as a byte array if you don't care about the encoding. If you do care about the encoding, then tough luck. You're never going to understand the meaning of that text unless it is in an agreed-upon encoding like UTF-8, and in that case the assumption that it is always UTF-8 is part of the value proposition.

Let me tell you why reading a text file as a byte array and pretending that character encodings don't exist is a bad idea. There are lots of Asian character encodings that don't even contain the Latin alphabet. Now imagine you are running source.replace("Donut", "Bagel"). What meaning does running this function have on a byte array? It doesn't have any.

That operation simply cannot be implemented at all if you don't know the encoding. So if you were to choose the Python 2 way, you would have to either remove all string operations from the language or force the user to specify the encoding on every operation.

A string literal like "Donut" isn't just a string literal. It has a representation, and you first have to convert the logical string into a byte array that matches the representation of the source string. Let's say your Python program is loading UTF-16 text. Instead of simply specifying the encoding, you just load the text without any encoding. If you wanted to run the replace operation, it would have to look something like this: source.replace("Donut".encode("utf-16-le"), "Bagel".encode("utf-16-le")). This is because you need to convert all string literals to match the encoding of the text that you want to replace.
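Here is a small sketch of that in Python 3, assuming the input is little-endian UTF-16 without a BOM (the sample sentence is made up). It compares the bytes-only route, where every literal has to be encoded first, with decoding once and working on str.

```python
text = "I ate a Donut"
raw = text.encode("utf-16-le")  # stand-in for bytes loaded from a UTF-16 file

# An ASCII byte literal never matches, because every character is two bytes here.
ascii_literal_found = b"Donut" in raw

# Bytes route: every literal must be encoded to match the source encoding.
replaced_bytes = raw.replace(
    "Donut".encode("utf-16-le"),
    "Bagel".encode("utf-16-le"),
)

# Text route: decode once at the boundary, then use plain string literals.
replaced_text = raw.decode("utf-16-le").replace("Donut", "Bagel")
```

Both routes end up with the same result, but the bytes route only works because the encoding was known after all.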

Well, doesn't this cause a pretty huge problem? You now need to have a special type just for string literals because the runtime string type can use any encoding and therefore isn't guaranteed to be able to represent the logical value of a literal. Isn't that extremely weird?


mehrdadn 6y ago

I'm too tired of these to reply to everything, so I'll just reply to the first bit and rest my case. It's like you're completely ignoring the fact that <meta charset="UTF-8"> and <?xml encoding="UTF-8"...?> and all that are actually things in the real world. You can't just treat them as strings until you read their bytes, was my point. The notion that the user can or should always provide you out-of-band encoding info or otherwise let you assume UTF-8 everywhere every time you read a file or stdin is just a fantasy and not how so many of our tools work.
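A rough sketch of what "read the bytes first" looks like for the meta charset case. This is not the real HTML sniffing algorithm, just an illustration: the regex, the 1024-byte window, the fallback to UTF-8, and the name `decode_html` are all assumptions, and it only works when the declaration itself is in an ASCII-compatible encoding.

```python
import re

# Look for <meta charset=...> in the raw bytes, before any decoding happens.
META_CHARSET = re.compile(rb'<meta\s+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode using the in-band declaration if present, else the fallback."""
    match = META_CHARSET.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    # An unknown codec name raises LookupError; a wrong one may raise UnicodeDecodeError.
    return raw.decode(encoding)


page = '<meta charset="iso-8859-1"><p>caf\u00e9</p>'.encode("iso-8859-1")
decoded = decode_html(page)
```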

int_19h 6y ago

So treat them as bytes. It's not like Python 3 removed that type. It just made it impossible to inadvertently treat bytes as a string in a certain encoding - unlike Python 2, which would happily implicitly decode assuming ASCII.
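A quick illustration of the Python 3 behavior, as a sketch (the helper name `concat_is_rejected` is made up): mixing bytes and str raises TypeError instead of silently decoding, so the conversion has to be spelled out.

```python
def concat_is_rejected(raw: bytes, text: str) -> bool:
    """Return True if Python refuses to concatenate bytes with str."""
    try:
        raw + text  # Python 3 raises TypeError here
    except TypeError:
        return True
    return False


rejected = concat_is_rejected(b"abc", "def")

# The explicit version: the caller has to name the encoding.
joined = b"abc".decode("ascii") + "def"
```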

mehrdadn 6y ago

> So treat them as bytes.

Which was my entire point!! You have to go to bytes to get correct behavior. They didn't fix the nonsense by changing the default data type to a string, they just made it even more roundabout to write correct code.

> It just made it impossible to inadvertently treat bytes as a string in a certain encoding

It most certainly did not! It's like you completely ignored what I just told you. I already gave you an example: sys.stdin.read(). It uses some encoding when you really can't ever guarantee any encoding, or when the normal case is that the encoding info itself is embedded in the byte stream. How can you know a priori what the user piped in? Are you sure users magically know every stream's encoding and are just neglecting to provide it to you? At least if they were bytes by default, you'd maintain correct state and only have to worry about encoding/decoding at the I/O boundary. (And to top off the insanity, it's not even UTF-8 everywhere; on Windows it's CP-1252 or something, so you can't even rely on the default I/O being portable across platforms, even for text! Let alone arbitrary bytes. This insanity was there in Python 2, but they sure didn't make it better by moving from bytes to text as the default...)
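For reference, a sketch of the bytes-at-the-boundary approach in Python 3. `sys.stdin.buffer` is the binary layer under `sys.stdin`; the helper name `read_bytes` and the BytesIO stand-in for a pipe are assumptions for illustration.

```python
import io
import locale
import sys


def read_bytes(stream=None) -> bytes:
    """Read everything as raw bytes; defaults to the binary layer of stdin."""
    if stream is None:
        stream = sys.stdin.buffer
    return stream.read()


# The locale's preferred encoding; its value varies by platform and configuration.
default_text_encoding = locale.getpreferredencoding(False)

# Stand-in for piped input: these bytes are not valid UTF-8.
piped = io.BytesIO(b"caf\xe9\n")
raw = read_bytes(piped)

# Decode only once the encoding is actually known.
text = raw.decode("iso-8859-1")
```

Nothing is decoded until the caller says which encoding applies, so the raw bytes survive intact no matter what the platform default is.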
