undefined | Better HN

0 pointsnaniwaduni5y ago0 comments

Every circumstance. Why do you consider it unviable? What problems do you think having a Unicode sequence solves?

0 comments

4 comments · 1 top-level

lolc5y ago· 3 in thread

Convince me. Here's a little library function that turns text into a set of words:

    def keywords(text):
        return set(filter(None, re.split("\W+", unicodedata.normalize("NFKC", input_str).lower())))

How would this look if strings were byte-arrays? How would `normalize()`, `lower()`, and `split()` know what encoding to use?

The way I see it: If the encoding is implicit, you have global state. If it's explicit, you have to pass the encoding. Both is extra state to worry about. When the passed value is a Unicode string, this question doesn't come up.

naniwaduniOP5y ago

It looks pretty much the some, except that you assume the input is already in your library's canonical encoding (probably utf-8 nowadays).

I realize this sounds like a total cop-out, but when the use-case is destructively best-effort tokenizing an input string using library functions, it doesn't really matter whether your internal encoding is utf-32 or utf-8. I mean, under the hood, normalize still has to map arbitrary-length sequences to arbitrary-length sequences even when working with utf-32 (see: unicodedata.normalize("NFKC", "a\u0301 ﬃ") == "\xe1 ffi").

So on the happy path, you don't see much of a difference.

The main observable difference is that if you take input without decoding it explicitly, then the always-decode approach has already crashed long before reaching this function, while the assume-the-encoding approach probably spouts gibberish at this point. And sure, there are plenty of plausible scenarios where you'd rather get the crash than subtly broken behaviour. But ... I don't see this reasonably being one of them, considering that you're apparently okay with discarding all \W+.

suzuki5y ago

I agree with you. I wish Python 3 had strings as byte sequences mainly in UTF-8 as Python 2 had once and Go has now. Then things would be kept simple in Japan. Python 3 feels cumbersome. To handle a raw input as a string, you must decode it in some encoding first. It is a fragile process. It would be adequate to treat the input bytes transparently and put an optional stage to convert other encodings to UTF-8 if necessary.

lolc5y ago

I know this from PHP, where I have to be aware of the encoding the strings are in. I still don't see what the advantage should be of that.

j / k navigate · click thread line to collapse