nerdponx · 5y ago · 0 points

what's wrong about strings representing text?

You're not making an argument about backward compatibility here, you're making a strong claim that representing text as a sequence of Unicode code points is fundamentally wrong. I have never heard anyone make this point before, and I am inclined to disagree, but I'm curious what your reasoning is for it.

7 comments · 1 top-level

naniwaduni · 5y ago · 6 in thread

Indeed, representing text as a sequence of Unicode code points is fundamentally wrong.

There are no operations on sequences of Unicode code points that are more correct than an analogous operation on bytes.

(Everyone's favourite example, length, actually becomes less correct—a byte array's length at least corresponds to the amount of space one might have to allocate for it in a particular encoding. A length in codepoints is absolutely meaningless both technically and linguistically. And this is, for what little it's worth, close to the only operation you can do on a string without imposing additional restrictions about its context.)

orf · 5y ago

> There are no operations on sequences of Unicode code points that are more correct than an analogous operation on bytes.

That’s ridiculous. Uppercasing/lowercasing, slicing, “startswith”, splitting, etc etc.

Your statement is correct only if you care about nothing but ASCII.

hvdijk · 5y ago

Uppercasing/lowercasing cannot be done on Unicode code points, because that fails to handle things like ﬁ -> FI where the uppercased version does not consist of the same number of Unicode code points. Slicing and splitting cannot be done on Unicode code points because it may separate a character from a subsequent combining character. "startswith" cannot be done on Unicode code points because some distinct code points need to be treated as equivalent. These are pretty much the same problems you also have when you perform those same operations on bytes. You might encounter those problems in fewer cases when you perform operations on code points rather than on bytes, but you won't have solved the problems entirely.
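The three failure modes described above can be reproduced with a minimal Python sketch. It assumes Python 3, whose `str` is a sequence of code points, and uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# Case mapping can change the number of code points.
ligature = "\ufb01"                 # LATIN SMALL LIGATURE FI
upper = ligature.upper()            # full case mapping expands it to "FI"
print(len(ligature), len(upper))    # 1 2

# Slicing by code point can split a base letter from its combining mark.
decomposed = "o\u0308"              # "o" + COMBINING DIAERESIS, renders as one letter
print(decomposed[:1] == "o")        # True: the diaeresis was cut off

# A prefix test misses a canonically equivalent spelling.
composed = "\u00f6l"                # precomposed o-with-diaeresis, then "l"
print(composed.startswith(decomposed))   # False

# Normalizing first makes this particular comparison succeed.
nfc = unicodedata.normalize("NFC", decomposed)
print(composed.startswith(nfc))     # True
```

Normalizing before comparing handles this one case of canonical equivalence; it does not address the slicing or case-mapping problems.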

naniwaduni · 5y ago

None of those operations are correct on Unicode codepoints. Your statement is only just barely tenable if you only care about well-edited and normalized formal prose in common Western languages.

Even then, upper/lower-casing is iffy.

lolc · 5y ago

> There are no operations on sequences of Unicode code points that are more correct than an analogous operation on bytes.

Wow. I wonder how you arrived at this point. You can't, for example, truncate a UTF-8 byte array without the risk of producing a broken string. But this is only the start. Here are two strings, six letters each, one in NFC, the other in NFD, and their byte-length in UTF-8:

    "Åström" is 8 bytes in UTF-8

    "Åström" is 10 bytes in UTF-8

If your software tells the user that one is eight and the other is 10 letters long, it is not "less correct". It is incorrect. Further, if searching for "Åström" won't find "Åström", your software is less useful than it could be if it knew Unicode. (And it's sad how often software gets this wrong.)
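A rough Python sketch of the byte-level claims above: the two byte lengths, the broken string left by truncating UTF-8 bytes, and the failed search. The strings are written with escapes so that their normalization form is unambiguous. `decodes_cleanly` and `contains_normalized` are hypothetical helpers written for this illustration, and the `"Name: "` prefix is made up.

```python
import unicodedata

composed = "\u00c5str\u00f6m"       # NFC: precomposed letters
decomposed = "A\u030astro\u0308m"   # NFD: base letters plus combining marks

print(len(composed.encode("utf-8")))     # 8
print(len(decomposed.encode("utf-8")))   # 10


def decodes_cleanly(data: bytes) -> bool:
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Cutting inside a multi-byte sequence leaves invalid UTF-8.
print(decodes_cleanly(composed.encode("utf-8")[:1]))   # False


def contains_normalized(haystack: str, needle: str) -> bool:
    """Substring test after normalizing both sides to NFC."""
    def to_nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)
    return to_nfc(needle) in to_nfc(haystack)


document = "Name: " + decomposed
print(composed in document)                      # False: naive search misses it
print(contains_normalized(document, composed))   # True
```

Normalizing both sides is one common approach to this kind of search, shown here only for the two spellings in question.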

tsimionescu · 5y ago

> If your software tells the user that one is eight and the other is 10 letters long, it is not "less correct". It is incorrect.

In fact, if the software tells you that either of the strings is either 8 or 10 letters long, then either way the software is incorrect - those are both obviously 6-letter strings.

Now, does UTF-8 help you discover they are 6-letter strings better than other representations? There are certainly text-oriented libraries that can do that, but not those that simply count the code points decoded from UTF-8 - they must have an understanding of all of Unicode. Even worse, the question "how many letters does this string have" is not generally meaningful - there are plenty of perfectly valid Unicode strings for which this question doesn't have a meaningful answer.

However, the question "how many Unicode code points does this string have" is almost never of interest. You either care about some notion of unique glyphs, or you care about byte lengths.
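To make the different notions of length concrete, here is a small Python sketch. The `lengths` function is hypothetical, and its third value is only a crude stand-in for "letters": it counts code points that are not combining marks, which is not real grapheme segmentation and gives the wrong answer for flag emoji, among other cases.

```python
import unicodedata


def lengths(text: str) -> tuple[int, int, int]:
    """Return (UTF-8 bytes, code points, non-combining code points)."""
    byte_len = len(text.encode("utf-8"))
    code_points = len(text)
    # Approximation only: skip code points with a nonzero combining class.
    non_combining = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return byte_len, code_points, non_combining


composed = "\u00c5str\u00f6m"       # NFC spelling
decomposed = "A\u030astro\u0308m"   # NFD spelling
flag = "\U0001F1E8\U0001F1E6"       # two regional indicators, shown as one flag

print(lengths(composed))     # (8, 6, 6)
print(lengths(decomposed))   # (10, 8, 6)
print(lengths(flag))         # (8, 2, 2), although a reader sees a single flag
```

The code point count matches the letter count only for the NFC spelling, and none of the three numbers matches what a reader sees for the flag.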

naniwaduni · 5y ago

You can't truncate a sequence of Unicode codepoints without the risk of producing a broken string, either. What do you get if you truncate "Åström" after the first "o"? What do you get if you truncate 🇨🇦 after the first codepoint?
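Both questions can be checked with a short Python sketch, assuming the name is stored in decomposed (NFD) form and the flag is the usual pair of regional indicator code points:

```python
import unicodedata

decomposed = "A\u030astro\u0308m"        # NFD spelling of the name
cut = decomposed[: decomposed.index("o") + 1]

# The combining diaeresis is dropped, so the last letter silently changes.
print(unicodedata.normalize("NFC", cut) == "\u00c5stro")   # True

flag = "\U0001F1E8\U0001F1E6"            # regional indicators C + A
half = flag[:1]

# What remains is a lone regional indicator, not a flag.
print(unicodedata.name(half))   # REGIONAL INDICATOR SYMBOL LETTER C
```

Neither slice raises an error; each just yields valid code points that mean something different from the original text.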

Normalization is not a real solution unless you restrict yourself to working with well-edited formal prose in common Western languages.

This is not a claim made from ignorance.
