naniwaduni · 5y ago

You can't truncate a sequence of Unicode codepoints without the risk of producing a broken string, either. What do you get if you truncate "Åström" after the first "o"? What do you get if you truncate 🇨🇦 after the first codepoint?
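To make the two questions concrete, here is a minimal Python sketch. It assumes the name is stored in decomposed (NFD) form, which the comment does not specify, and uses only the standard `unicodedata` module.

```python
import unicodedata

# Decomposed (NFD) spelling: each accented letter is a base letter
# followed by a combining mark.
name = unicodedata.normalize("NFD", "\u00c5str\u00f6m")  # "Åström"

# Truncate right after the first "o": the U+0308 diaeresis that followed it is lost.
cut = name[: name.index("o") + 1]
print(cut)  # renders as "Åstro": still a valid string, but a different spelling

# The flag is two regional-indicator codepoints (C, then A).
flag = "\U0001F1E8\U0001F1E6"
half = flag[:1]
print(len(flag), unicodedata.name(half))
```

Neither result is an encoding error. Both are well-formed sequences of codepoints that no longer say what the original said.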

Normalization is not a real solution unless you restrict yourself to working with well-edited formal prose in common Western languages.

This is not a claim made from ignorance.

lolc · 5y ago

Sorry, we're mixing two layers. Of course, if I truncate a string, it may lose its meaning. And having accents fall off is problematic. But it's not the same as truncating a byte-array, because then an invalid sequence of bytes may result.

Stop treating these cases as equivalent. They're not.
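For reference, the byte-level case looks roughly like this in Python. The cut position (6 bytes) is an arbitrary choice for illustration, and the string is assumed to be in precomposed form.

```python
data = "\u00c5str\u00f6m".encode("utf-8")  # "Åström", 8 bytes in UTF-8
cut = data[:6]                              # ends in the middle of the 2-byte "ö"

try:
    cut.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The last byte is a lead byte with no continuation byte after it.
    decoded_ok = False

print(decoded_ok)
print(cut.decode("utf-8", errors="replace"))  # the dangling byte becomes U+FFFD
```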

naniwaduni (OP) · 5y ago

They are equivalent. The only reason you find it problematic that a sequence of bytes is "invalid" (read: can't be decoded in your preferred encoding) is because you've manufactured the problem.

In the end, the only layer at which it really matters whether your byte sequence can be decoded is the font renderer, and just being valid utf-8 isn't good enough for it either.

lolc · 5y ago

> In the end, the only layer at which it really matters whether your byte sequence can be decoded is the font renderer

Ok that explains how we ended up here. I'm considering some other common uses! A search-index for example greatly profits from being able to normalize representations and split words.
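A rough sketch of the search-index case, assuming a toy index that splits on whitespace and keys on NFC plus case folding. `index_keys` is a hypothetical helper, not something from an actual search engine.

```python
import unicodedata

def index_keys(text):
    """Split on whitespace and map each word to a normalized, case-folded key."""
    return [unicodedata.normalize("NFC", word).casefold() for word in text.split()]

composed = "\u00c5str\u00f6m"        # "Åström" with precomposed letters
decomposed = "A\u030astro\u0308m"    # same name, base letters + combining marks

# Both spellings land on the same index key.
print(index_keys(composed) == index_keys(decomposed))
```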

Jasper_ · 5y ago

So in one case, the text becomes corrupted and unreadable (i.e. loses its meaning), and in the other, it becomes corrupted and unreadable. What's the difference?

Having "accents fall off" has gotten people murdered [0]. Accents aren't things peppered in for effect, they turn letters into different letters, spelling different words. Analogously, imagine that a bunch of software accidentally turned every "d" into a "c" because some committee halfway around the world decided "d" should be composed of the "c" and "|" glyphs. That's the kind of text corruption that regularly happens in other languages when dealing with text at the code point layer.

[0] https://languagelog.ldc.upenn.edu/nll/?p=73 . Note that this is Turkish, which has the "dotted i" problem, meaning that this was more than likely a .toupper() gone wrong rather than a truncation issue.
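The "dotted i" problem can be sketched in Python, whose `str.upper()` is locale-independent. `turkish_upper` is a hypothetical helper written for this illustration. It covers only the two letters at issue and is not a full Turkish case mapping.

```python
dotted, dotless = "i", "\u0131"   # i and ı are distinct letters in Turkish

# The default mapping sends both letters to ASCII "I" ...
same_upper = dotted.upper() == dotless.upper() == "I"

# ... so a round trip through upper case silently turns ı into i.
round_trip = dotless.upper().lower()

# Hypothetical helper: apply the Turkish-specific pairs first, then the default mapping.
TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})

def turkish_upper(s):
    return s.translate(TURKISH_UPPER).upper()

print(same_upper, round_trip == dotted, turkish_upper("i") == "\u0130")
```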

lolc · 5y ago

The difference is that for truncating, I can work within Unicode to deal with the situation. I can accept the possibility of mutilated letters, I can convert to NFC, I can truncate on word boundaries. I have a choice.
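Two of those options, sketched in Python under stated assumptions: `truncate_keeping_marks` is a hypothetical helper that only avoids splitting a base letter from its combining marks. It is not full grapheme-cluster segmentation and would still split a flag or an emoji ZWJ sequence.

```python
import unicodedata

def truncate_keeping_marks(s, n):
    """Cut s to at most n codepoints without separating a base letter from its marks."""
    if n >= len(s):
        return s
    cut = n
    # If the cut lands just before a combining mark, back up past the
    # marks and past the base letter they belong to.
    while cut > 0 and unicodedata.combining(s[cut]) != 0:
        cut -= 1
    return s[:cut]

decomposed = "A\u030astro\u0308m"                    # "Åström" in NFD, 8 codepoints
composed = unicodedata.normalize("NFC", decomposed)  # 6 codepoints

print(len(decomposed), len(composed))
print(composed[:5])                           # NFC first: "Åströ" keeps its accent
print(truncate_keeping_marks(decomposed, 6))  # drops the whole "ö" rather than its dots
```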

If I have a byte array, I can do none of these things short of implementing a good chunk of Unicode. If I truncate, I risk ending up with an invalid UTF-8 string. End of story.
