lolc 5y ago

The difference is that when truncating, I can work within Unicode to deal with the situation: I can accept the possibility of mutilated letters, I can convert to NFC, I can truncate on word boundaries. I have a choice.

If I have a byte array, I can do none of these things short of implementing a good chunk of Unicode. If I truncate, I risk ending up with an invalid UTF-8 string. End of story.

tsimionescu 5y ago

And what is wrong with an invalid UTF-8 string? Why were you truncating the string in the first place?

Basically, I believe the point here is that a Unicode-aware truncation should be done in a Unicode-aware truncate method. There is no good reason to parse a string as UTF-8 ahead of time - just keep it as a blob of bytes until you need to do something "texty" with it. It is the truncate-at-word-boundaries() method that should interpret the bytes as UTF-8 and fail if they are not valid. Why parse it sooner?
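
A minimal Python sketch of that idea, where the data stays as `bytes` and is only decoded inside the text-aware routine. The function name `truncate_at_word_boundary` is hypothetical, and splitting on ASCII spaces is a simplifying assumption rather than full Unicode word segmentation:

```python
def truncate_at_word_boundary(data: bytes, max_chars: int) -> str:
    """Decode lazily, then cut at the last space within max_chars code points."""
    # Decoding happens here, not earlier: invalid UTF-8 raises UnicodeDecodeError.
    text = data.decode("utf-8")
    if len(text) <= max_chars:
        return text
    # Look for a space at or before the cut position (assumption: words are
    # separated by ASCII spaces; real word boundaries are more involved).
    cut = text.rfind(" ", 0, max_chars + 1)
    # No usable space found: fall back to a plain code point cut.
    return text[:cut] if cut > 0 else text[:max_chars]


blob = "hello wide world".encode("utf-8")  # stays a byte blob until needed
print(truncate_at_word_boundary(blob, 8))
```

Code that only stores, copies, or forwards `blob` never pays for validation; only the caller that wants word-level behavior does.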

Jasper_ 5y ago

> If I have a byte array, I can do none of these things short of implementing a good chunk of Unicode. If I truncate, I risk ending up with an invalid UTF-8 string.

Yes, and? You can have an invalid sequence of Unicode code points too, such as an unpaired surrogate (something Python's text model actually abuses to store "invalid Unicode" in a special, non-standard way).
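
To illustrate the Python behavior mentioned above, here is a short sketch using the standard `surrogateescape` error handler. The sample bytes are made up for the example:

```python
# A UTF-8 byte string cut in the middle of "é" (0xC3 0xA9): the lead byte
# is present but its continuation byte is missing.
raw = b"caf\xc3"

# surrogateescape maps each undecodable byte to a lone surrogate in the
# range U+DC80..U+DCFF, so the resulting str holds an unpaired surrogate.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udcc3"

# The same handler reverses the mapping, so the original bytes round-trip.
assert text.encode("utf-8", errors="surrogateescape") == raw

# A strict encode refuses the string, because it is not valid Unicode text.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert strict_ok is False
```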

If you truncate at the byte level, you are just truncating "between code points"; it's a finer granularity than the code point layer, so you can still convert to NFC, truncate on word boundaries, etc. You just need to ignore the parts of the UTF-8 string that are invalid, which isn't difficult because UTF-8 is self-synchronizing.
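
A rough Python sketch of what self-synchronization buys you. The helper name `truncate_utf8` is hypothetical, and it assumes the input was valid UTF-8 before the cut. Continuation bytes always have the bit pattern `10xxxxxx`, so finding the start of a code point only requires stepping back a few bytes:

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (0b10xxxxxx), the cut falls inside a code point,
    # so move back until the cut sits just before a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "héllo".encode("utf-8")      # b'h\xc3\xa9llo'
print(truncate_utf8(encoded, 2))       # cut would split "é", so it backs up

# Alternative: cut blindly, then drop the invalid tail while decoding.
print(encoded[:2].decode("utf-8", errors="ignore"))
```

Both approaches leave valid UTF-8, after which normalization or word-boundary logic can be applied as usual.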
