naniwaduni | 5y ago | 0 points

> Not really. How would “.toupper()” work on a raw set of bytes, which would either contain an MP3 file or UTF8 encoded text?

It doesn't. It doesn't work with Unicode either. No, not "would need giant tables", literally doesn't work—you need to know whether your text is Turkish.
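To make the Turkish point concrete, here is a rough Python sketch. `turkish_upper_sketch` is a hypothetical helper of mine, not anything from a library, and it only covers the dotted/dotless i pair; real locale-aware casing needs more than this.

```python
# str.upper() uses the default, locale-independent Unicode case mapping.
default_upper = "i".upper()        # "I" (U+0049), the non-Turkish answer

# Turkish orthography expects dotted capital "İ" (U+0130) for "i",
# and plain "I" for dotless "ı" (U+0131).
def turkish_upper_sketch(text: str) -> str:
    """Hypothetical helper: special-case dotted i, then use the default mapping."""
    return text.replace("i", "\u0130").upper()

city = turkish_upper_sketch("izmir")   # "İZMİR"

# Case mapping is not even length-preserving: "ß" uppercases to "SS".
sharp_s_upper = "\u00df".upper()
```

Nothing in the string itself says which of the two mappings is the right one; that knowledge has to come from outside the text.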

> How would slicing work? I want the first 4 characters of a given string. That’s completely meaningless without an encoding.

It's meaningless with an encoding: what are the first four characters of "áíúéó"? Do you expect "áí"? What are the first four characters of "ﷺ"? Trick question, that's one unicode codepoint.

At least with bytes you know that your result after slicing four bytes will fit in a 4-byte buffer.
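A small Python sketch of the slicing cases, assuming the accented letters are stored decomposed (base letter plus U+0301 combining acute). I write them as escapes so the assumption is visible.

```python
import unicodedata

# Assumption: five accented vowels stored as base letter + combining acute.
decomposed = "a\u0301i\u0301u\u0301e\u0301o\u0301"

# Slicing counts code points, so "four characters" shows up as two letters.
first_four = decomposed[:4]
visible = unicodedata.normalize("NFC", first_four)   # two precomposed letters

# U+FDFA is a single code point that renders as a whole phrase.
ligature = "\ufdfa"
expanded = unicodedata.normalize("NFKC", ligature)   # many code points

# Bytes make only one promise: four bytes fit in a four-byte buffer.
raw = decomposed.encode("utf-8")
chunk = raw[:4]
```

The byte slice can still cut through the middle of a multi-byte sequence; it just never lies about its size.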

> How would concatenation work? I’m not saying Python does this, but concatenation two graphemes together doesn’t necessarily create a string with len() == 2.

It doesn't work with Unicode either. I'm sure you've enjoyed the results of concatenating a string with an RTL marker with unsuspecting text.

It gets worse if we try to ascribe linguistic meaning to the text. What's the result of concatenating "ranch dips" with "hit singles"?
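A Python sketch of the concatenation cases. I am assuming "RTL marker" means something like U+202E (RIGHT-TO-LEFT OVERRIDE); the variable names are mine.

```python
import unicodedata

# Two one-code-point strings that concatenate into one visible character.
joined = "e" + "\u0301"                            # 2 code points
composed = unicodedata.normalize("NFC", joined)    # 1 code point, U+00E9

# Two regional indicator symbols concatenate into what renders as one flag.
flag = "\U0001F1E8" + "\U0001F1E6"                 # still 2 code points

# Assumption: U+202E stands in for the "RTL marker". The lengths just add up,
# but the text that follows is displayed in reversed order.
rtl_override = "\u202e"
label = "user: " + rtl_override + "abc"
```

`len()` stays perfectly additive in every case here; it is the rendered result that stops matching the parts.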

> How would “.startswith()” work with regards to grapheme clusters?

It doesn't. "🇨" is a prefix of "🇨🇦"; "i" is not a prefix of "ĳ".
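The same two prefix examples in Python, written with escapes so the code points are unambiguous. The NFKC step at the end is my own addition, to show that the answer changes depending on which normalization you pick.

```python
import unicodedata

# Flag of Canada = REGIONAL INDICATOR C + REGIONAL INDICATOR A.
flag_ca = "\U0001F1E8\U0001F1E6"
indicator_c = "\U0001F1E8"
half_flag_is_prefix = flag_ca.startswith(indicator_c)       # True

# U+0133 LATIN SMALL LIGATURE IJ is one code point, not "i" followed by "j".
ij = "\u0133"
i_is_prefix = ij.startswith("i")                            # False

# After compatibility normalization the ligature becomes "ij".
i_is_prefix_nfkc = unicodedata.normalize("NFKC", ij).startswith("i")
```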

> Text is different from bytes. There’s extra meaning and information attached to an arbitrary stream of 1s and 0s that allows you to do things you wouldn’t have been able to before.

None of the distinctions you're trying to make are tenable.
