Python 3 tried to have its cake and eat it too by choosing the most compact encoding depending on the string, but in practice this wastes a lot of space. You'll double (or heaven forbid quadruple) your string size because of a single codepoint, and these codepoints are almost always a small percentage of the string. That's actually why UTF-16 and UTF-8 exist.
It would have been better for strings to default to UTF-8, and to add an optional encoding so the programmer can specify what kind of encoding to use. As it is now, in order to use (for example) UTF-16 strings in Python you have to keep them around as bytes, decode them to a string, perform string operations, and reencode them to bytes again. Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.
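The round trip described above can be sketched in a few lines (an illustration, not anyone's production code; the UTF-16 payload here is invented for the example):

```python
# Sketch of the decode/operate/re-encode dance: to "use" UTF-16 text in
# Python you must decode to str, operate, then encode back to bytes.
utf16_payload = "naïve café".encode("utf-16-le")  # bytes from some wire format

text = utf16_payload.decode("utf-16-le")   # decode to Python's internal repr
text = text.upper()                        # any str operation forces this hop
utf16_payload = text.encode("utf-16-le")   # re-encode to get UTF-16 back

print(utf16_payload.decode("utf-16-le"))   # NAÏVE CAFÉ
```

Every string operation pays for both conversions, which is the vanishing benefit the comment refers to.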
I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.
A better solution is to allow programmers to specify string encoding and default it to UTF-8. From that, there's a clear path to everything you'd want to do.
Perhaps I should give an example. Suppose you're parsing and dealing with something. Say HTML, since it's well-known. So you receive a long byte array starting "<html><body><p>Sometimes</p>". You parse the byte array and produce a number of objects, including up to four strings, namely "html", "body", "p" and "Sometimes", and by the time you've stored those in objects and allocated them, they occupy 32 bytes each on the heap. If you use UCS-4 the last may need 48 or 64 bytes, depending on your allocator's rounding and buckets. The byte array you get from the I/O subsystem may be 100k, but most of the strings in the code are short, and the impact of using UCS-4 is moderate.
A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers, now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.
You are looking at the issue from the perspective of a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby with multibyte support built into their string manipulation functions. 20 years ago, multibyte-aware string manipulation functions didn't even exist!
But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.
I definitely need indexes, and I don't really care about graphemes. I actually have only a vague idea what that is.
I write parsers typically by using a global string and lots of indices. The important thing for me is to be able to extract characters and slices at given positions, and to be able to say "parse error at line X character Y" where X and Y are helpful to the user most of the time.
I would be absolutely fine with working in UTF-8 bytes only (and that would be faster I guess), but there would be a more pressing need to recompute character positions (as a code point or grapheme index) from byte offsets at times.
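Recomputing "line X character Y" from a byte offset is cheap in practice. A minimal sketch, assuming the buffer is valid UTF-8 (the function name is made up for the example):

```python
def line_and_char(buf: bytes, offset: int):
    """Map a byte offset in a UTF-8 buffer to (line, character),
    both 1-based, counting characters as code points."""
    before = buf[:offset]
    line = before.count(b"\n") + 1
    last_nl = before.rfind(b"\n")
    # Decode only the current line's prefix to count code points.
    char = len(before[last_nl + 1:].decode("utf-8")) + 1
    return line, char

src = "ab\ncafé!".encode("utf-8")
print(line_and_char(src, src.index(b"!")))  # (2, 5): é is 2 bytes, 1 code point
```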
There are more abstract parsing methods where parser subroutines are implemented in a position agnostic way, but I'm very happy with my simple method.
If everything works on graphemes instead of code points (as I think Perl6 does) I will be happy to use that, but it's not so important from a practical standpoint.
No you don't. You need iterators, which behave like pointers. Let's say you're hundreds or thousands of characters into a string at the start of some token. Now you want to scan from that position to the end of the token.
With indexes it's fast only if indexing is by code point. In a language that properly supports graphemes, indexing would mean scanning from the beginning of the string to get to that index.
With iterators it can start scanning from that position directly. Same speed no matter where you are in the string. With indexes the larger your input the slower your parse gets, and not in a linear way.
It's also super easy to get a slice using a start and end iterator. As for line x character y messages, you can't get that directly from an index as it depends on how many new lines you parsed so indexing doesn't help there.
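The "scan from a saved position" pattern works directly on UTF-8 bytes too, because ASCII delimiters can never appear inside a multibyte sequence. A hedged sketch (the function is hypothetical, delimiters chosen arbitrarily):

```python
def scan_token(buf: bytes, pos: int):
    """Scan a token starting at byte position `pos` in a UTF-8 buffer.
    ASCII bytes never occur as UTF-8 continuation bytes, so comparing
    raw bytes against b" \t\n" is safe without decoding."""
    start = pos
    while pos < len(buf) and buf[pos] not in b" \t\n":
        pos += 1
    return buf[start:pos].decode("utf-8"), pos

buf = "héllo wörld".encode("utf-8")
token, end = scan_token(buf, 0)
print(token)  # héllo
```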
Of course not, but it was considered that breaking the O(1) indexing guarantee was a bridge too far even in the breaky release of Python 3.
> A better solution is to allow programmers to specify string encoding and default it to UTF-8.
Agreed. UTF-8 is the sensible default for most people.
So, don't decode to a string, and do all your character manipulation on the bytes.
> A better solution is to allow programmers to specify string encoding and default it to UTF-8.
Absolutely not: the internal representation of a string should be of no interest to a user of your language. The 'best' solution is to represent strings as a list of index lookups into a palette, and to update the palette as new graphemes are seen. This is similar to the approach Perl6 is using[0].
[0]: https://6guts.wordpress.com/2015/12/05/getting-closer-to-chr...
WHAT?!? I suppose that you've only ever worked with Latin characters. Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:
מהי מהירות האווירית של סנונית ארופאית ללא משא? ("What is the airspeed velocity of an unladen European swallow?")
Yes, that is a Hebrew Monty Python quote. Now try it with a smiley somewhere in the string (HN filtered out my attempt to post the string with a smiley).
Is each application to maintain their own dictionary of code points? If the map is to be in a library, then why not have it in the language itself?
# Apparently we expect the field to be in this format ¯\_(ツ)_/¯
Right above the code he'd just fixed.
Of course, the moment we pushed the update it brought production down, because the Python interpreter doesn't understand Unicode in source files unless you specify which encoding you are using.
After that, "¯\_(ツ)_/¯" became a synonym for his name on our HipChat server, heh.
In Python 3, source code files are assumed to be UTF-8.
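This default (PEP 3120), and the PEP 263 coding declaration that Python 2 required, can both be observed with the standard library's `tokenize.detect_encoding`:

```python
import io
import tokenize

# Python 3 assumes UTF-8 when no declaration is present (PEP 3120);
# a PEP 263 coding comment still overrides it.
plain = b"x = 1\n"
declared = b"# -*- coding: latin-1 -*-\nx = 1\n"

print(tokenize.detect_encoding(io.BytesIO(plain).readline)[0])     # utf-8
print(tokenize.detect_encoding(io.BytesIO(declared).readline)[0])  # iso-8859-1
```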
Though I also think the struggle is mostly due to people being stuck in an everything-is-like-ASCII mindset, and though I didn't get into that, it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.
Personally I'd like everyone to just actually learn at least the things about Unicode that I went into here (such as why "one code point == one character" is a wrong assumption), and I think that'd alleviate a lot of the pain. I also avoided talking much about normalization, because too many people hear about it and decide they can just normalize to NFKC and go back to assuming code point/character equivalence post-normalization.
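A quick demonstration of why normalization doesn't restore code point/character equivalence: NFKC composes some sequences into single code points, but ZWJ emoji sequences (among others) stay multi-code-point:

```python
import unicodedata

# NFKC composes "e" + COMBINING ACUTE ACCENT into a single code point...
decomposed = "e\u0301"
print(len(unicodedata.normalize("NFKC", decomposed)))  # 1

# ...but many graphemes remain multi-code-point after normalization,
# e.g. a family emoji built from a ZWJ sequence:
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man+ZWJ+woman+ZWJ+girl
print(len(unicodedata.normalize("NFKC", family)))      # 5
```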
Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
I actually like UTF-8 because it will break very quickly, and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.
All the other options will also break, but later on:
- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.
- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin color emoji, and once again, you're back at square one.
You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept that "str[3]" is hopeless very quickly. It helps a lot if your language has separate types for "byte" and "Unicode codepoint", however, so you can't accidentally treat a single byte as a character.
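The breakage points above are easy to reproduce in Python:

```python
# UTF-8 widths for the examples above: one, two, three, and four bytes.
for ch in ["e", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))

# Two strings that render identically but differ at the code point level:
composed = "\u00e9"        # é as one code point
decomposed = "e\u0301"     # e + U+0301 COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False
```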
The solution to that is simple, don't let the programmer access individual bytes in a Unicode string.
Get rid of indexing into them and replace it with iterators. Make string handling functions work on code points at the very least but better yet on grapheme clusters. There's a little more to it than that but it's a good start.
Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I believe it's the ability to index into strings that's to blame, not the encoding used.
Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.
Here's an alternative: Use UTF-8 as the internal representation, but don't expose it to the user.
If you're iterating over a string one rune or one grapheme at a time, the UTF-8 substructure is hidden from the user. Only if the user uses an explicit numeric subscript do you need to know a rune's position in the string. When a request by subscript comes in, scan the string and build an index of rune subscript -> byte position. This is expensive, but no worse than UTF-32 in space usage or expansion to UTF-32 in time.
Optimizations:
- Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)
- Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.
- Regular expression processing has to be UTF-8 aware. It shouldn't need an index by rune.
This would maintain Python's existing semantics while reducing memory consumption.
Some performance measurement tool that finds all the places where an index by rune has to be built is useful. It's rare that you really need this, but sometimes you do.
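The lazy-index scheme can be sketched roughly as follows (a hypothetical class, not an actual implementation; negative subscripts and slices are omitted):

```python
class Utf8String:
    """Sketch of the proposal above: UTF-8 bytes internally, with a
    code-point index built lazily on the first numeric subscript."""

    def __init__(self, s: str):
        self._buf = s.encode("utf-8")
        self._index = None  # rune subscript -> byte offset, built on demand

    def __iter__(self):
        # Iteration never needs the index.
        return iter(self._buf.decode("utf-8"))

    def __getitem__(self, i: int):
        # Non-negative subscripts only in this sketch.
        if self._index is None:
            # Continuation bytes look like 0b10xxxxxx; every other byte
            # starts a rune. The same test lets you back up by rune.
            self._index = [off for off, b in enumerate(self._buf)
                           if b & 0xC0 != 0x80]
        start = self._index[i]
        end = self._index[i + 1] if i + 1 < len(self._index) else len(self._buf)
        return self._buf[start:end].decode("utf-8")

s = Utf8String("héllo😀")
print(s[1], s[5])  # é 😀
```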
In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust, however, won't let you materialize an invalid &str without "unsafe".
If you need to make a decision based on the content of a string, then you often need to make a normalized (the same way for both) copy of the inputs.
Most importantly, if you feed in garbage, you get out the SAME garbage. The real world, and historical data, are messy. Trying to be smart can often lead to the most disastrous consequences. Being conservative and tolerant allows for intentional planning to handle the conversion at the source, if and when desired.
Or you can take the C/C++ approach and have a character be 1 byte, 2 bytes, or multi-byte. It's a constant pain in C/C++ to interface between two libraries when one decided to use char and the other wchar_t!
There are pros and cons to both approaches. The prime ones being that []byte allows for easy random access, whereas []rune usually takes O(n) to work with (unless you store rune lengths separately, which is memory intensive).
I guess it's about the right level of abstraction, so that you can choose if you're working with bytes (binary I/O, when you know it's ascii etc.) and when with runes (most situations).
I still haven't decided whether I prefer the Python approach or the Go one.
With integers you can do things like concatenate two strings and adjust the indexes referring to the second string by adding the length of the first one. If you invent a new position type you have to add support for several things like this.
In any case I think the Python people were right to carry on using integers.
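The integer-arithmetic convenience mentioned above, in miniature:

```python
a, b = "héllo ", "wörld"
i = b.index("ö")   # an index into b, in code points

combined = a + b
# Integer indexes survive concatenation by simple arithmetic:
j = len(a) + i
print(combined[j])  # ö
```

An opaque position type would need explicit support for this kind of offsetting, as Swift's String.Index does.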
Rust will panic on invalid slices unless you first convert to raw bytes, and then it will not allow converting invalid slices back to a string in safe rust (in unsafe you're obviously on your own).
Safe Rust guarantees and requires[0] that strings are valid UTF8 at all times.
That aside, essentially all of your desires are part of Swift's strings; you should check them out.
> Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)
Rust does that through the `chars()` iterator[1] which iterates through USVs (codepoints) and can be iterated from both ends. Sadly unlike Swift it does not ship with a grapheme cluster iterator. Happily there is a unicode_segmentation crate[2]. Swift also uses iterators but has more of them: the default iteration works on extended grapheme clusters, and alternate iterators are USV, UTF-16 and UTF-8.
If indexing is necessary for some reason Rust also has char_indices() which iterates on the USV and its (byte) position in the string.
> - Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.
That is what Swift does. `String.index(of:String)` will return a String.Index: https://developer.apple.com/documentation/swift/string.index and indexed String methods will work based on that index type. This includes "reindexing" (offsetting) which is done using String.index(String.Index, offsetBy: String.IndexDistance). Furthermore String exposes two built-in indexes startIndex and endIndex as well as an "indices" iterator.
> This would maintain Python's existing semantics while reducing memory consumption.
It would not maintain O(1) USV indexing (especially in the C API), which was the reason for not just switching to UTF8.
In fact, FSR strings already contain a full UTF8 representation of the string[3], which the latin1 representation can share for pure ASCII strings.
[0] a non-utf8 str is one of Rust's 10 undefined behaviours, part of the "invalid primitive values" section alongside null references or invalid booleans: https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html
[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...
[2] https://kbknapp.github.io/clap-rs/unicode_segmentation/index...
[3] https://github.com/python/cpython/blob/49b2734bf12dc1cda80fd...
I agree that Python 2's Unicode handling is broken. That's why I just stored UTF-8 in a normal string and avoided the whole mess. The only thing I have to do is validate any input from the outside world is really UTF-8.
And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.
I disagree with that premise. It should operate on grapheme clusters. Operating on code points falls into the same trap as operating on bytes.
> a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.
Those operations should have been removed. Indexing is the big one that needs a fixed-width internal representation for speed. Code could have been rewritten to not require indexing. But mechanical translation from Python 2 to 3 was a goal, and because of that they couldn't radically change the Unicode API for the better.
> And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.
You're going to pay the price for 4 byte per codepoint strings quite often. A single emoji will blow up a latin-1 string to 4 times the size.
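This is easy to see under CPython (exact sizes vary by version, so only the ratio matters here):

```python
import sys

ascii_only = "a" * 100
with_emoji = "a" * 100 + "😀"  # one astral code point appended

# Under the flexible string representation, the second string is stored
# at four bytes per code point, roughly quadrupling the payload.
print(sys.getsizeof(ascii_only), sys.getsizeof(with_emoji))
```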
This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.
OTOH, if you don't care about that, then you might as well just use bytes everywhere, and get the same thing. At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.
Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).
Only ran into this issue once and the library had an option to return everything as string so not a problem.
> At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.
Bytes in Python 3 don't mix with str: combining the two raises a TypeError rather than silently coercing.
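Concretely, mixing the two types fails loudly rather than producing mojibake:

```python
# Mixing bytes and str fails loudly in Python 3 instead of silently coercing.
try:
    b"hello " + "world"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# bytes keep many string-like methods, but everything stays in bytes-land:
print(b"hello world".split())  # [b'hello', b'world']
```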
it is significantly more sane in python 3.3+.
(Also, that explanation of UTF-8 is crap. UTF-8 is beautiful quite apart from its utility, but you'd hardly know it from the article.)
I've said it before: Unicode is a conflation of a good idea and an impossible idea. The good idea is a standard mapping from numbers to little pictures. That's all ASCII was. The impossible idea is a digital code for every way humans write. It's a form of digital cultural imperialism.
Unicode Consortium et al. are absurdly arrogant.
I heard, "Tell me more about what you think would be better." Here goes:
For written languages that are well-served by a simple sequence of symbols (English, etc.) there is no problem: a catalog of the mappings from numbers to pictures is all that is required. Put them in a sequence (anoint UTF-8 as the One True Encoding) and you're good to go.
For languages that are NOT well-served by this simple abstraction the first thing to do (assuming you have the requisite breadth and depth of linguistic knowledge) is to figure out simple formal systems that do abstract the languages in question. Then determine equivalence classes and standardize the formal systems.
Let the structure of the language abstraction be a "first-class" entity that has reference implementations. Instead of adding weird modifiers and other dynamic behavior to the code, let them be actual simple DSLs whose output is the proper graphics.
Human languages are like a superset of what computers can represent.
Here's the Unicode Standard[1] on Arabic:
> The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different contextual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it.
They baldly admit that Unicode is not good for drawing Arabic. I find the phrase "the inherent semantic identity of the letter" to be particularly rich. It's nearly mysticism.
If it is inconvenient to try to represent a language in terms of a sequence of symbols, then let's represent it as a (simple) program that renders the language correctly, which allows us to shoehorn non-linear behavior into a sequence of symbols.
If you think about it, this is already what Unicode is doing with modifiers and such. If you read further in the Unicode Standard doc I quoted above you'll see that they basically do create a kind of DSL for dealing with Arabic.
I'm saying: make it explicit.
Don't try to pretend that Unicode is one big standard for human languages. Admit that the "space" of writing systems is way bigger and more involved than Latin et. al. Study the problem of representing writing in a computer as a first-class issue. Publish reference implementations of code that can handle each kind of writing system along with the catalog of numbered pictures.
From the Unicode Standard again:
> The Arabic script is cursive, even in its printed form. As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vowels and various other marks may be written as combining marks called tashkil, which are applied to consonantal base letters. In normal writing, however, these marks are omitted.
Computer systems that are adapted to English are not going to work for Arabic. I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.
Consider the "Base-4 fractions in Telugu" https://blog.plover.com/math/telugu.html
The fact that we have a way to represent the graphics ౦౼౽౾౸౹౺౻ is great! But any software that wants to use them properly will require some code to translate to and from numbers in the computer to Telugu sequences of those graphics.
Let that be part of "Unicode" and I'll shut up. In the meantime, I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.
To sum up: I think the thing that replaces Unicode for dealing with human languages in digital form should:
A.) Be created by linguists with help from computer folks, not by computer folks with some nagging from linguists (apologies to the linguist/computer folk who actually did the stuff.)
B.) We should clearly state the problems first: What are the ways that human language are written down?
C.) Write specific DSLs for each kind of writing. Publish reference implementations.
I think that's it. Are you informed? Persuaded even? Entertained at least? ;-)
My goal was not to judge UTF-8 aesthetically, but to explain how it works and point out that it's a variable-width encoding which emphasizes its compatibility with ASCII for strings containing only code points <= U+007F.
Unicode Consortium et al. are absurdly arrogant.
I would agree that Unicode as it exists today involves some historical and historic bad decisions. But again, staying off value judgments with respect to Unicode itself since the point of the article was to explain how Python now handles it internally.
Apologies for being cranky. You did a great job explaining how Python now handles Unicode!
To me it was strange reading about UTF-32 first and then getting to UTF-8 from that context. It seemed to obscure the coolth and beauty of the format.
Overall a great article, sorry again for being so negative.