Python 3 tried to have its cake and eat it too by choosing the most compact encoding depending on the string, but in practice this wastes a lot of space. You'll double (or heaven forbid quadruple) your string size because of a single codepoint, and these codepoints are almost always a small percentage of the string. That's actually why UTF-16 and UTF-8 exist.
It would have been better for strings to default to UTF-8, and to add an optional encoding so the programmer can specify what kind of encoding to use. As it is now, in order to use (for example) UTF-16 strings in Python you have to keep them around as bytes, decode them to a string, perform string operations, and reencode them to bytes again. Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.
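The round trip described above can be sketched in a few lines (an illustration, not anyone's production code; the UTF-16 payload here is invented for the example):

```python
# Sketch of the decode/operate/re-encode dance: to "use" UTF-16 text in
# Python you must decode to str, operate, then encode back to bytes.
utf16_payload = "naïve café".encode("utf-16-le")  # bytes from some wire format

text = utf16_payload.decode("utf-16-le")   # decode to Python's internal repr
text = text.upper()                        # any str operation forces this hop
utf16_payload = text.encode("utf-16-le")   # re-encode to get UTF-16 back

print(utf16_payload.decode("utf-16-le"))   # NAÏVE CAFÉ
```

Every string operation pays for both conversions, which is the vanishing benefit the comment refers to.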
I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.
A better solution is to allow programmers to specify string encoding and default it to UTF-8. From that, there's a clear path to everything you'd want to do.
Perhaps I should give an example. Suppose you're parsing and dealing with something. Say HTML, since it's well-known. So you receive a long byte array starting "<html><body><p>Sometimes</p>". You parse the byte array and produce a number of objects, including up to four strings, namely "html", "body", "p" and "Sometimes", and by the time you've stored those in objects and allocated them, they occupy 32 bytes each on the heap. If you use UCS-4 the last may need 48 or 64 bytes, depending on your allocator's rounding and buckets. The byte array you get from the I/O subsystem may be 100k, but most of the strings in the code are short, and the impact of using UCS-4 is moderate.
A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers, now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.
You are looking at the issue from the perspective of a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby with multibyte support built into their string manipulation functions. 20 years ago, multibyte-aware string manipulation functions didn't even exist!
But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.
I definitely need indexes, and I don't really care about graphemes. I actually have only a vague idea what that is.
I write parsers typically by using a global string and lots of indices. The important thing for me is to be able to extract characters and slices at given positions, and to be able to say "parse error at line X character Y" where X and Y are helpful to the user most of the time.
I would be absolutely fine with working in UTF-8 bytes only (and that would be faster I guess), but there would be a more pressing need to recompute character positions (as a code point or grapheme index) from byte offsets at times.
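Recomputing "line X character Y" from a byte offset is cheap in practice. A minimal sketch, assuming the buffer is valid UTF-8 (the function name is made up for the example):

```python
def line_and_char(buf: bytes, offset: int):
    """Map a byte offset in a UTF-8 buffer to (line, character),
    both 1-based, counting characters as code points."""
    before = buf[:offset]
    line = before.count(b"\n") + 1
    last_nl = before.rfind(b"\n")
    # Decode only the current line's prefix to count code points.
    char = len(before[last_nl + 1:].decode("utf-8")) + 1
    return line, char

src = "ab\ncafé!".encode("utf-8")
print(line_and_char(src, src.index(b"!")))  # (2, 5): é is 2 bytes, 1 code point
```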
There are more abstract parsing methods where parser subroutines are implemented in a position agnostic way, but I'm very happy with my simple method.
If everything works on graphemes instead of code points (as I think Perl6 does) I will be happy to use that, but it's not so important from a practical standpoint.
No you don't. You need iterators, which behave like pointers. Let's say you're hundreds or thousands of characters into a string at the start of some token. Now you want to scan from that position to the end of the token.
With indexes it's fast only if indexing is by code point. In a language that properly supports graphemes, indexing would mean scanning from the beginning of the string to get to that index.
With iterators it can start scanning from that position directly. Same speed no matter where you are in the string. With indexes the larger your input the slower your parse gets, and not in a linear way.
It's also super easy to get a slice using a start and end iterator. As for line x character y messages, you can't get that directly from an index as it depends on how many new lines you parsed so indexing doesn't help there.
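The "scan from a saved position" pattern works directly on UTF-8 bytes too, because ASCII delimiters can never appear inside a multibyte sequence. A hedged sketch (the function is hypothetical, delimiters chosen arbitrarily):

```python
def scan_token(buf: bytes, pos: int):
    """Scan a token starting at byte position `pos` in a UTF-8 buffer.
    ASCII bytes never occur as UTF-8 continuation bytes, so comparing
    raw bytes against b" \t\n" is safe without decoding."""
    start = pos
    while pos < len(buf) and buf[pos] not in b" \t\n":
        pos += 1
    return buf[start:pos].decode("utf-8"), pos

buf = "héllo wörld".encode("utf-8")
token, end = scan_token(buf, 0)
print(token)  # héllo
```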
Of course not, but it was considered that breaking the O(1) indexing guarantee was a bridge too far even in the breaky release of Python 3.
> A better solution is to allow programmers to specify string encoding and default it to UTF-8.
Agreed. UTF-8 is the sensible default for most people.
So, don't decode to a string, and do all your character manipulation on the bytes.
> A better solution is to allow programmers to specify string encoding and default it to UTF-8.
Absolutely not: the internal representation of a string should be of no interest to a user of your language. The 'best' solution is to represent strings as a list of index lookups into a palette, and to update the palette as new graphemes are seen. This is similar to the approach Perl6 is using[0].
[0]: https://6guts.wordpress.com/2015/12/05/getting-closer-to-chr...
WHAT?!? I suppose that you've only ever worked with Latin characters. Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:
מהי מהירות האווירית של סנונית ארופאית ללא משא? ("What is the airspeed velocity of an unladen European swallow?")
Yes, that is a Hebrew Monty Python quote. Now try it with a smiley somewhere in the string (HN filtered out my attempt to post the string with a smiley).
Is each application to maintain their own dictionary of code points? If the map is to be in a library, then why not have it in the language itself?
# Apparently we expect the field to be in this format ¯\_(ツ)_/¯
Right above the code he'd just fixed.
Of course, the moment we pushed the update it brought production down, because the Python interpreter doesn't understand Unicode in source files unless you specify which encoding you are using.
After that, "¯\_(ツ)_/¯" became a synonym for his name on our HipChat server, heh.
In Python 3, source code files are assumed to be UTF-8.
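This default (PEP 3120), and the PEP 263 coding declaration that Python 2 required, can both be observed with the standard library's `tokenize.detect_encoding`:

```python
import io
import tokenize

# Python 3 assumes UTF-8 when no declaration is present (PEP 3120);
# a PEP 263 coding comment still overrides it.
plain = b"x = 1\n"
declared = b"# -*- coding: latin-1 -*-\nx = 1\n"

print(tokenize.detect_encoding(io.BytesIO(plain).readline)[0])     # utf-8
print(tokenize.detect_encoding(io.BytesIO(declared).readline)[0])  # iso-8859-1
```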
Though I also think the struggle is mostly due to people being stuck in an everything-is-like-ASCII mindset, and though I didn't get into that, it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.
Personally I'd like everyone to just actually learn at least the things about Unicode that I went into here (such as why "one code point == one character" is a wrong assumption), and I think that'd alleviate a lot of the pain. I also avoided talking much about normalization, because too many people hear about it and decide they can just normalize to NFKC and go back to assuming code point/character equivalence post-normalization.
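A quick demonstration of why normalization doesn't restore code point/character equivalence: NFKC composes some sequences into single code points, but ZWJ emoji sequences (among others) stay multi-code-point:

```python
import unicodedata

# NFKC composes "e" + COMBINING ACUTE ACCENT into a single code point...
decomposed = "e\u0301"
print(len(unicodedata.normalize("NFKC", decomposed)))  # 1

# ...but many graphemes remain multi-code-point after normalization,
# e.g. a family emoji built from a ZWJ sequence:
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man+ZWJ+woman+ZWJ+girl
print(len(unicodedata.normalize("NFKC", family)))      # 5
```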
Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
I actually like UTF-8 because it will break very quickly, and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.
All the other options will also break, but later on:
- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.
- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin color emoji, and once again, you're back at square one.
You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept that "str[3]" is hopeless very quickly. It helps a lot if your language has separate types for "byte" and "Unicode codepoint", however, so you can't accidentally treat a single byte as a character.
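The breakage points above are easy to reproduce in Python:

```python
# UTF-8 widths for the examples above: one, two, three, and four bytes.
for ch in ["e", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))

# Two strings that render identically but differ at the code point level:
composed = "\u00e9"        # é as one code point
decomposed = "e\u0301"     # e + U+0301 COMBINING ACUTE ACCENT
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False
```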
The solution to that is simple, don't let the programmer access individual bytes in a Unicode string.
Get rid of indexing into them and replace it with iterators. Make string handling functions work on code points at the very least but better yet on grapheme clusters. There's a little more to it than that but it's a good start.
Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I believe it's the ability to index into strings that's to blame, not the encoding used.
Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.
Here's an alternative: Use UTF-8 as the internal representation, but don't expose it to the user.
If you're iterating over a string one rune or one grapheme at a time, the UTF-8 substructure is hidden from the user. Only if the user uses an explicit numeric subscript do you need to know a rune's position in the string. When a request by subscript comes in, scan the string and build an index of rune subscript -> byte position. This is expensive, but no worse than UTF-32 in space usage or expansion to UTF-32 in time.
Optimizations:
- Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)
- Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.
- Regular expression processing has to be UTF-8 aware. It shouldn't need an index by rune.
This would maintain Python's existing semantics while reducing memory consumption.
Some performance measurement tool that finds all the places where an index by rune has to be built is useful. It's rare that you really need this, but sometimes you do.
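The lazy-index scheme can be sketched roughly as follows (a hypothetical class, not an actual implementation; negative subscripts and slices are omitted):

```python
class Utf8String:
    """Sketch of the proposal above: UTF-8 bytes internally, with a
    code-point index built lazily on the first numeric subscript."""

    def __init__(self, s: str):
        self._buf = s.encode("utf-8")
        self._index = None  # rune subscript -> byte offset, built on demand

    def __iter__(self):
        # Iteration never needs the index.
        return iter(self._buf.decode("utf-8"))

    def __getitem__(self, i: int):
        # Non-negative subscripts only in this sketch.
        if self._index is None:
            # Continuation bytes look like 0b10xxxxxx; every other byte
            # starts a rune. The same test lets you back up by rune.
            self._index = [off for off, b in enumerate(self._buf)
                           if b & 0xC0 != 0x80]
        start = self._index[i]
        end = self._index[i + 1] if i + 1 < len(self._index) else len(self._buf)
        return self._buf[start:end].decode("utf-8")

s = Utf8String("héllo😀")
print(s[1], s[5])  # é 😀
```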
In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust, however, won't let you materialize an invalid &str without "unsafe".
If you need to make a decision based on the content of a string, then you often need to make a normalized (the same way for both) copy of the inputs.
Most importantly, if you feed in garbage, you get out the SAME garbage. The real world, and historical data, are messy. Trying to be smart can often lead to the most disastrous consequences. Being conservative and tolerant allows for intentional planning to handle the conversion at the source, if and when desired.
Or you can take the C/C++ approach and have a character be 1 byte, 2 bytes, or multi-byte. It's a constant pain in C/C++ to interface between two libraries when one decided to use char and the other wchar_t!
There are pros and cons to both approaches. The prime ones being that []byte allows for easy random access, whereas []rune usually takes O(n) to work with (unless you store rune lengths separately, which is memory intensive).
I guess it's about the right level of abstraction, so that you can choose if you're working with bytes (binary I/O, when you know it's ascii etc.) and when with runes (most situations).
I still haven't decided whether I prefer the Python approach or the Go one.
With integers you can do things like concatenate two strings and adjust the indexes referring to the second string by adding the length of the first one. If you invent a new position type you have to add support for several things like this.
In any case I think the Python people were right to carry on using integers.
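The integer-arithmetic convenience mentioned above, in miniature:

```python
a, b = "héllo ", "wörld"
i = b.index("ö")   # an index into b, in code points

combined = a + b
# Integer indexes survive concatenation by simple arithmetic:
j = len(a) + i
print(combined[j])  # ö
```

An opaque position type would need explicit support for this kind of offsetting, as Swift's String.Index does.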
Rust will panic on invalid slices unless you first convert to raw bytes, and then it will not allow converting invalid slices back to a string in safe rust (in unsafe you're obviously on your own).
Safe Rust guarantees and requires[0] that strings are valid UTF8 at all times.
That aside, essentially all of your desires are part of Swift's strings; you should check them out.
> Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)
Rust does that through the `chars()` iterator[1] which iterates through USVs (codepoints) and can be iterated from both ends. Sadly unlike Swift it does not ship with a grapheme cluster iterator. Happily there is a unicode_segmentation crate[2]. Swift also uses iterators but has more of them: the default iteration works on extended grapheme clusters, and alternate iterators are USV, UTF-16 and UTF-8.
If indexing is necessary for some reason Rust also has char_indices() which iterates on the USV and its (byte) position in the string.
> - Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.
That is what Swift does. `String.index(of:String)` will return a String.Index: https://developer.apple.com/documentation/swift/string.index and indexed String methods will work based on that index type. This includes "reindexing" (offsetting) which is done using String.index(String.Index, offsetBy: String.IndexDistance). Furthermore String exposes two built-in indexes startIndex and endIndex as well as an "indices" iterator.
> This would maintain Python's existing semantics while reducing memory consumption.
It would not maintain O(1) USV indexing (especially in the C API), which was the reason for not just switching to UTF8.
In fact, FSR strings already contain a full UTF8 representation of the string[3], which the latin1 representation can share for pure ASCII strings.
[0] a non-utf8 str is one of Rust's 10 undefined behaviours, part of the "invalid primitive values" section alongside null references or invalid booleans: https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html
[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...
[2] https://kbknapp.github.io/clap-rs/unicode_segmentation/index...
[3] https://github.com/python/cpython/blob/49b2734bf12dc1cda80fd...
I agree that Python 2's Unicode handling is broken. That's why I just stored UTF-8 in a normal string and avoided the whole mess. The only thing I have to do is validate any input from the outside world is really UTF-8.
And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.
I disagree with that premise. It should operate on grapheme clusters. Operating on code points falls into the same trap as operating on bytes.
> a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.
Those operations should have been removed. Indexing is the big one that needs a fixed-width internal representation for speed. Code could have been rewritten to not require indexing. But mechanical translation from Python 2 to 3 was a goal, and because of that they couldn't radically change the Unicode API for the better.
> And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.
You're going to pay the price for 4 byte per codepoint strings quite often. A single emoji will blow up a latin-1 string to 4 times the size.
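This is easy to see under CPython (exact sizes vary by version, so only the ratio matters here):

```python
import sys

ascii_only = "a" * 100
with_emoji = "a" * 100 + "😀"  # one astral code point appended

# Under the flexible string representation, the second string is stored
# at four bytes per code point, roughly quadrupling the payload.
print(sys.getsizeof(ascii_only), sys.getsizeof(with_emoji))
```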
This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.
OTOH, if you don't care about that, then you might as well just use bytes everywhere, and get the same thing. At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.
Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).
Only ran into this issue once and the library had an option to return everything as string so not a problem.
> At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.
Bytes in Python 3 don't mix with str: combining the two raises a TypeError rather than silently coercing.
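Concretely, mixing the two types fails loudly rather than producing mojibake:

```python
# Mixing bytes and str fails loudly in Python 3 instead of silently coercing.
try:
    b"hello " + "world"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# bytes keep many string-like methods, but everything stays in bytes-land:
print(b"hello world".split())  # [b'hello', b'world']
```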
it is significantly more sane in python 3.3+.
(Also, that explanation of UTF-8 is crap. UTF-8 is beautiful quite apart from its utility, but you'd hardly know it from the article.)
I've said it before: Unicode is a conflation of a good idea and an impossible idea. The good idea is a standard mapping from numbers to little pictures. That's all ASCII was. The impossible idea is a digital code for every way humans write. It's a form of digital cultural imperialism.
Unicode Consortium et al. are absurdly arrogant.
I heard, "Tell me more about what you think would be better." Here goes:
For written languages that are well-served by a simple sequence of symbols (English, etc.) there is no problem: a catalog of the mappings from numbers to pictures is all that is required. Put them in a sequence (anoint UTF-8 as the One True Encoding) and you're good to go.
For languages that are NOT well-served by this simple abstraction the first thing to do (assuming you have the requisite breadth and depth of linguistic knowledge) is to figure out simple formal systems that do abstract the languages in question. Then determine equivalence classes and standardize the formal systems.
Let the structure of the language abstraction be a "first-class" entity that has reference implementations. Instead of adding weird modifiers and other dynamic behavior to the code, let them be actual simple DSLs whose output is the proper graphics.
Human languages are like a superset of what computers can represent.
Here's the Unicode Standard[1] on Arabic:
> The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different contextual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it.
They baldly admit that Unicode is not good for drawing Arabic. I find the phrase "the inherent semantic identity of the letter" to be particularly rich. It's nearly mysticism.
If it is inconvenient to try to represent a language in terms of a sequence of symbols, then let's represent it as a (simple) program that renders the language correctly, which allows us to shoehorn non-linear behavior into a sequence of symbols.
If you think about it, this is already what Unicode is doing with modifiers and such. If you read further in the Unicode Standard doc I quoted above you'll see that they basically do create a kind of DSL for dealing with Arabic.
I'm saying: make it explicit.
Don't try to pretend that Unicode is one big standard for human languages. Admit that the "space" of writing systems is way bigger and more involved than Latin et. al. Study the problem of representing writing in a computer as a first-class issue. Publish reference implementations of code that can handle each kind of writing system along with the catalog of numbered pictures.
From the Unicode Standard again:
> The Arabic script is cursive, even in its printed form. As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vowels and various other marks may be written as combining marks called tashkil, which are applied to consonantal base letters. In normal writing, however, these marks are omitted.
Computer systems that are adapted to English are not going to work for Arabic. I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.
Consider the "Base-4 fractions in Telugu" https://blog.plover.com/math/telugu.html
The fact that we have a way to represent the graphics ౦౼౽౾౸౹౺౻ is great! But any software that wants to use them properly will require some code to translate to and from numbers in the computer to Telugu sequences of those graphics.
Let that be part of "Unicode" and I'll shut up. In the meantime, I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.
To sum up: I think the thing that replaces Unicode for dealing with human languages in digital form should:
A.) Be created by linguists with help from computer folks, not by computer folks with some nagging from linguists (apologies to the linguist/computer folk who actually did the stuff.)
B.) We should clearly state the problems first: What are the ways that human language are written down?
C.) Write specific DSLs for each kind of writing. Publish reference implementations.
I think that's it. Are you informed? Persuaded even? Entertained at least? ;-)
My goal was not to judge UTF-8 aesthetically, but to explain how it works and point out that it's a variable-width encoding which emphasizes its compatibility with ASCII for strings containing only code points <= U+007F.
Unicode Consortium et al. are absurdly arrogant.
I would agree that Unicode as it exists today involves some historical and historic bad decisions. But again, staying off value judgments with respect to Unicode itself since the point of the article was to explain how Python now handles it internally.
Apologies for being cranky. You did a great job explaining how Python now handles Unicode!
To me it was strange reading about UTF-32 first and then getting to UTF-8 from that context. It seemed to obscure the coolth and beauty of the format.
Overall a great article, sorry again for being so negative.