I like the design of the Python 3.3 string representation (PEP 393). Latin-1 text (including ASCII) takes 1 byte per character, the rest of the BMP takes 2 bytes, and everything else takes 4 bytes.
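To see this on your own machine, here is a rough sketch that estimates the per-character width by measuring two strings of different lengths and subtracting, so the fixed object header cancels out. It assumes CPython 3.3 or later; `bytes_per_char` is just a name made up for this example, and other Python implementations may store strings differently.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of `ch`.

    Subtracting the sizes of two strings cancels the fixed object
    header and leaves only the per-character cost.
    Assumption: CPython 3.3+ (flexible string representation).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII 'a'": "a",
    "Latin-1 e-acute": "\u00e9",
    "BMP CJK": "\u4e2d",
    "astral emoji": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {bytes_per_char(ch)} byte(s) per character")
```

Note that the width is chosen per string, not per character: one astral character in a long ASCII string makes the whole string use 4 bytes per character.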
As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of characters that take one byte in UTF-8 and two bytes in UTF-16, so it tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
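A quick way to check this for your own data is to encode it both ways and compare. The sketch below uses made-up sample strings (not real corpus data) and the `zlib` module, which implements the same DEFLATE algorithm gzip uses; `sizes` is a hypothetical helper name.

```python
import zlib

def sizes(text):
    """Return raw and DEFLATE-compressed byte counts for both encodings."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # the -le variant avoids the 2-byte BOM
    return {
        "utf8": len(utf8),
        "utf16": len(utf16),
        "utf8_deflate": len(zlib.compress(utf8)),
        "utf16_deflate": len(zlib.compress(utf16)),
    }

# Illustrative samples only; real text will give different ratios.
english = "The quick brown fox jumps over the lazy dog. " * 50
japanese = "日本語のテキストは三バイトになります。" * 50
mixed = '<p class="note">日本語</p>\n' * 50  # CJK wrapped in ASCII markup

for name, text in [("english", english), ("japanese", japanese), ("mixed", mixed)]:
    print(name, sizes(text))
```

Pure CJK text hits the 50% worst case, but once it is wrapped in ASCII markup UTF-8 comes out smaller, and the compressed sizes are far below either raw size for repetitive input like this.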
For Latin alphabets, yes. For CJK, it's really bad. Things get worse once you have to deal with non-BMP characters, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?)
One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.
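For anyone who hasn't run into CESU-8: it encodes each UTF-16 surrogate as its own 3-byte sequence, so an astral character becomes 6 bytes instead of the 4 bytes real UTF-8 uses. As far as I know Python's standard library has no CESU-8 codec, so this is a hand-rolled sketch; `cesu8_encode` is a hypothetical name, not a library function.

```python
def cesu8_encode(text):
    """Encode `text` as CESU-8 (illustrative sketch, not a full codec).

    BMP characters are encoded exactly as in UTF-8. Characters above
    U+FFFF are first split into a UTF-16 surrogate pair, and each
    surrogate is then written as a 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for unit in (high, low):
                # 'surrogatepass' lets the UTF-8 codec emit surrogates.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

emoji = "\U0001F600"
print("UTF-8: ", emoji.encode("utf-8").hex())
print("CESU-8:", cesu8_encode(emoji).hex())
```

No sequence in the CESU-8 output is longer than 3 bytes, which is exactly the shape a "UTF-8 is max 3 bytes" storage layer will accept.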
Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.
> ... MySQL ...
> why the hell do people even presume utf8 is max 3 bytes?
I think you answered your own question before you even asked.
As far as I know, Haskell is the only other language that exposes, as its default-ish native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.
See: http://www.unicode.org/glossary/#code_point http://www.unicode.org/glossary/#code_unit
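To make the distinction concrete, here is a small sketch counting the same string three ways. It assumes Python 3.3 or later, where `len()` on a `str` counts code points; `unit_counts` is a name invented for this example.

```python
def unit_counts(text):
    """Count code points and the code units of two encodings."""
    return {
        "code_points": len(text),                          # what Python 3.3+ exposes
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # what Java/C#/JavaScript expose
        "utf8_units": len(text.encode("utf-8")),            # one code unit per byte
    }

# One ASCII letter followed by one astral (non-BMP) character.
print(unit_counts("a\U0001F600"))
```

The astral character is one code point but two UTF-16 code units (a surrogate pair) and four UTF-8 code units, so indexing by code unit can land in the middle of it.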
Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true that working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).
An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
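Here is a sketch of how code-point slicing goes wrong, plus a deliberately simplified workaround. To be loud about the assumption: `rough_clusters` is a hypothetical helper that only attaches combining marks to the preceding character. It does NOT implement TR29 (no zero width joiners, no Hangul, no emoji sequences), so treat it as an illustration and use a real grapheme segmentation library for anything serious.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation only: the real grapheme cluster rules in
    TR29 cover many more cases than combining marks.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

by_code_point = word[:4]                       # silently drops the accent
by_cluster = "".join(rough_clusters(word)[:4])  # keeps 'e' and its accent together

print(len(word), len(rough_clusters(word)))
print(by_code_point == "cafe", by_cluster == word)
```

The string has 5 code points but 4 user-visible characters, and taking "the first 4 characters" by code point changes the last letter.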
Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.
At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.
True, but
1. Time is precious. For example, a full-text indexing scan can take up to 50% longer because the UTF-8 text is longer.
2. Memory. If you can't hold text in a single machine, you have bigger issues (e.g. clustering algorithms, persistency, redundancy, etc.)
3. Network transfer. If you can save 50% in a db connection rtt, you save a lot.
It makes no sense to store BMP characters in 3 bytes anyway.
If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).