
0 points · est · 12y ago · 0 comments

Because some of us are pissed that some BMP characters take 3 bytes in UTF-8: that's 50% more storage space wasted and 50% more time to read/write.

I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.

http://www.python.org/dev/peps/pep-0393/
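
A rough way to see the PEP 393 storage widths is to measure how much a string grows per added character. This is only a sketch: it assumes CPython 3.3 or later and its `sys.getsizeof` behavior, and `bytes_per_char` is a hypothetical helper name. Note that the 1-byte form covers all of Latin-1 (U+0000 to U+00FF), not just ASCII.

```python
import sys

def bytes_per_char(ch: str, n: int = 100) -> int:
    """Estimate per-character storage by comparing two string lengths.

    Assumes CPython 3.3+ (PEP 393); other implementations may differ.
    """
    short = ch * 2
    longer = ch * (2 + n)
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n

for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u4e2d"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

The width is chosen per string from its widest character, so one astral character makes every character in that string cost 4 bytes.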


13 comments · 3 top-level

stormbrew · 12y ago · 4 in thread

That's a reasonable replacement for ucs-4 for an internal representation, but it's not actually a character encoding like utf-8 and utf-16 are. It's just a tagged union of several encodings.

As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8 and 2 bytes in utf-16. It tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
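
The size comparison above is easy to check on sample text. The sketch below is illustrative only: the sample strings and the `sizes` helper are made up for this example, and `zlib` stands in for gzip (both use DEFLATE).

```python
import zlib

# Hypothetical sample texts: pure ASCII, pure CJK, and CJK inside HTML markup.
samples = {
    "latin": "The quick brown fox jumps over the lazy dog. " * 20,
    "cjk": "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8\u3067\u3059\u3002" * 20,
    "cjk_html": '<p class="note"><a href="http://example.com/">'
                '\u65e5\u672c\u8a9e</a></p>\n' * 20,
}

def sizes(text: str) -> dict:
    """Return raw and DEFLATE-compressed byte counts for UTF-8 and UTF-16."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
    return {
        "utf8": len(utf8),
        "utf16": len(utf16),
        "utf8_zlib": len(zlib.compress(utf8)),
        "utf16_zlib": len(zlib.compress(utf16)),
    }

for name, text in samples.items():
    print(name, sizes(text))
```

Pure CJK text hits the 3-versus-2 worst case, while the ASCII-heavy markup sample comes out smaller in UTF-8.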

est (OP) · 12y ago

> 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8

For latin alphabets, yes. For CJK, it's really bad. Things get worse if you dealt with non-BMP before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?)

rspeer · 12y ago

Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.

One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.

Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.
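
For readers who haven't met CESU-8: it encodes an astral character as a UTF-16 surrogate pair and then writes each surrogate as a 3-byte UTF-8 sequence, giving 6 bytes instead of the 4 that real UTF-8 uses. A minimal sketch, where `cesu8_encode` is a hypothetical helper written for illustration and not part of any library:

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: astral code points become two 3-byte surrogates."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                # 'surrogatepass' lets Python emit the surrogate as 3 bytes.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)

emoji = "\U0001F600"
print(emoji.encode("utf-8").hex())   # real UTF-8: 4 bytes
print(cesu8_encode(emoji).hex())     # CESU-8: 6 bytes
```

Either way, the emoji does not fit in a column type that caps characters at 3 bytes.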

kps · 12y ago

    > ... MySQL ...
    > why the hell do people even presume utf8 is max 3 bytes?

I think you answered your own question before you even asked.

rspeer · 12y ago

Except most text isn't plain text. HTML pages in CJK are still smaller in UTF-8 than in their respective countries' favorite encodings.


deathanatos · 12y ago · 3 in thread

The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes", but rather that the exposed API hides this from you and exposes a sequence of code points. This, I hope, will reduce errors, as code points, not code units, are often the better abstraction to work with (for some random string processing function).

So far as I know, Haskell is the only other language that exposes, as the defaultish-native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.

See: http://www.unicode.org/glossary/#code_point http://www.unicode.org/glossary/#code_unit
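
To make the code point versus code unit distinction concrete, here is a small sketch assuming Python 3.3 or later (where `len` on a `str` counts code points). The two helper functions are hypothetical names for this example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units the text needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def utf8_code_units(text: str) -> int:
    """Number of 8-bit code units (bytes) the text needs in UTF-8."""
    return len(text.encode("utf-8"))

s = "a\u4e2d\U0001F600"  # one ASCII, one BMP, one astral character

print(len(s))               # code points
print(utf16_code_units(s))  # what Java or C# string length would report
print(utf8_code_units(s))   # bytes on the wire
```

The same three-character string has three different "lengths" depending on which unit the API counts.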

millstone · 12y ago

Code points is a better abstraction than code units, but it's still a piss-poor abstraction.

Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).

An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
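
A short illustration of the substring problem, again in Python. The `naive_clusters` function is a deliberately simplified stand-in that only attaches combining marks to the preceding character; it is an assumption made for this example and does not implement the UAX #29 rules (it ignores zero width joiners, Hangul jamo, and more).

```python
import unicodedata

def naive_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it.

    Simplified approximation only; NOT full UAX #29 grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT + 'x'

print(len(s))             # code points, not user-perceived characters
print(repr(s[:1]))        # slicing by code point drops the accent
print(naive_clusters(s))  # the accent stays with its base character
```

Slicing by code point avoids splitting surrogate pairs but still cuts the accent off its base letter.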

pyre · 12y ago

This was my reaction too. It's Unicode all the way down... :)

cmccabe · 12y ago

Go allows you to iterate over a string as a series of code points.

sillysaurus2 · 12y ago · 3 in thread

Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.

Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.

erichurkman · 12y ago

> Storage space is cheap, and the price continues to fall.

At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.

est (OP) · 12y ago

> Storage space is cheap

True, but

1. time is precious. For example, you waste 50% more time for a fulltext indexing scan because utf8 is longer.

2. Memory. If you can't hold text in a single machine, you have bigger issues (e.g. clustering algorithms, persistency, redundancy, etc.)

3. Network transfer. If you can save 50% in a db connection rtt, you save a lot.

It makes no sense to save BMP in 3 bytes anyway.

acdha · 12y ago

You'd have to have rather weird data for it to be anywhere near 50% larger for real text (i.e. even if you only use Chinese, if you have punctuation, arabic numerals, quotes or URLs, HTML, etc. the averages cancel more than you might think) and a completely incompetent search engine design for that to remotely approach 50% more time to query or index.

If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).

