undefined | Better HN

0 pointshk__29mo ago0 comments

What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.

0 comments

19 comments · 1 top-level

gertop9mo ago· 18 in thread

UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

UTF-8 didn't win on technical merits, it won becausw it was mostly backwards compatible with all American software that previously used ASCII only.

When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

jcranmer9mo ago

> UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

UTF-8 and UTF-16 take the same number of characters to encode a non-BMP character or a character in the range U+0080-U+07FF (which includes most of the Latin supplements, Greek, Cyrillic, Arabic, Hebrew, Aramaic, Syriac, and Thaana). For ASCII characters--which includes most whitespace and punctuation--UTF-8 takes half as much space as UTF-16, while characters in the range U+0800-U+FFFF, UTF-8 takes 50% more space than UTF-16. Thus, for most European languages, and even Arabic (which ain't European), UTF-8 is going to be more compact than UTF-16.

The Asian languages (CJK-based languages, Indic languages, and South-East Asian, largely) are the ones that are more compact in UTF-16 than UTF-8, but if you embed those languages in a context likely to have significant ASCII content--such as an HTML file--well, it turns out the UTF-8 still wins out!

> When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

You'll notice that the encodings that are used are not UTF-16 either. Also, my understanding is that China generally defaults to UTF-8 nowadays despite a government mandate to use GB18030 instead, so it's largely Japan that is the last redoubt of the anti-Unicode club.

GoblinSlayer9mo ago

And when you download many megabytes of jabbascript to render 4kb of text, how does it matter what encoding you use?

amake9mo ago

Even Japan is mostly Unicode these days.

ISV_Damocles9mo ago

UTF-16 is also just as complicated as UTF-8 requiring multibyte characters to cover the entirety of Unicode, so it doesn't avoid the issue you're complaining about for the newest languages added, and it has the added complexity of a BOM being required to be sure you have the pairs of bytes in the right order, so you are more vulnerable to truncated data being unrecoverable versus UTF-8.

UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.

Mikhail_Edoshin9mo ago

No, UTF-16 is much simpler in that aspect. And its design is no less brilliant. (I've written an state machine encoder and decoder for both these encodings.) If an application works a lot with text I'd say UTF-16 looks more attractive for the main internal representation.

1 more reply

adgjlsfhk19mo ago

python does (although it will use 8 or 16 bits per character if all characters in the string fit)

simonask9mo ago

All of Europe outside of the UK and Enligh-speaking Ireland need characters outside of ASCII, but most letters are ASCII. For example, the string "blåbærgrød" in Danish (blueberry porridge) has about the densest occurrence of non-ASCII characters, but that's still only 30%. It takes 13 bytes in UTF-8, but 20 bytes in UTF-16.

Spanish has generally at most one accented vowel (á, ó, ü, é, ...) per word, and generally at most one ñ per word. German rarely has more than two umlauts per word, and almost never more than one ß.

UTF-16 is a wild pessimization for European languages, and UTF-8 is only slightly wasteful in Asian languages.

kbolino9mo ago

Thanks to UTF-16, which came out after UTF-8, there are 2048 wasted 3-byte sequences in UTF-8.

And unlike the short-sighted authors of the first version of Unicode, who thought the whole world's writing systems could fit in just 65,536 distinct values, the authors of UTF-8 made it possible to encode up to 2 billion distinct values in the original design.

xigoi9mo ago

Thanks to UTF-8, there are 13 wasted 1-byte sequences in UTF-8 :P

1 more reply

airza9mo ago

It's all fun and games until you hit an astral plane character in utf-16 and one of the library designers didn't realize not all characters are 2 bytes.

rmunn9mo ago

Which is why I've seen lots of people recommend testing your software with emojis, particularly recently-added emojis (many of the earlier emojis were in the basic multilingual plane, but a lot of newer emojis are outside the BMP, i.e. the "astral" planes). It's particularly fun to use the (U+1F4A9) emoji for such testing, because of what it implies about the libraries that can't handle it correctly.

EDIT: Heh. The U+1F4A9 emoji that I included in my comment was stripped out. For those who don't recognize that codepoint by hand (can't "see" the Matrix just from its code yet?), that emoji's official name is U+1F4A9 PILE OF POO.

1 more reply

adgjlsfhk19mo ago

With BOM issues, UTF-16 is way more complicated. For Chinese and Japenese, UTF8 is a maximum of 50% bigger, but can actually end up smaller if used within standard file formats like JSON/HTML since all the formatting characters and spaces are single bytes.

cyphar9mo ago

UTF-16 is absolutely not easier to work with. The vast majority of bugs I remember having to fix that were directly related to encoding were related to surrogate pairs. I suspect most programs do not handle them correctly because they come up so rarely but the bugs you see are always awful. UTF-8 doesn't have this problem and I think that's enough reason to avoid UTF-16 (though "good enough" compatibility with programs that only understand 8-bit-clean ASCII is an even better practical reason). Byte ordering is also a pernicious problem (with failure modes like "all of my documents are garbled") that UTF-8 also completely avoids.

It is 33% more compact for most (but not all) CJK characters, but that's not the case for all non-English characters. However, one important thing to remember is that most computer-based documents contain large amounts of ASCII text purely because the formats themselves use English text and ASCII punctuation. I suspect that most UTF-8 files with CJK contents are much smaller than UTF-16 files, but I'd be interested in an actual analysis from different file formats.

The size argument (along with a lot of understandable contention around UniHan) is one of the reasons why UTF-8 adoption was slower in Japan and Shift-JIS is not completely dead (though mainly for esoteric historical reasons like the 漢検 test rather than active or intentional usage) but this is quite old history at this point. UTF-8 now makes up 99% of web pages.

cyphar9mo ago

I went through a Japanese ePUB novel I happened to have on hand (the Japanese translation of 1984) and 65% of the bytes are ASCII bytes. So in this case UTF-16 would end up resulting in something like 53% more bytes (going by napkin math).

You could argue that because it will be compressed (and UTF-16 wastes a whole NUL byte for all ASCII) that the total file-size for the compressed version would be better (precisely because there are so many wasted bytes) but there are plenty of examples where files aren't compressed and most systems don't have compressed memory so you will pay the cost somewhere.

But in the interest of transparency, a very crude test of the same ePUB yields a 10% smaller file with UTF-16. I think a 10% size penalty (in a very favourable scenario for UTF-16) in exchange for all of the benefits of UTF-8 is more than an acceptable tradeoff, and the incredibly wide proliferation of UTF-8 implies most people seem to agree.

syncsynchalt9mo ago

UTF-16 has endian concerns and surrogates.

Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.

Mikhail_Edoshin9mo ago

Here is what an UTF-8 decoder needs to handle:

1. Invalid bytes. Some bytes cannot appear in an UTF-8 string at all. There are two ranges of these.

2. Conditionally invalid continuation bytes. In some states you read a continuation byte and extract the data, but in some other cases the valid range of the first continuation byte is further restricted.

3. Surrogates. They cannot appear in a valid UTF-8 string, so if they do, this is an error and you need to mark it so. Or maybe process them as in CESU but this means to make sure they a correctly paired. Or maybe process them as in WTF-8, read and let go.

4. Form issues: an incomplete sequence or a continuation byte without a starting byte.

It is much more complicated than UTF-16. UTF-16 only has surrogates that are pretty straightforward.

2 more replies

Iwan-Zotow9mo ago

There are no sane Chinese Japanese people who uses old encodings. None

kccqzy9mo ago

Two decades ago the typical simplified Chinese website did in fact use GB2312 and not UTF-8; traditional Chinese website used Big5; Japanese sites used Shift JIS. These days that's not true at all. Your comment is twenty years out of date.

j / k navigate · click thread line to collapse

0 comments

19 comments · 1 top-level

gertop9mo ago· 18 in thread

UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

UTF-8 didn't win on technical merits, it won becausw it was mostly backwards compatible with all American software that previously used ASCII only.

When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

jcranmer9mo ago

> UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

> When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

GoblinSlayer9mo ago

And when you download many megabytes of jabbascript to render 4kb of text, how does it matter what encoding you use?

amake9mo ago

Even Japan is mostly Unicode these days.

ISV_Damocles9mo ago

UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.

Mikhail_Edoshin9mo ago

1 more reply

adgjlsfhk19mo ago

python does (although it will use 8 or 16 bits per character if all characters in the string fit)

simonask9mo ago

UTF-16 is a wild pessimization for European languages, and UTF-8 is only slightly wasteful in Asian languages.

kbolino9mo ago

Thanks to UTF-16, which came out after UTF-8, there are 2048 wasted 3-byte sequences in UTF-8.

xigoi9mo ago

Thanks to UTF-8, there are 13 wasted 1-byte sequences in UTF-8 :P

1 more reply

airza9mo ago

It's all fun and games until you hit an astral plane character in utf-16 and one of the library designers didn't realize not all characters are 2 bytes.

rmunn9mo ago

1 more reply

adgjlsfhk19mo ago

cyphar9mo ago

syncsynchalt9mo ago

UTF-16 has endian concerns and surrogates.

Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.

Mikhail_Edoshin9mo ago

Here is what an UTF-8 decoder needs to handle:

1. Invalid bytes. Some bytes cannot appear in an UTF-8 string at all. There are two ranges of these.

4. Form issues: an incomplete sequence or a continuation byte without a starting byte.

It is much more complicated than UTF-16. UTF-16 only has surrogates that are pretty straightforward.

2 more replies

Iwan-Zotow9mo ago

There are no sane Chinese Japanese people who uses old encodings. None

kccqzy9mo ago

j / k navigate · click thread line to collapse