
divingdragon · 3mo ago

Really, as an East Asian language user the rest of the comments here make me want to scream.


exceptione · 3mo ago

I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

gfody · 3mo ago

> I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

there are over a million codepoints in unicode, thousands of them for latin and other language-agnostic symbols, emojis, etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications, hence it being the internal format of strings in C# and sql server and such.
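To make the per-character costs concrete, here is a minimal sketch that prints how many bytes a few sample characters take in each encoding. The sample characters and the helper name `widths` are arbitrary picks for illustration, not anything from the thread.

```python
# Bytes per character in UTF-8 vs UTF-16 for a few hand-picked samples.
# "utf-16-le" is used so that no byte-order mark is counted.
samples = {
    "A": "ASCII, U+0041",
    "é": "Latin-1 Supplement, U+00E9",
    "中": "CJK, U+4E2D",
    "😀": "emoji outside the BMP, U+1F600",
}

def widths(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch, label in samples.items():
    u8, u16 = widths(ch)
    print(f"{label}: utf-8={u8} bytes, utf-16={u16} bytes")

# ASCII text produces identical bytes under ASCII and UTF-8,
# which is the backwards compatibility mentioned above.
assert "hello".encode("ascii") == "hello".encode("utf-8")
```

Only the U+0800–U+FFFF range (the CJK sample here) is smaller in UTF-16; ASCII is smaller in UTF-8, and the other two samples cost the same in both.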

the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias. sure, utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, but that's not my database and probably not most people's

exceptione · 3mo ago

  > sure utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, that's not my database and probably not most people's

I think this website's audience begs to differ. But if you develop for S. Asia, I can see the pendulum swing to UTF-16. Even then, you have to account for this:

  «UTF-16 is often claimed to be more space-efficient than UTF-8 for East Asian languages, since it uses two bytes for characters that take 3 bytes in UTF-8. Since real text contains many spaces, numbers, punctuation, markup (for e.g. web pages), and control characters, which take only one byte in UTF-8, this is only true for artificially constructed dense blocks of text. A more serious claim can be made for Devanagari and Bengali, which use multi-letter words and all the letters take 3 bytes in UTF-8 and only 2 in UTF-16.»¹

In the same vein, with reference to³:

  «The code points U+0800–U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to the idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and markup.»²
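The break-even condition in that quote can be checked directly. With n1 ASCII code points and n3 code points in U+0800–U+FFFF, UTF-8 needs n1 + 3·n3 bytes and UTF-16 needs 2·n1 + 2·n3, so UTF-8 is only larger when n3 > n1. A rough sketch follows, where the two sample strings and the helper names are made up for illustration:

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes); utf-16-le avoids counting a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

def count_classes(text):
    """Count ASCII code points and code points in U+0800..U+FFFF."""
    ascii_count = sum(1 for ch in text if ord(ch) < 0x80)
    three_byte = sum(1 for ch in text if 0x800 <= ord(ch) <= 0xFFFF)
    return ascii_count, three_byte

# Hypothetical samples: a dense CJK run, and the same run wrapped in markup.
dense = "你好世界"
markup = '<p class="x">你好世界</p>'

for name, text in (("dense", dense), ("markup", markup)):
    n1, n3 = count_classes(text)
    u8, u16 = encoded_sizes(text)
    print(f"{name}: ascii={n1} three_byte={n3} utf-8={u8} utf-16={u16}")
```

The dense run favors UTF-16 (8 bytes against 12), while the marked-up version favors UTF-8 (29 bytes against 42), because the ASCII markup outnumbers the CJK characters. Which case is typical depends on the actual data, which is the point of disagreement in this thread.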

The .NET ecosystem isn't happy with UTF-16 being the default, but it is there in .NET and Windows for historical reasons.

  «Microsoft has stated that "UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms"»¹

___

1. https://en.wikipedia.org/wiki/UTF-16#Efficiency

2. https://en.wikipedia.org/wiki/UTF-8#Comparison_to_UTF-16

3. https://kitugenz.com/


gfody · 3mo ago

hn often makes me want to scream
