est, 13y ago

> Counting characters doesn't help.

Why? If you can count characters (code points) then it's natural that you can split or substring by characters.

Try this in JavaScript:

    '안녕하세요'.substr(2,2)

Internally, a fixed-length encoding is much faster than a variable-length encoding.

> Unicode does not work that way.

It DOES.

> Splitting on characters is garbage.

You messed up Unicode in Python on so many levels. Those characters you see in the Python console are actually not Unicode. They are just bytes in sys.stdout that happen to be correctly decoded and properly displayed. You should always use u'' for any kind of characters. '안녕하세요' is WRONG and may lead to unspecified behavior: it depends on your source file encoding, interpreter encoding and sys default encoding. If you display the string in a console, it depends on the console encoding; if it's a GUI or HTML widget, it depends on the GUI widget or content-type encoding.
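For anyone following along, here is a rough Python 3 sketch of the bytes-versus-text distinction described above. The post is about Python 2, where a bare '' literal is a byte string; in Python 3 the same split shows up as `bytes` versus `str`. UTF-8 is an assumption here, chosen only for illustration.

```python
# Python 3: str is a sequence of code points, bytes is a sequence of bytes.
text = '안녕하세요'
raw = text.encode('utf-8')  # assumption: UTF-8, where each Hangul syllable takes 3 bytes

code_point_count = len(text)
byte_count = len(raw)
print(code_point_count, byte_count)

# Slicing the raw bytes at an arbitrary offset can cut a character in half.
# errors='replace' substitutes U+FFFD for the truncated sequence.
broken = raw[:4].decode('utf-8', errors='replace')
print(repr(broken))
```

Indexing the byte string and indexing the decoded text give different answers, which is why it matters which of the two you are actually holding.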

> I'm not even leaving the BMP and it's broken!

Your unicode-fu is broken. It looks like your example contains identical Korean strings, which the ICU module in Chrome might have auto-normalized for you.

> You can't split decomposed Korean on character boundaries.

In a broken Unicode implementation, like the V8 JS engine in the Chrome browser.

> I happen to be using Python 3. It is internally using UCS-4.

For the love of BDFL read this

http://www.python.org/dev/peps/pep-0414/

http://docs.python.org/3/whatsnew/3.3.html

codeka, 13y ago

I'm sorry but you're wrong. I suggest you inform yourself better of the subject you're talking about before you call people "ignorant morons" next time.

dietrichepp is talking about Normalization Form D, which is a valid form of Unicode and cannot be counted using code points like you're doing.

Maybe you can try:

    '𠀋'.substr(0,1)
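To make the Normalization Form D point concrete, here is a small Python 3 sketch using only the standard `unicodedata` module. The variable names are mine, and the slice offsets mirror the `substr(2,2)` example from the parent post.

```python
import unicodedata

s = '안녕하세요'
nfc = unicodedata.normalize('NFC', s)  # precomposed syllables
nfd = unicodedata.normalize('NFD', s)  # decomposed into conjoining jamo

# Same text on screen, different number of code points.
print(len(nfc), len(nfd))

# Same offsets, different results: the NFD slice cuts through two syllables.
nfc_slice = nfc[2:4]
nfd_slice = nfd[2:4]
print(nfc_slice)
print([unicodedata.name(ch) for ch in nfd_slice])
```

Under NFC the slice is two whole syllables. Under NFD it is the trailing consonant of the first syllable followed by the leading consonant of the second, so a code point offset is not a character boundary.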

est (OP), 13y ago

yeah sure why not.

    >>> u'𡘓'[0:1]
    u'\U00021613'

    >>> u'Hi, Mr𡘓'[-1]
    u'\U00021613'

    >>> u'𠀋'[0:1]
    u'\U0002000b'

JavaScript won't work because the JS engine uses UCS-2, duh.

Actually, JavaScript mixes up Unicode strings and binary strings; that's why Node.js invented Buffer

http://nodejs.org/api/buffer.html

codeka, 13y ago

You've moved the goalposts:

    u'\U00021613'

This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again; he's talking about Normalization Form D.
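A quick Python 3 sketch of the difference, for illustration only. The helper names `utf16_units` and `utf32_units` are made up for this example; they just count code units by encoding and dividing by the unit size.

```python
def utf16_units(s):
    """Number of 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode('utf-16-be')) // 2


def utf32_units(s):
    """Number of 32-bit code units (that is, code points) in s."""
    return len(s.encode('utf-32-be')) // 4


han = '\U00021613'      # 𡘓, outside the BMP
a_umlaut = 'A\u0308'    # A followed by COMBINING DIAERESIS

# Outside the BMP: one UTF-32 unit, but a surrogate pair in UTF-16.
print(utf32_units(han), utf16_units(han))

# Combining sequence: two units even in UTF-32, yet one visible letter.
print(utf32_units(a_umlaut), utf16_units(a_umlaut))
```

So a wider code unit fixes the surrogate problem, but not the combining-character one.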

cmccabe, 13y ago

Code points aren't letters.

Consider the following sequence of code points: U+0041 U+0308 [edit: corrected sequence]

That equals this European letter: Ä

Two code points, one letter. MAGIC! You can also get the same-looking letter with a single code point using U+00C4 (Unicode likes redundancy).

Not all languages have letters. Not all languages that have letters represent each one with a single code point. Please think twice before calling people "morons."
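If you want to check this yourself, here is a minimal Python 3 sketch with the standard `unicodedata` module. The variable names are mine.

```python
import unicodedata

decomposed = '\u0041\u0308'  # A + COMBINING DIAERESIS
precomposed = '\u00c4'       # LATIN CAPITAL LETTER A WITH DIAERESIS

# Two code points versus one, and the strings do not compare equal as-is.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# Normalizing both to the same form makes them comparable.
same_after_nfc = unicodedata.normalize('NFC', decomposed) == precomposed
same_after_nfd = unicodedata.normalize('NFD', precomposed) == decomposed
print(same_after_nfc, same_after_nfd)
```

The code point count depends on which normalization form the text happens to be in, not on how many letters the reader sees.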

est (OP), 13y ago

> Two code points, one letter.

Yes, I understand there are a million ways to display the same shape using various Unicode sequences. But how does that make code point counting impossible?

AND if you are explicitly using COMBINING DIAERESIS instead of the single U+00C4, counting the diaeresis separately is wrong somehow?

Why don't we make a law stating that both ae and æ are a single letter?

cmccabe, 13y ago

I am responding to your earlier post which announced that UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Hopefully now you understand that just taking the number of UCS2 bytes and dividing by 2 does not give you the number of letters.

Just in case you don't, let's walk through it again.

UCS-2 big-endian representation of Ä:

    0x00 0x41 0x03 0x08

Another UCS-2 big-endian representation of Ä:

    0x00 0xc4

If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS2 does not "count unicode characters faster than UTF8." You still have to look at every byte to see how many letters you have, same as in UTF-8.
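The byte counts above can be reproduced with a few lines of Python 3. This is only a sketch: it uses the `utf-16-be` codec, which gives the same bytes as UCS-2 for these BMP characters, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
decomposed = 'A\u0308'   # A + COMBINING DIAERESIS
precomposed = '\u00c4'   # single code point for the same letter

decomposed_16 = decomposed.encode('utf-16-be')
precomposed_16 = precomposed.encode('utf-16-be')
print(decomposed_16.hex(' '), '|', precomposed_16.hex(' '))

# Dividing the byte length by 2 gives code units, not letters.
units_decomposed = len(decomposed_16) // 2
units_precomposed = len(precomposed_16) // 2
print(units_decomposed, units_precomposed)

# UTF-8 has the same problem: different byte counts for the same letter.
print(len(decomposed.encode('utf-8')), len(precomposed.encode('utf-8')))
```

Either way, the byte length alone does not tell you the number of letters; you have to inspect the code points.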

Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.
