masklinn 10y ago

Turns out there are precomposed versions of these clusters, so your system might just be using these.

Could you retry with the input "देवनागरीदेवनागरी"?

hahainternet 10y ago

I'm not quite sure how to interpret the output as it doesn't render particularly kindly in my terminal:

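  sub MAIN($s) {
      # print the character count and the full input string
      say "{$s.chars}: $s";
      # keep the first 12 "characters" of the input
      my $b = $s.substr(0, 12);
      say "{$b.chars}: $b";
  }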

  $ perl6 hn-test2.p6 देवनागरीदेवनागरी
  16: देवनागरीदेवनागरी
  12: देवनागरीदेवन

masklinn (OP) 10y ago

So apparently perl6 is also "wrong" and operates on codepoints. Your system composed my original string, and each (base, diacritic) pair was pasted as a single precomposed character (I expect that if you try out the Python version on your system you'll also get the "right" answer).

The new string is composed of 10 user-visible characters (5 characters repeated twice) but 16 codepoints (and this time I carefully checked that there was no precomposed version):

    DEVANAGARI LETTER DA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER VA
    DEVANAGARI LETTER NA
    DEVANAGARI VOWEL SIGN AA
    DEVANAGARI LETTER GA
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN II
    DEVANAGARI LETTER DA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER VA
    DEVANAGARI LETTER NA
    DEVANAGARI VOWEL SIGN AA
    DEVANAGARI LETTER GA
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN II

Operating on codepoints, both versions cut after the second DEVANAGARI LETTER NA (न), which breaks that grapheme cluster (it should be ना) and drops the final two clusters, ग and री.
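For anyone who wants to reproduce the codepoint view, here is a minimal Python 3 sketch using only the standard `unicodedata` module. The `WORD` constant and the variable names are illustrative, not from the thread; the string is spelled with escapes so that no copy/paste step can recompose it.

```python
import unicodedata

# The eight codepoints of the Devanagari word, written as escapes
# (illustrative constant, not from the thread).
WORD = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
s = WORD * 2

# In Python 3, len() and slicing count codepoints, not grapheme clusters.
codepoint_count = len(s)
cut = s[:12]

# Unicode name of every codepoint, for comparison with the list above.
names = [unicodedata.name(ch) for ch in s]

print(codepoint_count)
for name in names:
    print("    " + name)

# The last codepoint that survives the cut.
print(unicodedata.name(cut[-1]))
```

The final line should report the base letter on its own, i.e. the slice ends in the middle of a cluster.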

raiph 10y ago

> So apparently perl6 is also "wrong" and operates on codepoints

Yes and no. Yes, because the in-development Rakudo compiler is currently giving the wrong result; no, because it is designed to operate on grapheme clusters (but has bugs).

(You can work with codepoints if you really want to, but the normal string and character functions on the normal string type, Str, work, or more accurately are supposed to work, on the assumption that "character" == grapheme cluster. Afaik this is supposed to match the Unicode default Extended Grapheme Cluster specification.)

Fwiw I've filed a bug: https://rt.perl.org/Ticket/Display.html?id=125927
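To make the "character == grapheme cluster" idea concrete, here is a rough Python sketch. It is a deliberately simplified approximation: it attaches every character whose general category starts with "M" (combining marks) to the preceding character. That is an assumption made for illustration, not the full Unicode extended grapheme cluster algorithm, and `approx_clusters` is a hypothetical helper, not a standard library function.

```python
import unicodedata


def approx_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified assumption: a cluster is one character plus any following
    characters whose general category is Mn, Mc or Me. This is NOT the
    full Unicode extended grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Same test string as in the thread, spelled with escapes.
WORD = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
clusters = approx_clusters(WORD * 2)

# Truncating on cluster boundaries never splits a base from its vowel sign.
first_seven = "".join(clusters[:7])

print(len(clusters))
print(first_seven)
```

This happens to give the expected grouping for the test string, but it will be wrong for things like emoji joined with zero-width joiners, Hangul jamo sequences or regional-indicator flags, which need the real segmentation rules.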

hahainternet 10y ago

Yeah, you're right: a caveat in the docs says that current implementations aren't finished with this. I was under the impression the NFG work was done, but I'll catch up with people on IRC.

raiph 10y ago

> I expect that if you try out the Python version on your system you'll also get the "right" answer.

I don't think so. In my tests, standard Python (2.7 and 3.5) ignores grapheme clusters.

masklinn (OP) 10y ago

Python does ignore grapheme clusters. That point was about my original test case, which used grapheme clusters that I later found out had precomposed equivalents, so a transfer chain performing NFC would leave the test case with no combining characters (or multi-codepoint grapheme clusters) left in it.
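A small Python sketch of that NFC effect, using the standard `unicodedata.normalize`. The original test string is not shown here, so "e" plus a combining acute accent stands in for it as an assumed example of a pair with a precomposed equivalent; the variable names are illustrative.

```python
import unicodedata

# Stand-in for a cluster that has a precomposed form (assumed example):
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# The Devanagari test string, spelled with escapes.
WORD = "\u0926\u0947\u0935\u0928\u093e\u0917\u0930\u0940"
s = WORD * 2
s_nfc = unicodedata.normalize("NFC", s)

# NFC folds the Latin pair into one codepoint...
print(len(decomposed), len(composed))
# ...but leaves the Devanagari string with all of its vowel signs.
print(len(s), len(s_nfc))
```

So a test string survives an NFC-normalizing transfer chain as a useful test only if its clusters have no precomposed equivalents.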
