> So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters.
FWIW, I didn't read that as suggesting an upper bound of 5 bytes, but rather as an example using arbitrary numbers: N bytes of code units could, depending on the font providing the glyph(s) for the respective grapheme(s), be rendered at M times the size of, say, the letter A, where N != M -- despite the font otherwise being monospaced. Which is just another way of saying that you must consult the font for the character widths involved.
I think you're reading that quote as an assertion that:
For any grapheme G, G can be encoded in at most 5 bytes.
While what I think was being said was: There exists a grapheme G, where G is encoded in 5 bytes, and the respective glyph happens to be displayed at 3 times the width of a single character (e.g. the letter A), despite the font otherwise being monospaced. Therefore you *must* consult the font for each glyph to correctly determine character widths.

> You also need to read ahead as there are combination characters, for example a smiley combined with the color brown becomes a brown smiley.
Emphasis mine. Clearly combination characters are being treated separately.
Frankly I think it's crazy to read "up to 5 bytes" and not think that it suggests an upper bound. I think you're reaching for a highly questionable interpretation of a totally unambiguous clause. If the author meant to express what you're saying, they would certainly have written: "Some Unicode characters are 5 bytes long and take up the same canvas space as 3 characters". Which would still look incorrect if they followed it with the sentence "You also need to read ahead as there are combination characters...".
It is far more likely that the author is simply mistaken and should have said 4 bytes, and perhaps used the word "codepoint" instead of "character" in the original sentence. That's a perfectly understandable technical error, while the reinterpretation you're putting together would imply an error of colloquial language.
An idea for a variable-width encoding of 1 to 3 bytes: Read the MSB of each byte. If it's 0, don't read any more bytes. If it's 1, read the next byte, and repeat (up to 3 times). The non-MSB bits of each byte then make up the codepoint.
0xxxxxxx (ASCII)
1xxxxxxx 0xxxxxxx (0x0080 - 0x3FFF)
1xxxxxxx 1xxxxxxx 0xxxxxxx (0x4000 - 0x1FFFFF)
If the Unicode range grew in future to require further bits, you could extend the same technique to more than 3 bytes: 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx (0x200000 - 0xFFFFFFF)
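A quick sketch of the scheme in Python (my own illustration, not the original's code; I'm assuming the most significant payload bits come first, which the description doesn't actually specify):

```python
def encode(cp: int) -> bytes:
    """Encode a codepoint (0..0x1FFFFF, i.e. up to 3 bytes).

    MSB 1 means "another byte follows"; MSB 0 ends the sequence.
    Emitting the fewest 7-bit groups gives the minimal encoding.
    """
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("codepoint out of range")
    groups = [cp & 0x7F]          # 7-bit payload groups, low bits first
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()              # most significant payload bits first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one codepoint starting at pos; return (codepoint, next_pos)."""
    cp = 0
    while True:
        b = data[pos]
        pos += 1
        cp = (cp << 7) | (b & 0x7F)
        if b < 0x80:              # MSB clear: last byte of the sequence
            return cp, pos
```

For example, `encode(0x41)` gives `b'\x41'` and `encode(0x80)` gives `b'\x81\x00'`. Note that because `encode` emits the fewest 7-bit groups possible, it naturally produces the minimal form; a strict decoder would additionally have to reject overlong forms like `b'\x80\x41'`.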
The obvious drawback to this approach is that it is inherently serial. You need to read each byte before considering the next, so it would perform worse than UTF-8 in most cases. Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.
It also has the issue that you can represent some codepoints with more than one encoding: eg, put ASCII characters into 2 or 3 bytes. So you would need rules to use the minimal encoding for each codepoint.
As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.
You could also use a fixed-width 24-bit encoding to avoid reading serially, but as computers typically work in powers of 2, you would align 24-bit values at 32-bit addresses and load them into 32-bit+ registers. So there's nothing to really gain in performance over UTF-32, but you could save a bit of space.
In UTF-8, a 3-byte encoding spends 8 bits on the encoding itself -- a full byte's worth -- leaving only 16 bits for the codepoint. If you need higher codepoints you need to use 4 bytes, where 11 bits are the encoding and 21 bits are the codepoint.
So UTF-8 is space efficient for ASCII, but ~1/3 of the bits are used for the encoding for non-ASCII, versus a fixed 1/8 of the bits used for the encoding above for all 1-3 byte sequences. The above has a fixed 12.5% space overhead over raw codepoints; UTF-8 has 12.5% only for ASCII, and ~33% overhead for everything else.
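The overhead arithmetic above (marker bits divided by total bits, per encoded length) can be checked directly:

```python
# Proposed scheme: exactly 1 marker bit (the MSB) per byte.
for n in (1, 2, 3):
    assert n / (8 * n) == 0.125  # always 12.5%

# UTF-8 marker bits by encoded length:
#   1 byte:  0xxxxxxx                 -> 1 marker bit
#   2 bytes: 110xxxxx 10xxxxxx        -> 3 + 2 = 5
#   3 bytes: 1110xxxx + 2 x 10xxxxxx  -> 4 + 4 = 8
#   4 bytes: 11110xxx + 3 x 10xxxxxx  -> 5 + 6 = 11
utf8_marker_bits = {1: 1, 2: 5, 3: 8, 4: 11}
for n, marker in utf8_marker_bits.items():
    print(f"{n} byte(s): {marker / (8 * n):.1%} overhead")
```

This prints 12.5% for 1 byte, and roughly 31%, 33%, and 34% for the 2-, 3-, and 4-byte UTF-8 forms, which is where the "~1/3" figure comes from.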
Although it is not self-synchronizing like UTF-8, you can synchronize a reliable stream by holding a buffer of the previous byte. If the previous byte's value is >= 0x80, the current byte is part of the same character; if it's < 0x80, the current byte is the start of a new character. So substring matching etc. is still possible fairly efficiently, though slightly less so than with UTF-8. That makes it suitable for file storage, but not ideal for transmission.
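That resynchronization rule can be sketched in a few lines (again assuming the 1-3 byte scheme described above; the function name is mine):

```python
def sync(data: bytes, pos: int) -> int:
    """Advance pos to the start of the next character.

    A byte starts a character iff it is the first byte of the
    buffer, or the previous byte has its MSB clear (meaning the
    previous character ended there).
    """
    while 0 < pos < len(data) and data[pos - 1] >= 0x80:
        pos += 1
    return pos

# b'\x41' = 'A', b'\x81\x00' = codepoint 0x80, b'\x42' = 'B'
stream = b'\x41\x81\x00\x42'
print(sync(stream, 2))  # index 2 is mid-character; skips to 3 ('B')
```

The cost is one extra byte of lookbehind per position checked, which is why it's slightly less efficient than UTF-8's scheme, where the current byte alone tells you whether it starts a character.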
That said, most sane protocols will prefix a string with a length (in bytes), so self-synchronization is not always an issue.
Saying a ZWJ sequence can be "up to 5 bytes" is like saying "the current generation of Intel processors run at clock speeds of up to 2 GHz".
If they were referring to ZWJ sequences (I don't think they were; I think they were just misremembering the maximum encoded length of a codepoint) and they had said "up to 35 bytes", then I might agree with you. It's still not technically accurate, but it's a reasonable colloquial usage, like saying "human males can grow up to seven feet tall".