> So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters.
FWIW, I didn't read that as suggesting an upper bound of 5 bytes, but rather as an example using arbitrary numbers: N bytes of code units could, depending on the font providing the glyph(s) for the respective grapheme(s), be rendered at M times the size of, say, the letter A, where N != M -- despite the font otherwise being monospaced. Which is just another way of saying that you must consult the font for the character widths involved.
I think you're reading that quote as an assertion that:
For any grapheme G, G can be encoded in at most 5 bytes.
While what I think was being said was: There exists a grapheme G, where G is encoded in 5 bytes, and the respective glyph happens to be displayed at 3 times the width of a single character (e.g. the letter A), despite the font otherwise being monospaced. Therefore you *must* consult the font for each glyph to correctly determine character widths.

> You also need to read ahead as there are combination characters, for example a smiley combined with the color brown becomes a brown smiley.
Emphasis mine. Clearly combination characters are being treated separately.
Frankly I think it's crazy to read "up to 5 bytes" and not think that it suggests an upper bound. I think you're reaching for a highly questionable interpretation of a totally unambiguous clause. If the author meant to express what you're saying, they would certainly have written: "Some Unicode characters are 5 bytes long and take up the same canvas space as 3 characters". Which would still look incorrect if they followed it with the sentence "You also need to read ahead as there are combination characters...".
It is far more likely that the author is simply mistaken and should have said 4 bytes, and perhaps used the word "codepoint" instead of "character" in the original sentence. That's a perfectly understandable technical error, while the reinterpretation you're putting together would imply an error of colloquial language.
An idea for a variable-width encoding of 1 to 3 bytes: Read the MSB of each byte. If it's 0, don't read any more bytes. If it's 1, read the next byte, and repeat (up to 3 times). The non-MSB bits of each byte then make up the codepoint.
0xxxxxxx (ASCII)
1xxxxxxx 0xxxxxxx (0x0080 - 0x3FFF)
1xxxxxxx 1xxxxxxx 0xxxxxxx (0x4000 - 0x1FFFFF)
If the Unicode range grew in future to require further bits, you could extend the same technique to more than 3 bytes: 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx (0x200000 - 0xFFFFFFF)
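A quick sketch of the scheme in Python (my own illustration, not the original's code; I'm assuming the most significant payload bits come first, which the description doesn't actually specify):

```python
def encode(cp: int) -> bytes:
    """Encode a codepoint (0..0x1FFFFF, i.e. up to 3 bytes).

    MSB 1 means "another byte follows"; MSB 0 ends the sequence.
    Emitting the fewest 7-bit groups gives the minimal encoding.
    """
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("codepoint out of range")
    groups = [cp & 0x7F]          # 7-bit payload groups, low bits first
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()              # most significant payload bits first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one codepoint starting at pos; return (codepoint, next_pos)."""
    cp = 0
    while True:
        b = data[pos]
        pos += 1
        cp = (cp << 7) | (b & 0x7F)
        if b < 0x80:              # MSB clear: last byte of the sequence
            return cp, pos
```

For example, `encode(0x41)` gives `b'\x41'` and `encode(0x80)` gives `b'\x81\x00'`. Note that because `encode` emits the fewest 7-bit groups possible, it naturally produces the minimal form; a strict decoder would additionally have to reject overlong forms like `b'\x80\x41'`.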
The obvious drawback to this approach is that it is inherently serial. You need to read each byte before considering the next, so it would perform worse than UTF-8 in most cases. Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.
It also has the issue that you can represent some codepoints with more than one encoding: eg, put ASCII characters into 2 or 3 bytes. So you would need rules to use the minimal encoding for each codepoint.
As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.
You could also use a fixed-width 24-bit encoding to avoid reading serially, but as computers typically work in powers of 2, you would align 24-bit values at 32-bit addresses and load them into 32-bit+ registers. So there's nothing to really gain in performance over UTF-32, but you could save a bit of space.
In UTF-8, a 3-byte encoding spends 8 bits on the encoding itself -- a full byte's worth -- leaving only 16 bits for the codepoint. If you need higher codepoints you need to use 4 bytes, where 11 bits are the encoding and 21 bits are the codepoint.
So UTF-8 is space efficient for ASCII, but ~1/3 of the bits are used for the encoding for non-ASCII, versus a fixed 1/8 of the bits used for the encoding above for all 1-3 byte sequences. The above has a fixed 12.5% space overhead over raw codepoints; UTF-8 has 12.5% only for ASCII, and ~33% overhead for everything else.
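The overhead arithmetic above (marker bits divided by total bits, per encoded length) can be checked directly:

```python
# Proposed scheme: exactly 1 marker bit (the MSB) per byte.
for n in (1, 2, 3):
    assert n / (8 * n) == 0.125  # always 12.5%

# UTF-8 marker bits by encoded length:
#   1 byte:  0xxxxxxx                 -> 1 marker bit
#   2 bytes: 110xxxxx 10xxxxxx        -> 3 + 2 = 5
#   3 bytes: 1110xxxx + 2 x 10xxxxxx  -> 4 + 4 = 8
#   4 bytes: 11110xxx + 3 x 10xxxxxx  -> 5 + 6 = 11
utf8_marker_bits = {1: 1, 2: 5, 3: 8, 4: 11}
for n, marker in utf8_marker_bits.items():
    print(f"{n} byte(s): {marker / (8 * n):.1%} overhead")
```

This prints 12.5% for 1 byte, and roughly 31%, 33%, and 34% for the 2-, 3-, and 4-byte UTF-8 forms, which is where the "~1/3" figure comes from.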
Although it is not self-synchronizing like UTF-8, you can synchronize a reliable stream by holding a buffer of the previous byte. If the previous byte's value is >= 0x80, the current byte is part of the same character; if it's < 0x80, the current byte is the start of a new character. So substring matching etc. is still possible fairly efficiently, though slightly less so than with UTF-8. That makes it suitable for file storage, but not ideal for transmission.
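That resynchronization rule can be sketched in a few lines (again assuming the 1-3 byte scheme described above; the function name is mine):

```python
def sync(data: bytes, pos: int) -> int:
    """Advance pos to the start of the next character.

    A byte starts a character iff it is the first byte of the
    buffer, or the previous byte has its MSB clear (meaning the
    previous character ended there).
    """
    while 0 < pos < len(data) and data[pos - 1] >= 0x80:
        pos += 1
    return pos

# b'\x41' = 'A', b'\x81\x00' = codepoint 0x80, b'\x42' = 'B'
stream = b'\x41\x81\x00\x42'
print(sync(stream, 2))  # index 2 is mid-character; skips to 3 ('B')
```

The cost is one extra byte of lookbehind per position checked, which is why it's slightly less efficient than UTF-8's scheme, where the current byte alone tells you whether it starts a character.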
That said, most sane protocols will prefix a string with a length (in bytes), so self-synchronization is not always an issue.
Saying a ZWJ sequence can be "up to 5 bytes" is like saying "the current generation of Intel processors run at clock speeds of up to 2 GHz".
If they were referring to ZWJ sequences (I don't think they were; I think they were just misremembering the maximum encoded length of a codepoint) and they had said "up to 35 bytes", then I might agree with you. It's still not technically accurate, but it's a reasonable colloquial usage, like saying "human males can grow up to seven feet tall".