
0 points · chrisseaton · 7y ago · 0 comments

> allows expansion up to 32 bits

It doesn't allow expansion up to 32 bits. How would that even be possible? Some of the bits are needed to signal that a second two-byte code unit follows, so there's a hole in the range and the result is less than 32 bits.


8 comments · 2 top-level

theoh · 7y ago · 5 in thread

"Surrogate pairs" use two code units (32 bits total) to increase the number of code points that can be represented. It seems a lot more complicated than UTF-8. I have no idea how many code points can actually be represented in theory. There are under 21 bits' worth of Unicode codepoints, though, so I'm sure the claim that UTF-16 can represent all of them isn't some kind of trick.

eridius · 7y ago

A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 code unit is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many Unicode code points as there are (0–0x10FFFF).
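The arithmetic is easy to check. This is only a quick sketch; the names `HIGH`, `LOW`, `pairs`, `single_units` and `total` are made up for illustration, and the surrogate ranges are the standard ones (0xD800–0xDBFF and 0xDC00–0xDFFF).

```python
# Standard UTF-16 surrogate ranges (names are illustrative only).
HIGH = range(0xD800, 0xDC00)  # 1024 high (leading) surrogates
LOW = range(0xDC00, 0xE000)   # 1024 low (trailing) surrogates

# Every (high, low) combination is one surrogate pair: 1024 * 1024 = 2^20.
pairs = len(HIGH) * len(LOW)

# A single 16-bit code unit has 2^16 possible values.
single_units = 2 ** 16

total = pairs + single_units
print(hex(pairs), hex(single_units), hex(total))
```

Note that this counts all 2^16 single code unit values, surrogates included, which matches the size of the code point range 0–0x10FFFF.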

Decoding UTF-16 (with surrogate pairs) is pretty similar to decoding UTF-8 except the state machine is simpler (as it's just 1–2 code units per codepoint, instead of 1–3). It also means that if you validate that you have a high surrogate and a low surrogate, the combination is guaranteed to be a valid codepoint (whereas with UTF-8 there are invalid encodings, e.g. any 1-byte codepoint can actually be encoded using 2 or 3 bytes instead, which the state machine must reject).
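To make the comparison concrete, here is a minimal sketch of surrogate-pair encoding and decoding, plus a check that a standard UTF-8 decoder rejects overlong forms. The helper names `encode_utf16`, `decode_pair` and `is_valid_utf8` are hypothetical; the UTF-8 check simply leans on Python's built-in strict decoder rather than a hand-written state machine.

```python
def encode_utf16(cp: int) -> list[int]:
    """Return the UTF-16 code units (one or two) for a single code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                      # BMP: one code unit
    v = cp - 0x10000                     # 20-bit offset into the supplementary range
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def decode_pair(high: int, low: int) -> int:
    """Combine a validated high/low surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_valid_utf8(data: bytes) -> bool:
    """Use Python's strict UTF-8 decoder, which rejects overlong forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print([hex(u) for u in encode_utf16(0x1F600)])
print(is_valid_utf8(b"A"), is_valid_utf8(b"\xc1\x81"), is_valid_utf8(b"\xe0\x81\x81"))
```

Any valid high/low combination decodes to something in 0x10000–0x10FFFF, whereas `b"\xc1\x81"` and `b"\xe0\x81\x81"` are 2-byte and 3-byte overlong encodings of "A" that the decoder has to reject.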

theoh · 7y ago

Wait a minute.

"A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 code unit is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many Unicode code points as there are (0–0x10FFFF)."

This explanation can't be quite right, because the possible surrogate code unit values are carved out of the set of 2^16 code units. So you are double counting those surrogate code units.

From Wikipedia: "Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair."
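The double-counting point can be put in numbers. This is a rough sketch with a hypothetical `classify` helper: it sorts every 16-bit code unit into one of the three disjoint ranges from the quote, then counts what a single unit or a valid pair can actually encode.

```python
from collections import Counter


def classify(unit: int) -> str:
    """Classify a 16-bit code unit into one of the three disjoint ranges."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low"
    return "bmp"


counts = Counter(classify(u) for u in range(0x10000))

# Characters reachable by one non-surrogate unit or by one (high, low) pair.
encodable = counts["bmp"] + counts["high"] * counts["low"]

# Size of the code point range 0..0x10FFFF, surrogate code points included.
code_points = 0x110000

print(dict(counts), encodable, code_points - encodable)
```

Under these definitions the difference is exactly the 2048 surrogate code points: they fall inside the range 0–0x10FFFF but are reserved, so they are not characters that UTF-16 encodes.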


eridius · 7y ago

Erratum: I stated UTF-8 decoding was 1–3 bytes per codepoint. It's 1–4.

theoh · 7y ago

One of us should try to make this clearer on the relevant Wikipedia page.

jcranmer · 7y ago

UTF-16 encodes a wider character as one surrogate pair: two surrogate code units, each holding 10 bits of the codepoint (after 0x10000 is subtracted). Those 20 bits cover 16 supplementary planes of 65,536 codepoints each, and the BMP makes 17. That's why Unicode has 17 planes, which would otherwise seem an odd, non-power-of-two number of codepoints.
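A small sketch of where the 17 comes from, assuming the usual definition of a plane as a block of 2^16 code points. The names `PLANE_SIZE`, `plane_of`, `supplementary_planes` and `total_planes` are illustrative only.

```python
PLANE_SIZE = 1 << 16  # 65,536 code points per plane


def plane_of(cp: int) -> int:
    """Return the plane number (0 = BMP) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    return cp >> 16


# The 20 bits carried by a surrogate pair span 2^20 / 2^16 = 16 planes.
supplementary_planes = (1 << 20) // PLANE_SIZE

# Add the BMP, which is reached with single code units.
total_planes = 1 + supplementary_planes

print(supplementary_planes, total_planes, plane_of(0x1F600), plane_of(0x10FFFF))
```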

gowld · 7y ago · 1 in thread

It means 32-bit representations (compared to UCS-2), not 32 bits' worth of values, so it can represent all of Unicode.

UCS-2 is limited to only 16-bit representation, making it _impossible_ to express Unicode values outside the basic multilingual plane (BMP).

stouset · 7y ago

Correct, this is what I meant. UTF-16 can use up to 32 bits to represent a larger number of characters, but cannot uniquely represent 2^32 possible characters.
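For illustration, a sketch of the difference in Python. Python has no UCS-2 codec, so `encode_ucs2` here is a hypothetical helper that emulates the restriction by refusing anything outside the BMP; `utf16_units` is likewise an illustrative name.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def encode_ucs2(text: str) -> bytes:
    """Emulate UCS-2 (big-endian): exactly one 16-bit unit per character."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside the BMP")
    return text.encode("utf-16-be")


emoji = "\U0001F600"  # a code point outside the BMP

print(utf16_units("A"), utf16_units(emoji))   # one unit vs. two units
print(emoji.encode("utf-16-be").hex())        # the 32-bit surrogate pair
print(encode_ucs2("A").hex())
```

Calling `encode_ucs2(emoji)` raises `ValueError`, while UTF-16 spends 32 bits on the same character.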
