UTF-16: Encodes the entire 21-bit range, encoding most of the first range (0000 to FFFF) as-is, and using surrogate pairs (code units reserved within that range, at D800 to DFFF) to encode 00010000 to 0010FFFF. The latter range is shifted down to 00000000 to 000FFFFF before encoding, so it fits in the 20 bits that a surrogate pair provides. This is a subtlety that one likely does not appreciate if one learns UTF-8 first and expects UTF-16 to work the same way.
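To make the shift-then-split concrete, here's a minimal sketch (the function name is mine, not from any standard library) of how a supplementary code point becomes a surrogate pair:

```python
def to_surrogate_pair(cp):
    # Only code points above the BMP (U+10000..U+10FFFF) use pairs.
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000             # shift into the 20-bit range 00000..FFFFF
    high = 0xD800 | (v >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return high, low

# The very last code point, U+10FFFF, becomes the pair DBFF DFFF:
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

You can cross-check against a real codec: `'\U0010FFFF'.encode('utf-16-be')` gives `b'\xdb\xff\xdf\xff'`.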
UTF-8: Could originally encode 00000000 to 7FFFFFFF, but since Unicode was limited to just the first 17 planes, a lot of UTF-8 codecs in the real world no longer contain the code for handling the longer sequences. Witness things like the UTF-8 codec in MySQL, whose 32-bit support conditional compilation switch is mentioned at https://news.ycombinator.com/item?id=17311048 .
Not exactly. A conforming decoder MUST reject them.
MySQL’s problem is that, by default, it can’t even handle all valid code points.
> I'm not at all convinced that 2^21 codepoints will be enough, so someday it'd be nice to be able to get past UTF-16 and move to UTF-8
UTF-16 currently uses up to 2 16-bit code units per code point, whereas UTF-8 uses up to 4 8-bit code units per code point, and the latter wastes more bits on continuation markers than the former. How is "getting past UTF-16 and moving to UTF-8" supposed to increase the number of code points we can represent, as claimed above? If anything, UTF-16 wastes fewer bits at the current maximum number of code units, so it should have more room for expansion without increasing the number of code units.
And as you can see, if you do work out the bits, you find that cryptonector is wrong: UTF-8 (as it has been standardized since almost the start of the 21st century, and as real-world codecs have implemented it since) encodes no more bits than UTF-16 does. It's 21 bits for both.
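Working out the bits, under the current (RFC 3629) definitions:

```python
# Payload bits per maximal encoded form, at the current U+10FFFF ceiling.
utf8_4byte_bits = 3 + 6 + 6 + 6  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
utf16_pair_bits = 10 + 10        # two surrogates, 10 payload bits each
print(utf8_4byte_bits, utf16_pair_bits)  # 21 20

# A surrogate pair carries only 20 bits, but those bits are offset by
# 0x10000 (the BMP is encoded directly), so both schemes top out at the
# same 21-bit code point:
assert (1 << 20) + 0x10000 - 1 == 0x10FFFF
```

So UTF-16's supplementary range is 20 bits of payload plus the 0x10000 offset, and UTF-8's 4-byte form is 21 bits capped at the same ceiling; neither encodes anything the other can't.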
OTOH, UTF-8, as originally defined, can encode 2³¹ codepoints.