Slicing by byte-offset is pretty unhelpful, given how many Unicode characters occupy more than one byte. In an encoding like UTF-16, that's "all of them" but even in UTF-8 it's still "most of them".
Slicing by UTF-16 code-unit is still pretty unhelpful, since a lot of Unicode characters (such as emoji) do not fit in 16 bits, and are encoded as "surrogate pairs". If you happen to slice a surrogate pair in half, you've made a mess.
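To make the surrogate-pair point concrete, here is a minimal Haskell sketch using only base. The names utf16Units and utf16Length are made up for illustration, and the code does not reject lone surrogates in its input. It shows how a code point above U+FFFF turns into two 16-bit units:

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one code point as UTF-16 code units (hypothetical helper name).
-- Code points below U+10000 fit in one unit; the rest need a surrogate pair.
-- Lone surrogates (U+D800..U+DFFF) are passed through unchecked in this sketch.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- Number of UTF-16 code units in a string of code points.
utf16Length :: String -> Int
utf16Length = length . concatMap utf16Units
```

For U+1F600 this gives the pair 0xD83D, 0xDE00. A slice that keeps only the first unit leaves a lone high surrogate, which is not a valid character on its own.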
Slicing by code-points (the numbers allocated by the Unicode consortium) is better, but not great. A shape like the "é" in "café" could be written as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. Those are separate code-points, but if you slice between them you'll wind up with "cafe" and an isolated acute accent that will stick to whatever it's next to, like this:́
When combining characters stick to a base character, the result is called a "grapheme cluster". Slicing by grapheme clusters is the best option, but it's expensive since you need a bunch of data from the Unicode database to find the edges of each cluster - it depends on the properties assigned to each character.
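A rough Haskell sketch of the difference, using only Data.Char from base. The helper names are hypothetical, and roughClusters only glues combining marks onto the preceding character; real grapheme segmentation (UAX #29) has many more rules and needs data this sketch does not have:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- True for combining marks, which attach to the preceding base character.
isMark :: Char -> Bool
isMark c =
  generalCategory c `elem` [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

-- Group each base character with the marks that follow it.
-- This is a rough approximation, not the UAX #29 algorithm: it ignores
-- ZWJ sequences, regional indicators, Hangul jamo, and so on.
roughClusters :: String -> [String]
roughClusters [] = []
roughClusters (c : cs) = (c : marks) : roughClusters rest
  where
    (marks, rest) = span isMark cs

-- Take the first n rough clusters rather than the first n code points.
takeClusters :: Int -> String -> String
takeClusters n = concat . take n . roughClusters
```

With the decomposed string "cafe\x301", take 4 cuts the accent off, while takeClusters 4 keeps it attached to the e.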
- Are you trying to control the rendered length? In that case the perfect solution is actually rendering the string.
- Are you limiting storage size? Then you need to find a good split point that is <N bytes. This is probably done using extended grapheme clusters. (Although this also isn't perfect)
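For the storage case, one possible shape is below: walk the clusters and stop before the byte budget is exceeded. This is only a sketch under assumptions: truncateUtf8 and its helpers are invented names, the budget is treated as "at most maxBytes", and the clustering just attaches combining marks instead of implementing extended grapheme clusters properly.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Bytes needed to encode one code point in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- True for combining marks, which attach to the preceding base character.
isMark :: Char -> Bool
isMark c =
  generalCategory c `elem` [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

-- Simplified clusters: a base character plus any following marks.
-- This is an assumption standing in for real extended grapheme clusters.
roughClusters :: String -> [String]
roughClusters [] = []
roughClusters (c : cs) = (c : marks) : roughClusters rest
  where
    (marks, rest) = span isMark cs

-- Longest prefix made of whole rough clusters that fits in maxBytes of UTF-8.
truncateUtf8 :: Int -> String -> String
truncateUtf8 maxBytes = go maxBytes . roughClusters
  where
    go _ [] = []
    go budget (cl : cls)
      | size <= budget = cl ++ go (budget - size) cls
      | otherwise      = []
      where
        size = sum (map utf8Width cl)
```

The decomposed "cafe\x301" is 6 bytes in UTF-8, so a budget of 5 yields "caf": the e and its accent stay together and are dropped together.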
I'm sure there are other use cases as well. But at the end of the day try to avoid splitting text if it can be helped.
Though substring(m, n) still makes sense in at least interactive text manipulation: how do you do copy/paste?
This is a good read on aspects of it:
Not quite true, you can get US state flags with this as well.
putStrLn "\x1f3f4\xe0075\xe0073\xe0074\xe0078\xe007f"
The first character is a flag, the last character is a terminator, and in between are the tag characters corresponding to the ASCII for ustx. Just take those characters and subtract 0xe0000 from them, 0x75, 0x73, 0x74, 0x78.
https://en.wikipedia.org/wiki/Tags_(Unicode_block)
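The subtract-0xE0000 trick can be sketched in both directions. tagFlag and untagFlag are hypothetical helper names, and nothing here checks that the region code is one a font actually supports:

```haskell
import Data.Char (chr, ord)

-- Tag characters U+E0020..U+E007E mirror printable ASCII at this offset.
tagOffset :: Int
tagOffset = 0xE0000

-- Build a subdivision flag: black flag, tag characters, cancel tag.
-- The argument is a lowercase region code such as "ustx" (not validated here).
tagFlag :: String -> String
tagFlag code = '\x1F3F4' : map (chr . (+ tagOffset) . ord) code ++ "\xE007F"

-- Recover the ASCII region code from a tag-sequence flag, if it has that shape.
untagFlag :: String -> Maybe String
untagFlag ('\x1F3F4' : rest)
  | not (null tags), end == "\xE007F" =
      Just (map (chr . subtract tagOffset . ord) tags)
  where
    (tags, end) = span isTag rest
    isTag c = ord c >= 0xE0020 && ord c <= 0xE007E
untagFlag _ = Nothing
```

untagFlag applied to the string from the putStrLn line above gives back Just "ustx"; anything that is not a black flag followed by tags and a terminator gives Nothing.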
Edit:
Just for fun:
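import Data.Char (toLower)
import Data.StateCodes (allStates)

-- One tag-sequence flag per US state: black flag, tags for "us" ++ state code, cancel tag.
main :: IO ()
main = putStrLn $ map (map toLower . show . snd) allStates >>=
  \stateCode -> '\x1f3f4' : map (toEnum . (0xe0000+) . fromEnum) ("us" ++ stateCode) ++ "\xe007f"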
The icon shows up on the right side of the URL bar, but you can always force reader view by prepending about:reader?url= to the URL, e.g. about:reader?url=<url>
Another thing worth calling out: you can get involved in emoji creation and Unicode in general. You can do this directly, or by working with groups like Emojination [0].
Some of the suggested emojis are marked as UTC rejected, some as ESC rejected or ESC pushback. Does it mean that both UTC and ESC have to approve each suggested emoji?
And is there a place to see the reason for rejection and a place to see what kind of pushback they are receiving?
> Unicode Emoji Subcommittee:
> The Unicode Emoji Subcommittee is responsible for the following:
> - Updating, revising, and extending emoji documents such as UTS #51: Unicode Emoji and Unicode Emoji Charts.
> - Taking input from various sources and reviewing requests for new emoji characters.
> - Creating proposals for the Unicode Technical Committee regarding additional emoji characters and new emoji-related mechanisms.
> - Investigating longer-term mechanisms for supporting emoji as images (stickers).
From https://unicode.org/emoji/techindex.html
Edit: Welp, the parent comment was asking what "ESC" stands for, but has now been updated, so this comment is now outdated :)
ESC contributes to UTC, along with other groups (e.g. Scripts Ad Hoc Group or IRG) and individuals (you can submit documents to UTC [1]), and technically UTC has the right to reject ESC contributions. In reality, however, ESC handles a huge volume of emoji proposals and distills them down to a packaged submission to UTC, so UTC rarely rejects ESC contributions outright. After all, ESC is a part of UTC, so there is a huge overlap anyway (e.g. Mark Davis is the Unicode Consortium and ESC chair). "UTC rejected" emojis thus generally come from proposals made directly to UTC.
You can see a list of emoji requests [2], but it lacks much information. This lack of transparency in the ESC process is well known and was most directly criticized by contributing experts in 2017 [3]. ESC responded [4] that there are so many flawed proposals (with no regard for the submission criteria [5]) that it is infeasible to document all of them. IMHO it's not a very satisfactory answer, but still understandable.
[1] https://www.unicode.org/L2/
[2] https://www.unicode.org/emoji/emoji-requests.html
[3] https://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pd...
[4] https://www.unicode.org/L2/L2017/17192-response-cmts.pdf
Mainly because skin tone modifiers [1] predate the ZWJ mechanism [2]. For hair colors there were two contending proposals [3] [4], one of which doesn't use ZWJ, and the ZWJ proposal was accepted because new modifiers (as opposed to ZWJ sequences) needed the architectural change [5].
[1] https://www.unicode.org/L2/L2014/14213-skin-tone-mod.pdf
[2] https://www.unicode.org/L2/L2015/15029r-zwj-emoji.pdf
[3] https://www.unicode.org/L2/L2017/17082-natural-hair-color.pd...
[4] https://www.unicode.org/L2/L2017/17193-hair-colour-proposal....
[5] https://www.unicode.org/L2/L2017/17283-response-hair.pdf
> The most popular encoding we use is called Unicode, with the two most popular variations called UTF-8 and UTF-16.
Unicode is a list of codepoints - the characters talked about in the rest of the article. These live in a number space that's very big (U+0000 to U+10FFFF, a little over 2^20 values, so each fits in 21 bits).
You can talk about these codepoints in the abstract as this article does, but at some point you need to put them in a computer: store them on disk or transmit them over a network connection. To do this you need a way to make a stream of bytes store a series of Unicode codepoints. This is an 'encoding'; UTF-8, UTF-16, UTF-32, etc. are different encodings.
UTF-32 is the simplest and most 'obvious' encoding to use. 32 bits is more than enough to represent every codepoint, so just use a 32-bit value to represent each codepoint, and keep them in a big array. This has a lot of value in simplicity, but it means that text ends up taking up a lot of space. Most western text (e.g. this page) fits in the first 128 codepoints (7 bits), and so for the majority of values, most of the bits will be 0.
UTF-16 is an abomination that is largely Microsoft's fault and is the default Unicode encoding on Windows. It is based on the fact that most text in most languages fits in the first 65,536 Unicode codepoints (U+0000 to U+FFFF), referred to as the 'Basic Multilingual Plane'. This means that you can use a 16-bit value to represent most codepoints, so Unicode is stored as an array of 16-bit values ("wide strings" in MS APIs). Obviously not all Unicode values fit, so two 16-bit values (a surrogate pair) can be used together to represent one codepoint. There are many problems with UTF-16, but my favourite is that it really helps you to have 'unicode surprises' in your code. Something in your stack that assumes single-byte characters and barfs on higher Unicode values is a well-known problem, and you find it in testing fairly often. Because UTF-16 uses a single value for the vast majority of normal codepoints, it makes things worse: the failure only happens in a very small number of cases, which you will inevitably only discover in production.
UTF-8 is generally agreed to be the best encoding (particularly among people who don't work for Microsoft). It is a fully variable-length encoding, so a single codepoint can take 1, 2, 3 or 4 bytes. It has lots of nice properties, but one is that codepoints that are <= 127 encode using a single byte. This means that proper ASCII is valid UTF-8.
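As a sketch of how the variable-length scheme works, here is a hand-rolled encoder using only base. utf8Bytes and utf8Encode are illustrative names, this version does not reject surrogate code points, and a real program would normally lean on a library instead:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point as 1 to 4 UTF-8 bytes.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont 0]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont 6, cont 0]
  | otherwise   = [byte (0xF0 .|. (n `shiftR` 18)), cont 12, cont 6, cont 0]
  where
    n = ord c
    byte :: Int -> Word8
    byte = fromIntegral
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont :: Int -> Word8
    cont s = byte (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Encode a whole string of code points.
utf8Encode :: String -> [Word8]
utf8Encode = concatMap utf8Bytes
```

Anything <= 127 comes out as the same single byte it has in ASCII, while U+00E9 becomes C3 A9 and U+1F600 becomes F0 9F 98 80.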
https://www.youtube.com/watch?v=mhvaeHoIE24
"Smiling Cat Face With Heart Eyes Emoji" plays a major role. :)
It doesn't cover the same ground as this wonderful post with its study of variation selectors and skin-tone modifiers, but it provides the prerequisites leading up to it.
> UTF-16 is an abomination that is largely Microsoft's fault
I think that's unfair. The problem lies more in the conceptualization of "Unicode" in the late 1980s as a two-byte fixed-width encoding whose 65k-sized code space would be enough for the characters of all the world's living languages. (I cover that here: https://www.youtube.com/watch?v=mhvaeHoIE24&t=7m10s ) It turns out that we needed more space, and if Asian countries had had more say from the start, it would have been obvious earlier that a problem existed.
> I think that's unfair.
Fair enough. It was a moderately 'emotional' response caused by some painful history of issues caused by 2-byte assumptions.
The problem, I suppose, is that MS actually moved to Unicode earlier than most of the industry (to their credit), and therefore played guinea pig in discovering what works and what doesn't. My complaint now is that I feel they should start a migration to UTF-8 (yes, I know how challenging that would be).
Well this nerd-sniped me pretty hard
https://next.observablehq.com/@jobleonard/which-unicode-flag...
That was a fun little exercise, but enough time wasted, back to work.
According to the account I've heard, it's the Greeks who invented the alphabet, by accident. The Phoenician script used single symbols to represent consonants, including the glottal stop (and some pharyngeal consonant that would likely be subject to a similar process, iirc). The glottal stop was represented by aleph, and because Greek didn't have contrastive glottal stops in its phoneme inventory, Greeks just interpreted the vowel that followed it as what the symbol was meant to represent.
It's a bit of a just so story, but also completely plausible.
> “Ü” is a single grapheme cluster, even though it’s composed of two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING DIAERESIS.
This would be a great opportunity to talk about normalization forms, because there's also a single code point version: U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS.
I may be wrong however.
Why would 2^21 not be a multiple of 2^3?
17 x 2^16 = 17 x 2^13 x 2^3
(reposted/edited because * was interpreted as formatting)
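The arithmetic can be checked mechanically. A tiny sketch, relying on the fact that Haskell's Char covers exactly U+0000 to U+10FFFF (the names are just for illustration):

```haskell
-- Number of code points in the Unicode code space, U+0000..U+10FFFF.
codeSpaceSize :: Int
codeSpaceSize = fromEnum (maxBound :: Char) + 1

-- Number of planes, each holding 2^16 code points.
planes :: Int
planes = codeSpaceSize `div` (2 ^ (16 :: Int))

-- True when the code space size is a multiple of 2^3.
divisibleByEight :: Bool
divisibleByEight = codeSpaceSize `mod` 8 == 0
```

codeSpaceSize is 1114112, which is 17 x 2^16 and therefore also 17 x 2^13 x 2^3. It is a multiple of 8, and it is smaller than 2^21.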
And how do we as a community propose new icons while considering others to be removed/replaced?
Back in 2015, Instagram did a blog post on similar challenges they came across implementing emoji hashtags [1]. Spoiler alert: they programmatically constructed a huge regex to detect them.
[1] https://instagram-engineering.com/emojineering-part-ii-imple...
Big thank you to the OP.
Unicode is a character set, not an encoding. UTF-8, UTF-16, etc. are encodings of that character set.
Regarding Windows and flags, I heard it was a geopolitical issue. Basically, to support flag emoji you’d have to decide whether or not to recognize some states (e.g. Taiwan) which can anger other states. Not sure if that’s the real reason or not.
A couple questions I still have:
1. Why make flags multiple code points when there’s plenty of unused address space to assign a single code point?
2. Any entertaining backstories regarding platform specific non-standard emoji, such as Windows ninja cat (https://emojipedia.org/ninja-cat/)? Why would they use those code points rather than ?
3. Is it possible to modify Windows to render emoji using Apple’s font (or a modified Segoe that looks like Apple’s)?
4. Which emoji look the most different depending on platform? Are there any that cause miscommunication?
5. Do any glyphs render differently based on background color, e.g. dark mode?
I have one nit about an omission: in addition to the emoji presentation selector, FE0F, which forces "presentation as emoji", there's also the text presentation selector, FE0E, which does the opposite [1].
The Emoji_Presentation property [2] determines when either is required; code points with both an emoji and a text presentation and the property set to "Yes" default to emoji presentation without a selector and require FE0E for text presentation; code points with the property set to "No" default to text presentation and require FE0F for emoji presentation.
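In code, applying a selector is just appending one code point. A minimal sketch with hypothetical names; it assumes the base character is one of those with both presentations (for example U+2764 HEAVY BLACK HEART, which defaults to text), since the selectors are only defined for those characters:

```haskell
-- Variation selectors used for emoji presentation (UTS #51).
textSelector, emojiSelector :: Char
textSelector  = '\xFE0E'  -- VS15: request text presentation
emojiSelector = '\xFE0F'  -- VS16: request emoji presentation

data Presentation = TextStyle | EmojiStyle

-- Append the selector that requests the given presentation.
-- No check is made that the character actually has two presentations.
withPresentation :: Presentation -> Char -> String
withPresentation TextStyle  c = [c, textSelector]
withPresentation EmojiStyle c = [c, emojiSelector]

-- Drop any presentation selectors, leaving the default style in effect.
stripSelectors :: String -> String
stripSelectors = filter (`notElem` [textSelector, emojiSelector])
```

withPresentation EmojiStyle '\x2764' yields the two code points U+2764 U+FE0F; whether the result is actually drawn as emoji still depends on the font and renderer.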
There's a list [3] with all emoji that have two presentations, and the first three rows of the Default Style Values table [4] show which emoji default to which style.
[1]: https://unicode.org/reports/tr51/#Emoji_Variation_Sequences
[2]: http://unicode.org/reports/tr51/#Emoji_Properties_and_Data_F...