Emoji Under the Hood

(tonsky.me)

430 points · kogir · 5y ago · 89 comments

70 comments · 28 top-level

mojuba5y ago· 8 in thread

Can someone explain what the rules are for substring(m, n), given all the madness that's today's Unicode? Is it standardized, or is it up to the implementations?

thristian5y ago

It depends what your string is a string of.

Slicing by byte-offset is pretty unhelpful, given how many Unicode characters occupy more than one byte. In an encoding like UTF-16, that's "all of them" but even in UTF-8 it's still "most of them".

Slicing by UTF-16 code-unit is still pretty unhelpful, since a lot of Unicode characters (such as emoji) do not fit in 16 bits, and are encoded as "surrogate pairs". If you happen to slice a surrogate pair in half, you've made a mess.

Slicing by code-points (the numbers allocated by the Unicode consortium) is better, but not great. A shape like the "é" in "café" could be written as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. Those are separate code-points, but if you slice between them you'll wind up with "cafe" and an isolated acute accent that will stick to whatever it's next to, like this:́

When combining characters stick to a base character, the result is called a "grapheme cluster". Slicing by grapheme clusters is the best option, but it's expensive since you need a bunch of data from the Unicode database to find the edges of each cluster - it depends on the properties assigned to each character.
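To make the difference between these units concrete, here is a minimal Python sketch. The helper name `lengths` is made up for illustration, and the sketch deliberately stops short of grapheme-cluster segmentation, which needs the Unicode segmentation data described above.

```python
import unicodedata

# "café" with a decomposed é: U+0065 followed by U+0301 COMBINING ACUTE ACCENT.
cafe = "cafe\u0301"
# One emoji outside the Basic Multilingual Plane (U+1F600 GRINNING FACE).
grin = "\U0001F600"


def lengths(s: str) -> dict:
    """Count the same text in three different units (hypothetical helper)."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),  # Python's str is indexed by code point
    }


cafe_lengths = lengths(cafe)
grin_lengths = lengths(grin)

# Slicing by code point separates the base letter from its accent.
head, tail = cafe[:4], cafe[4:]
tail_name = unicodedata.name(tail)
```

Here `head` is plain "cafe" and `tail` is the orphaned combining accent, which is exactly the mess described above. The emoji is one code point but two UTF-16 code units, so a UTF-16 slice can cut it in half.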

andreareina5y ago

Doesn't splitting by grapheme cluster also depend on which version of the Unicode standard you use, since new versions come with new combinations?

kevincox5y ago

The standard answer is "don't". Just treat text as a blob. But the other question is: what are you trying to accomplish?

- Are you trying to control the rendered length? In that case the perfect solution is actually rendering the string.

- Are you limiting storage size? Then you need to find a good split point that is <N bytes. This is probably done using extended grapheme clusters. (Although this also isn't perfect)

I'm sure there are other use cases as well. But at the end of the day try to avoid splitting text if it can be helped.
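For the storage-size case, a rough Python sketch of a byte-limited cut might look like the following. The function name `truncate_utf8` is hypothetical, and it treats the limit as "at most N bytes". It only guarantees that no code point is split; as noted above, a proper solution would also look for a grapheme-cluster boundary, which this sketch does not attempt.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 form fits in max_bytes.

    Cuts only on code point boundaries; it can still split a grapheme cluster.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end in the middle of a multi-byte sequence;
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

For example, cutting "héllo" at 2 bytes yields "h" rather than a broken "é", while cutting "cafe" plus a combining accent at 4 bytes silently drops the accent, which is the imperfection mentioned above.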

_ZeD_5y ago

I think the only reasonable rule for substring(m, n) is "don't"

mojuba5y ago

So string is no longer a "string of characters", it is in fact a program (not Turing complete) that you need to execute.

Though substring(m, n) still makes sense in at least interactive text manipulation: how do you do copy/paste?


EMM_386 5y ago

It is up to the implementation.

This is a good read on aspects of it:

https://hsivonen.fi/string-length/

RedNifre5y ago

Maybe have m,n refer to grapheme clusters instead of bytes/code points?

mojuba5y ago

Apparently it's what Swift does when you try to get the length of a string. Though there's no more plain substring() since Swift 5, it was removed to indicate it's no longer O(1). You will get different results across languages though.

mannerheim5y ago· 5 in thread

> Currently they are used for these three flags only: England, Scotland and Wales:

Not quite true, you can get US state flags with this as well.

Sniffnoy5y ago

This may be supported in some implementations, but currently only England, Scotland, and Wales are officially in the Unicode data files and recommended for general interchange. You can see that they're the only examples of RGI_Emoji_Tag_Sequence listed here: https://www.unicode.org/Public/emoji/13.1/emoji-sequences.tx...

TheRealSteel5y ago

Does this have anything to do with why Google Keyboard/Gboard doesn't have the Scottish flag? It's by far my most used emoji and my keyboard not having it drives me nuts.

scatters5y ago

Why not switch to a keyboard that does have it?

petepete5y ago

I've never seen them used. Have they actually been implemented by any of the creators?

mannerheim5y ago

If I type the following into ghci, I get the state flag of Texas:

putStrLn "\x1f3f4\xe0075\xe0073\xe0074\xe0078\xe007f"

The first character is a flag, the last character is a terminator, and in between are the tag characters corresponding to the ASCII for ustx. Just take those characters and subtract 0xe0000 from them, 0x75, 0x73, 0x74, 0x78.

https://en.wikipedia.org/wiki/Tags_(Unicode_block)

Edit:

Just for fun:

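import Data.StateCodes
import Data.Char

-- Builds one tag-sequence flag per state code; >>= on lists concatenates them into a single String.
putStrLn $ map (map toLower . show . snd) allStates >>=
  \stateCode -> '\x1f3f4':map (toEnum . (0xe0000+) . fromEnum) ("us" ++ stateCode) ++ "\xe007f"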
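The same construction can be sketched in Python. The names `tag_flag` and `untag` are made up for illustration; whether a given sequence actually renders as a flag depends on the platform's emoji font, and only the England, Scotland and Wales sequences are discussed in this thread as officially recommended.

```python
import string

BLACK_FLAG = "\U0001F3F4"  # U+1F3F4 WAVING BLACK FLAG, the base character
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG, the terminator
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at this offset
ALLOWED = set(string.ascii_lowercase + string.digits)


def tag_flag(region: str) -> str:
    """Build an emoji tag sequence for a region code such as 'ustx' or 'gbsct'."""
    if not region or not set(region) <= ALLOWED:
        raise ValueError("expected lowercase ASCII letters and digits")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in region)
    return BLACK_FLAG + tags + CANCEL_TAG


def untag(sequence: str) -> str:
    """Recover the ASCII code by subtracting 0xE0000 from each tag character."""
    return "".join(chr(ord(ch) - TAG_OFFSET) for ch in sequence[1:-1])


texas = tag_flag("ustx")
```

`texas` is the same six code points as the ghci string above: the flag, four tag characters, and the terminator.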


BlueGh0st5y ago· 4 in thread

I wish I could read this without getting a migraine. The "darkmode" joke was funny until I realized there was no actual way to turn it on.

jffry5y ago

Firefox's reader mode works great and includes a dark theme.

The icon shows up on the right side of the URL bar, but you can always force it by prepending about:reader?url= to the URL, e.g. about:reader?url=<url>

tobz1000 5y ago

https://darkreader.org/

sundarurfriend5y ago

I just turned this off today, after one too many "an extension is slowing this page down" warnings from Firefox, always from Dark Reader. It's a pretty useful addon, but there's enough websites that implement their own dark mode that it's less necessary these days (I hope), and possibly making it not worth the slowdown.

vlmutolo5y ago

I always edit the CSS style when this site comes up

MrGilbert5y ago· 4 in thread

Reading about the 2 million codepoints: Is there a good set of open-source licensed fonts which cover as many codepoints as possible? Just curiosity, no real usecase at the moment. I don't think it would make sense to create one huge font for this, right?

pta2002 5y ago

Google's Noto Fonts[1] attempt to cover all of Unicode and are released under the SIL Open Font License.

[1] https://www.google.com/get/noto/

MrGilbert5y ago

That looks incredibly complete, thank you!

dan-robertson5y ago

There’s a project called, I think, gnufont but their font is a bitmap font...

MrGilbert5y ago

Ah, thank you! Searching for "gnufont" brought me to[1], which looks pretty nice indeed.

[1]: https://www.gnu.org/software/freefont/


devadvance5y ago· 3 in thread

Fantastic post that builds up knowledge along the way. A fun case where this type of knowledge was relevant: when creating emoji short links with a couple characters (symbols), I made sure to snag both URLs: one with the emoji (codepoint + `U+FE0F`) and one with just the symbol codepoint.

Another thing worth calling out: you can get involved in emoji creation and Unicode in general. You can do this directly, or by working with groups like Emojination [0].

[0] http://www.emojination.org/

codetrotter5y ago

The emojination website mentions UTC and ESC. UTC in this context certainly means Unicode Technical Committee. And after a bit of Googling it seems that ESC is the Unicode Emoji Subcommittee.

Some of the suggested emojis are marked as UTC rejected, some as ESC rejected or ESC pushback. Does it mean that both UTC and ESC have to approve each suggested emoji?

And is there a place to see the reason for rejection and a place to see what kind of pushback they are receiving?

dgellow5y ago

It's for "Emoji SubCommittee" (aka ESC).

> Unicode Emoji Subcommittee:

> The Unicode Emoji Subcommittee is responsible for the following:

> - Updating, revising, and extending emoji documents such as UTS #51: Unicode Emoji and Unicode Emoji Charts.

> - Taking input from various sources and reviewing requests for new emoji characters.

> - Creating proposals for the Unicode Technical Committee regarding additional emoji characters and new emoji-related mechanisms.

> - Investigating longer-term mechanisms for supporting emoji as images (stickers).

From https://unicode.org/emoji/techindex.html

Edit: Welp, the parent comment was asking what "ESC" stands for, but has now been updated, so this comment is now outdated :)


lifthrasiir5y ago

It's complicated. So this mainly boils down to the relationship between UTC and ESC.

ESC contributes to UTC, along with other groups (e.g. Scripts Ad Hoc Group or IRG) or other individuals (you can submit documents to UTC [1]), and technically UTC has a right to reject ESC contributions. In reality however ESC manages a huge volume of emoji proposals to UTC and distills them down to a packaged submission, so UTC rarely outright rejects ESC contributions. After all ESC is a part of UTC so there is a huge overlap anyway (e.g. Mark Davis is the Unicode Consortium and ESC chair). "UTC rejected" emojis thus generally come from the direct proposal to UTC.

You can see a list of emoji requests [2] but it lacks much information. This lack of transparency in the ESC process is well known and was most directly criticized by contributing experts in 2017 [3]. ESC responded [4] that there are so many flawed proposals (with no regard for the submission criteria [5]) that it is infeasible to document all of them. IMHO it's not a very satisfactory answer, but still understandable.

[1] https://www.unicode.org/L2/

[2] https://www.unicode.org/emoji/emoji-requests.html

[3] https://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pd...

[4] https://www.unicode.org/L2/L2017/17192-response-cmts.pdf

[5] https://www.unicode.org/emoji/proposals.html

lifthrasiir5y ago· 2 in thread

> One weird inconsistency I’ve noticed is that hair color is done via ZWJ, while skin tone is just modifier emoji with no joiner. Why? Seriously, I am asking you: why? I have no clue.

Mainly because skin tone modifiers [1] predate the ZWJ mechanism [2]. For hair colors there were two contending proposals [3] [4], one of which doesn't use ZWJ, and the ZWJ proposal was accepted because new modifiers (as opposed to ZWJ sequences) needed the architectural change [5].

[1] https://www.unicode.org/L2/L2014/14213-skin-tone-mod.pdf

[2] https://www.unicode.org/L2/L2015/15029r-zwj-emoji.pdf

[3] https://www.unicode.org/L2/L2017/17082-natural-hair-color.pd...

[4] https://www.unicode.org/L2/L2017/17193-hair-colour-proposal....

[5] https://www.unicode.org/L2/L2017/17283-response-hair.pdf
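The two mechanisms look like this at the code point level. This is a small Python sketch; the helper name `code_points` is made up, and the specific emoji chosen are just examples.

```python
ZWJ = "\u200d"                   # U+200D ZERO WIDTH JOINER

WAVING_HAND = "\U0001F44B"       # U+1F44B
MEDIUM_SKIN_TONE = "\U0001F3FD"  # U+1F3FD, one of the five skin tone modifiers
MAN = "\U0001F468"               # U+1F468
RED_HAIR = "\U0001F9B0"          # U+1F9B0, red hair component

# Skin tone: the modifier directly follows the base emoji, no joiner.
waving_medium = WAVING_HAND + MEDIUM_SKIN_TONE

# Hair color: base emoji, then ZWJ, then the hair component.
man_red_hair = MAN + ZWJ + RED_HAIR


def code_points(s: str) -> list:
    """List the code points of `s` in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]
```

So the skin-toned hand is two code points, while the red-haired man is three because of the joiner in the middle.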

kevincox5y ago

Randall Munroe was also wondering why most of the emoji aren't just modifiers: https://xkcd.com/1813/

vanderZwan5y ago

I wonder how many years it'll take for someone to train a neural network to generate emojis for all possible modifiers, regardless of whether they're currently real combinations.


rkangel5y ago· 2 in thread

The article is great, but there is one slightly misleading bit at the start:

> The most popular encoding we use is called Unicode, with the two most popular variations called UTF-8 and UTF-16.

Unicode is a list of codepoints - the characters talked about in the rest of the article. These live in a number space that's very big (~2^21 as discussed).

You can talk about these codepoints in the abstract as this article does, but at some point you need to put them in a computer - store them on disk or transmit them over a network connection. To do this you need a way to make a stream of bytes store a series of Unicode codepoints. This is an 'encoding'; UTF-8, UTF-16, UTF-32 etc. are different encodings.

UTF-32 is the simplest and most 'obvious' encoding to use. 32 bits is more than enough to represent every codepoint, so just use a 32-bit value to represent each codepoint, and keep them in a big array. This has a lot of value in simplicity, but it means that text ends up taking up a lot of space. Most western text (e.g. this page) fits in the first 128 codepoints (7 bits), so for the majority of values most of the bits will be 0.

UTF-16 is an abomination that is largely Microsoft's fault and is the default Unicode encoding on Windows. It is based on the fact that most text in most languages fits in the first 65,536 Unicode codepoints, referred to as the 'Basic Multilingual Plane'. This means that you can use a 16-bit value to represent most codepoints, so Unicode is stored as an array of 16-bit values ("wide strings" in MS APIs). Obviously not all Unicode values fit, so there is the capability to use two 16-bit values (a surrogate pair) to represent a codepoint. There are many problems with UTF-16, but my favourite is that it really helps you to have 'Unicode surprises' in your code. Something in your stack that assumes single-byte characters and barfs on higher Unicode values is a well-known problem, and you find it in testing fairly often. Because UTF-16 uses a single value for the vast majority of normal codepoints, it makes that worse: the failure only happens in a very small number of cases that you will inevitably only discover in production.

UTF-8 is generally agreed to be the best encoding (particularly among people who don't work for Microsoft). It is a full variable-length encoding, so a single codepoint can take 1, 2, 3 or 4 bytes. It has lots of nice properties, but one is that codepoints that are <= 127 encode using a single byte. This means that proper ASCII is valid UTF-8.
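A short Python sketch shows how the three encodings trade off in practice. The helper names `encoded_sizes` and `surrogate_pair` are made up for illustration, and the little-endian codecs are used only to avoid counting a byte-order mark.

```python
def encoded_sizes(ch: str) -> tuple:
    """Bytes needed for `ch` in UTF-8, UTF-16 and UTF-32 (no byte-order mark)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a pair")
    offset = code_point - 0x10000
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))


sizes = {
    "U+0041": encoded_sizes("\u0041"),       # ASCII letter A
    "U+00E9": encoded_sizes("\u00e9"),       # é
    "U+20AC": encoded_sizes("\u20ac"),       # euro sign, still in the BMP
    "U+1F600": encoded_sizes("\U0001F600"),  # emoji, outside the BMP
}

# Plain ASCII bytes decode unchanged as UTF-8.
ascii_roundtrip = b"hello".decode("utf-8")
```

The emoji row is the interesting one: it is the only sample that needs two 16-bit units in UTF-16, which is why code that assumes one unit per character tends to pass tests on ordinary text and fail on emoji.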

rectang5y ago

For people who want to hear more on this subject I gave a talk for Papers We Love Seattle on UTF-8, its origins and evolution, and how it compares against other encodings:

https://www.youtube.com/watch?v=mhvaeHoIE24

"Smiling Cat Face With Heart Eyes Emoji" plays a major role. :)

It doesn't cover the same ground as this wonderful post with its study of variation selectors and skin-tone modifiers, but it provides the prerequisites leading up to it.

> UTF-16 is an abomination that is largely Microsoft's fault

I think that's unfair. The problem lies more in the conceptualization of "Unicode" in the late 1980s as a two-byte fixed-width encoding whose 65k-sized code space would be enough for the characters of all the world's living languages. (I cover that here: https://www.youtube.com/watch?v=mhvaeHoIE24&t=7m10s ) It turns out that we needed more space, and if Asian countries had had more say from the start, it would have been obvious earlier that a problem existed.

rkangel5y ago

>> UTF-16 is an abomination that is largely Microsoft's fault

> I think that's unfair.

Fair enough. It was a moderately 'emotional' response caused by some painful history of issues caused by 2-byte assumptions.

The problem, I suppose, is that MS actually moved to Unicode earlier than most of the industry (to their credit), and therefore played guinea pig in discovering what works and what doesn't. My complaint now is that I feel they should start a migration to UTF-8 (yes, I know how challenging that would be).

vanderZwan5y ago· 2 in thread

> Flags don’t have dedicated codepoints. Instead, they are two-letter ligatures. (...) There are 258 valid two-letter combinations. Can you find them all?

Well this nerd-sniped me pretty hard

https://next.observablehq.com/@jobleonard/which-unicode-flag...

That was a fun little exercise, but enough time wasted, back to work.

mercer5y ago

Haha, playing around with reversing flags was the first thing I thought about trying.

vanderZwan5y ago

The surprising result (to me at least) was that out of 270 valid letter combinations, 105 can be reversed. The odd number is easy to explain: letter pairs like MM => MM can add a single flag instead of a pair of two flags, but the fact that almost two out of every five flags are reversible feels pretty high to me.
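For anyone who wants to try the reversing game without the notebook, the letter-to-flag mapping can be sketched in a few lines of Python. The function names are made up, and the sketch only builds the code point pair: it does not know which pairs are valid flags, and whether a pair renders as a flag depends on the font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def flag(code: str) -> str:
    """Map two ASCII letters such as 'US' to a pair of regional indicators."""
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code.upper()
    )


def letters(flag_text: str) -> str:
    """Inverse of flag(): recover the two ASCII letters."""
    return "".join(
        chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag_text
    )


def reversed_flag(code: str) -> str:
    """Swap the two letters, e.g. 'US' becomes the sequence for 'SU'."""
    return flag(code[::-1])
```

A pair like MM maps to itself under `reversed_flag`, which is the source of the odd count mentioned above.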


truefossil5y ago· 2 in thread

I wonder why Mediterranean nations switched from ideograms to alphabet as soon as one was invented. Probably they did not have enough surplus grain to feed something like the Unicode consortium?

meepmorp5y ago

Hieroglyphics weren't really ideographic after a very early point, because it's a pain in the ass making up new symbols for every word. Very quickly, it transitioned to being largely an abjad, representing only consonants. Abjads work reasonably well for Semitic languages, as the consonantal roots of words carry the meaning and a reader would be able to fill in the vowels themselves via context.

According to the account I've heard, it's the Greeks who invented the alphabet, by accident. The Phoenician script used single symbols to represent consonants, including the glottal stop (and some pharyngeal consonant that would likely be subject to a similar process, iirc). The glottal stop was represented by aleph, and because Greek didn't have contrastive glottal stops in its phoneme inventory, Greeks just interpreted the vowel that followed it as what the symbol was meant to represent.

It's a bit of a just so story, but also completely plausible.

kps5y ago

An alphabet (or syllabary, abjad, abugida) has a small set of symbols that can express anything, which means that it could be used by people who did something other than read and write for a living. Probably no accident that the first to catch on, and the root of possibly all others, was spread by Phoenician traders.

peteretep5y ago· 2 in thread

An excellent article, although:

> “Ü” is a single grapheme cluster, even though it’s composed of two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING DIAERESIS.

would be a great opportunity to talk about normalization forms, because there’s also a single code point version: U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS.
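A quick Python sketch with the standard `unicodedata` module shows the two spellings and how normalization converts between them. The variable names are just for illustration.

```python
import unicodedata

decomposed = "U\u0308"   # U+0055 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00dc"   # the single code point version

# The two strings render the same but do not compare equal as-is.
equal_raw = decomposed == precomposed

# NFC composes where a precomposed code point exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

precomposed_name = unicodedata.name(precomposed)
```

After normalizing both sides to the same form, comparisons behave the way a reader would expect, which is why normalizing before comparing or hashing user-supplied text is a common precaution.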

colejohnson66 5y ago

Does anyone know the history behind why there’s two ways to “encode” things like that? What’s the rationale for having both combining and precombined codepoints?

bombcar5y ago

I believe a lot of the "combined" characters are (basically) from importing old codepages directly into Unicode, and they did that so it would be a simple formula to convert from the various codepages in use.

I may be wrong however.

woko5y ago· 2 in thread

> Unicode allocates 2²¹ (~2 mil) characters called codepoints. Sorry, programmers, but it’s not a multiply of 8 .

Why would 2^21 not be a multiple of 2^3?

RedNifre5y ago

It's a typo, they meant ~2²¹ instead of 2²¹, because it's 17*2^16, which is more like ~2^20.087. (And that's not even true either, since a couple values like FFFF are forbidden)

howtodowtle5y ago

Of course, 17 x 2^16 is also a multiple of 2^3:

17 x 2^16 = 17 x 2^13 x 2^3

(reposted/edited because * was interpreted as formatting)
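The arithmetic in this sub-thread is easy to check. A minimal Python sketch, with constant names chosen only for illustration:

```python
import math

PLANES = 17                      # planes 0 through 16
CODE_POINTS_PER_PLANE = 2 ** 16  # 65,536 code points per plane

total = PLANES * CODE_POINTS_PER_PLANE  # U+0000 through U+10FFFF inclusive
bits = math.log2(total)                 # roughly 20.087

divisible_by_8 = total % 8 == 0
fits_in_21_bits = total <= 2 ** 21
```

So the code space is 1,114,112 values, which is a multiple of 8, is not a power of two, and needs 21 bits to index.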


artur_makly5y ago· 2 in thread

What I really want to know is the story behind how these emoji came to be. Who was tasked with coming up with this sticker list of symbols? What was the decision/strategy behind the selection of these symbols? etc. It seems soooo arbitrary at first glance.

And how do we as a community propose new icons while considering others to be removed/replaced?

rynt5y ago

99PI did a story on the process of submitting a new emoji request to the Unicode Consortium that you might find interesting: https://99percentinvisible.org/episode/person-lotus-position...

artur_makly5y ago

brilliant. thank you 🙏

aglionby5y ago· 1 in thread

Great post, entertainingly written.

Back in 2015, Instagram did a blog post on similar challenges they came across implementing emoji hashtags [1]. Spoiler alert: they programmatically constructed a huge regex to detect them.

[1] https://instagram-engineering.com/emojineering-part-ii-imple...

lifthrasiir5y ago

Nowadays you can refer to UAX #31 for hashtag identifiers (first specified in 2016): https://www.unicode.org/reports/tr31/#hashtag_identifiers

ijidak5y ago· 1 in thread

This is eye-opening. So many frustrations I've had with emoji over the years are explained by this post.

Big thank you to the OP.

chronogram5y ago

What kind of frustrations?

kaeruct5y ago· 1 in thread

I'm confused about the part saying flags don't work on Windows because I can see them on Firefox (on Windows). They don't work on Edge though.

tonsky5y ago

I guess FF ships its own version

imtiyaz5y ago· 1 in thread

Never gone to these nitty gritties. Very well explained. Thanks Nikita.

tonsky5y ago

You are welcome! Glad you liked it

breck5y ago

I thought I knew Emoji, but there was a lot I didn’t know. Thank you, a very enjoyable and enlightening read. Also, “dingbats”! I've rarely seen that word since I was a kid (when I had no idea what that voodoo was but loved it).

Hawzen5y ago

> The most popular encoding we use is called Unicode

Unicode is a character set, not an encoding. UTF-8, UTF-16, etc. are encodings of that character set.

Robizzle01 5y ago

Love it, thanks for writing it up.

Regarding Windows and flags, I heard it was a geopolitical issue. Basically, to support flag emoji you’d have to decide whether or not to recognize some states (e.g. Taiwan) which can anger other states. Not sure if that’s the real reason or not.

A couple questions I still have:

1. Why make flags multiple code points when there’s plenty of unused address space to assign a single code point?

2. Any entertaining backstories regarding platform-specific non-standard emoji, such as Windows ninja cat (https://emojipedia.org/ninja-cat/)? Why would they use those code points rather than ?

3. Is it possible to modify Windows to render emoji using Apple’s font (or a modified Segoe that looks like Apple’s)?

4. Which emoji look the most different depending on platform? Are there any that cause miscommunication?

5. Do any glyphs render differently based on background color, e.g. dark mode?

bewuethr5y ago

This is a really nice overview!

I have one nit about an omission: in addition to the emoji presentation selector, FE0F, which forces "presentation as emoji", there's also the text presentation selector, FE0E, which does the opposite [1].

The Emoji_Presentation property [2] determines when either is required; code points with both an emoji and a text presentation and the property set to "Yes" default to emoji presentation without a selector and require FE0E for text presentation; code points with the property set to "No" default to text presentation and require FE0F for emoji presentation.

There's a list [3] with all emoji that have two presentations, and the first three rows of the Default Style Values table [4] show which emoji default to which style.

[1]: https://unicode.org/reports/tr51/#Emoji_Variation_Sequences

[2]: http://unicode.org/reports/tr51/#Emoji_Properties_and_Data_F...

[3]: https://unicode.org/emoji/charts/emoji-variants.html

[4]: https://unicode.org/emoji/charts-13.0/emoji-style.html
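In code, the two selectors are just extra code points appended to the base character. A small Python sketch using U+2699 GEAR as the example; the helper name `strip_presentation` is made up, and how each sequence is actually drawn is up to the renderer and font.

```python
import unicodedata

GEAR = "\u2699"  # U+2699 GEAR, which has both a text and an emoji presentation
VS15 = "\ufe0e"  # U+FE0E, requests text presentation
VS16 = "\ufe0f"  # U+FE0F, requests emoji presentation

gear_text = GEAR + VS15
gear_emoji = GEAR + VS16


def strip_presentation(s: str) -> str:
    """Remove both presentation selectors (hypothetical helper for comparisons)."""
    return s.replace(VS15, "").replace(VS16, "")


selector_names = (unicodedata.name(VS15), unicodedata.name(VS16))
```

All three strings (`GEAR`, `gear_text`, `gear_emoji`) are different as far as string comparison is concerned, so anything that matches or looks up emoji by exact string may need to strip or normalize the selectors first.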

yuntei5y ago

And now, to see how emoji rendering is completely broken: put a gear (U+2699) in both its text variant and its emoji variant in some HTML, set the font to Menlo in one element and Monaco in another, and view it in Chrome, desktop Safari, and iOS Safari. Also select and right-click it in Chrome, and maybe paste it into the comment section of various websites. Every single combination of text variant and emoji variant will be displayed in complete randomness :)

tomduncalf5y ago

Really interesting and well written (and entertaining!) post. I was vaguely aware of most of it but hadn’t appreciated how the ZWJ system for more complex emojis made up of basic ones means the meaning can be discerned even if your device doesn’t support the new emoji. Clever approach!

zimpenfish5y ago

Apropos of nothing, macOS 11.3 beta and iOS 14.5 do support the combined emojis near the bottom - instead of <heart><fire>, I actually get <flaming heart> as expected.

z3t4 5y ago

Related: implementing Emoji support in a text editor: https://xn--zta-qla.com/en/blog/editor10.htm

avipars5y ago

Really interesting article, why haven't platforms banned Ų̷̡̡̨̫͍̟̯̣͎͓̘̱̖̱̣͈͍̫͖̮̫̹̟̣͉̦̬̬͈͈͔͙͕̩̬̐̏̌̉́̾͑̒͌͊͗́̾̈̈́̆̅̉͌̋̇͆̚̚̚͠ͅ or figured out a way to parse/contain the character to it's container?

itsmeamario5y ago

Great quality post. I'd like to see more things like this on HN. Interesting and I learnt a lot about emojis and UTF.

mshenfield5y ago

It's a post about emojis, but I feel like I understand Unicode better now?

remux5y ago

Great post!

j / k navigate · click thread line to collapse

89 comments

70 comments · 28 top-level

mojuba5y ago· 8 in thread

Can someone explain, what are the rules for substring(m, n) given all the madness that's today's Unicode? Is it standardized or it's up to the implementations?

thristian5y ago

It depends what your string is a string of.

Slicing by byte-offset is pretty unhelpful, given how many Unicode characters occupy more than one byte. In an encoding like UTF-16, that's "all of them" but even in UTF-8 it's still "most of them".

andreareina5y ago

Doesn't splitting by grapheme cluster also depends on which version of the unicode standard you use, since new standards come with new combinations?

kevincox5y ago

The standard answer is "don't". Just treat text is a blob, but the other question is what are you trying to accomplish?

- Are you trying to control the rendered length? In that case the perfect solution is actually rendering the string.

- Are you limiting storage size? Then you need to find a good split point that is <N bytes. This is probably done using extended grapheme clusters. (Although this also isn't perfect)

I'm sure there are other use cases as well. But at the end of the day try to avoid splitting text if it can be helped.

_ZeD_5y ago

it think the only resonable rule for substring(m, n) is "don't"

mojuba5y ago

So string is no longer a "string of characters", it is in fact a program (not Turing complete) that you need to execute.

Though substring(m, n) still makes sense in at least interactive text manipulation: how do you do copy/paste?

4 more replies

EMM_3865y ago

It is up to the implementation.

This is a good read on aspects of it:

https://hsivonen.fi/string-length/

RedNifre5y ago

Maybe have m,n refer to grapheme clusters instead of bytes/code points?

mojuba5y ago

mannerheim5y ago· 5 in thread

> Currently they are used for these three flags only: England, Scotland and Wales:

Not quite true, you can get US state flags with this as well.

Sniffnoy5y ago

TheRealSteel5y ago

Does this have anything to do with why Google Keyboard/Gboard doesn't have the Scottish flag? It's by far my most used emoji and my keyboard not having it drives me nuts.

scatters5y ago

Why not switch to a keyboard that does have it?

petepete5y ago

I've never seen them use, have they actually been implemented by any of the creators?

mannerheim5y ago

If I type the following into ghci, I get the state flag of Texas:

putStrLn "\x1f3f4\xe0075\xe0073\xe0074\xe0078\xe007f"

https://en.wikipedia.org/wiki/Tags_(Unicode_block)

Edit:

Just for fun:

import Data.StateCodes

import Data.Char

putStrLn $ map (map toLower . show . snd) allStates >>=

\stateCode -> '\x1f3f4':map (toEnum . (0xe0000+) . fromEnum) ("us" ++ stateCode) ++ "\xe007f"

2 more replies

BlueGh0st5y ago· 4 in thread

I wish I could read this without getting a migraine. The "darkmode" joke was funny until I realized there was no actual way to turn it on.

jffry5y ago

Firefox's reader mode works great and includes a dark theme.

The icon shows up in the right side of the URL bar, but you can always force it by prepending the URL, e.g. about:reader?url=<url>

tobz10005y ago

https://darkreader.org/

sundarurfriend5y ago

vlmutolo5y ago

I always edit the CSS style when this site comes up

MrGilbert5y ago· 4 in thread

pta20025y ago

Google's Noto Fonts[1] attempt to cover all of Unicode and are released under the SIL Open Font License.

[1] https://www.google.com/get/noto/

MrGilbert5y ago

That looks incredible complete, thank you!

dan-robertson5y ago

There’s a project called, I think, gnufont but their font is a bitmap font...

MrGilbert5y ago

Ah, thank you! Searching for "gnufont" brought me to[1], which looks pretty nice indeed.

[1]: https://www.gnu.org/software/freefont/

1 more reply

devadvance5y ago· 3 in thread

Another thing worth calling out: you can get involved in emoji creation and Unicode in general. You can do this directly, or by working with groups like Emojination [0].

[0] http://www.emojination.org/

codetrotter5y ago

The emojination website mentions UTC and ESC. UTC in this context certainly means Unicode Technical Committee. And after a bit of Googling it seems that ESC is the Unicode Emoji Subcommittee.

Some of the suggested emojis are marked as UTC rejected, some as ESC rejected or ESC pushback. Does it mean that both UTC and ESC has to approve each suggested emoji?

And is there a place to see the reason for rejection and a place to see what kind of pushback they are receiving?

dgellow5y ago

It's for "Emoji SubCommittee" (aka ESC).

> Unicode Emoji Subcommittee:

> The Unicode Emoji Subcommittee is responsible for the following:

> - Updating, revising, and extending emoji documents such as UTS #51: Unicode Emoji and Unicode Emoji Charts.

> - Taking input from various sources and reviewing requests for new emoji characters.

> - Creating proposals for the Unicode Technical Committee regarding additional emoji characters and new emoji-related mechanisms.

> - Investigating longer-term mechanisms for supporting emoji as images (stickers).

From https://unicode.org/emoji/techindex.html

Edit: Welp, the parent comment was asking what "ESC" stands for, but has now been updated, so this comment is now outdated :)

1 more reply

lifthrasiir5y ago

It's complicated. So this mainly boils down to the relationship between UTC and ESC.

[1] https://www.unicode.org/L2/

[2] https://www.unicode.org/emoji/emoji-requests.html

[3] https://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pd...

[4] https://www.unicode.org/L2/L2017/17192-response-cmts.pdf

[5] https://www.unicode.org/emoji/proposals.html

lifthrasiir5y ago· 2 in thread

> One weird inconsistency I’ve noticed is that hair color is done via ZWJ, while skin tone is just modifier emoji with no joiner. Why? Seriously, I am asking you: why? I have no clue.

[1] https://www.unicode.org/L2/L2014/14213-skin-tone-mod.pdf

[2] https://www.unicode.org/L2/L2015/15029r-zwj-emoji.pdf

[3] https://www.unicode.org/L2/L2017/17082-natural-hair-color.pd...

[4] https://www.unicode.org/L2/L2017/17193-hair-colour-proposal....

[5] https://www.unicode.org/L2/L2017/17283-response-hair.pdf

kevincox5y ago

Randal Monroe was also wondering why most of the emoji aren't just modifiers: https://xkcd.com/1813/

vanderZwan5y ago

I wonder how many years it'll take for someone to train a neural network to generate emojis for all possible modifiers, regardless of whether they're currently real combinations.

3 more replies

rkangel5y ago· 2 in thread

The article is great, but there is one slightly misleading bit at the start:

> The most popular encoding we use is called Unicode, with the two most popular variations called UTF-8 and UTF-16.

Unicode is a list of codepoints - the characters talked about in the rest of the article. These live in a number space that's very big (~2^23 as discussed).

rectang5y ago

For people who want to hear more on this subject I gave a talk for Papers We Love Seattle on UTF-8, its origins and evolution, and how it compares against other encodings:

https://www.youtube.com/watch?v=mhvaeHoIE24

"Smiling Cat Face With Heart Eyes Emoji" plays a major role. :)

It doesn't cover the same ground as this wonderful post with its study of variation selectors and skin-tone modifiers, but it provides the prerequisites leading up to it.

> UTF-16 is an abomination that is largely Microsoft's fault

rkangel5y ago

>> UTF-16 is an abomination that is largely Microsoft's fault

> I think that's unfair.

Fair enough. It was a moderately 'emotional' response caused by some painful history of issues caused by 2-byte assumptions.

vanderZwan5y ago· 2 in thread

> Flags don’t have dedicated codepoints. Instead, they are two-letter ligatures. (...) There are 258 valid two-letter combinations. Can you find them all?

Well this nerd-sniped me pretty hard

https://next.observablehq.com/@jobleonard/which-unicode-flag...

That was a fun little exercise, but enough time wasted, back to work.

mercer5y ago

Haha, playing around with reversing flags was the first thing I thought about trying.

vanderZwan5y ago


truefossil5y ago· 2 in thread

I wonder why Mediterranean nations switched from ideograms to alphabet as soon as one was invented. Probably they did not have enough surplus grain to feed something like the Unicode consortium?

meepmorp5y ago

It's a bit of a just-so story, but also completely plausible.

kps5y ago

peteretep5y ago· 2 in thread

An excellent article, although:

> “Ü” is a single grapheme cluster, even though it’s composed of two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING DIAERESIS.

would be a great opportunity to talk about normalization forms, because there’s also a single code point version: U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS.
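A quick sketch of what normalization does here, using Python's standard `unicodedata` module (variable names are just for illustration):

```python
import unicodedata

decomposed = "U\u0308"  # U+0055 followed by U+0308 COMBINING DIAERESIS
composed = "\u00DC"     # U+00DC, the precomposed single code point

# The two strings look the same on screen but compare unequal as-is.
same_before = decomposed == composed

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(same_before, nfc == composed, nfd == decomposed)
```

Normalizing both sides to the same form before comparing is the usual way to make the two spellings compare equal.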

colejohnson665y ago

Does anyone know the history behind why there are two ways to “encode” things like that? What’s the rationale for having both combining and precombined codepoints?

bombcar5y ago

I may be wrong however.

woko5y ago· 2 in thread

> Unicode allocates 2²¹ (~2 mil) characters called codepoints. Sorry, programmers, but it’s not a multiply of 8 .

Why would 2^21 not be a multiple of 2^3?

RedNifre5y ago

It's a typo, they meant ~2²¹ instead of 2²¹, because it's 17*2^16, which is more like ~2^20.087. (And that's not even true either, since a couple values like FFFF are forbidden)

howtodowtle5y ago

Of course, 17 x 2^16 is also a multiple of 2^3:

17 x 2^16 = 17 x 2^13 x 2^3

(reposted/edited because * was interpreted as formatting)
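The arithmetic in this subthread is easy to check. A small Python sketch (constant names are mine):

```python
import math

PLANES = 17                    # planes 0 through 16
CODE_POINTS_PER_PLANE = 2**16  # 65,536 code points per plane

total = PLANES * CODE_POINTS_PER_PLANE  # size of the code point space
bits = math.log2(total)                 # how many bits that really is
divisible_by_8 = total % 8 == 0         # 17 x 2^13 x 2^3
highest = total - 1                     # last code point, 0x10FFFF

print(total, round(bits, 3), divisible_by_8, hex(highest))
```

So the count is a multiple of 8 either way; what presumably was meant is that 21 bits is not a whole number of bytes.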


artur_makly5y ago· 2 in thread

And how do we as a community propose new icons while considering others to be removed/replaced?

rynt5y ago

99PI did a story on the process of submitting a new emoji request to the Unicode Consortium that you might find interesting: https://99percentinvisible.org/episode/person-lotus-position...

artur_makly5y ago

brilliant. thank you {{ U+1F64F }}

aglionby5y ago· 1 in thread

Great post, entertainingly written.

Back in 2015, Instagram did a blog post on similar challenges they came across implementing emoji hashtags [1]. Spoiler alert: they programmatically constructed a huge regex to detect them.

[1] https://instagram-engineering.com/emojineering-part-ii-imple...

lifthrasiir5y ago

Nowadays you can refer to UAX #31 for hashtag identifiers (first specified in 2016): https://www.unicode.org/reports/tr31/#hashtag_identifiers

ijidak5y ago· 1 in thread

This is eye-opening. So many frustrations I've had with emoji over the years are explained by this post.

Big thank you to the OP.

chronogram5y ago

What kind of frustrations?

kaeruct5y ago· 1 in thread

I'm confused about the part saying flags don't work on Windows because I can see them on Firefox (on Windows). They don't work on Edge though.

tonsky5y ago

I guess FF ships its own version

imtiyaz5y ago· 1 in thread

Never gone to these nitty gritties. Very well explained. Thanks Nikita.

tonsky5y ago

You are welcome! Glad you liked it

breck5y ago

Hawzen5y ago

> The most popular encoding we use is called Unicode

Unicode is a character set, not an encoding. UTF-8, UTF-16, etc. are encodings of that character set.

Robizzle015y ago

Love it, thanks for writing it up.

bewuethr5y ago

This is a really nice overview!

There's a list [3] with all emoji that have two presentations, and the first three rows of the Default Style Values table [4] shows which emoji default to which style.

[1]: https://unicode.org/reports/tr51/#Emoji_Variation_Sequences

[2]: http://unicode.org/reports/tr51/#Emoji_Properties_and_Data_F...

[3]: https://unicode.org/emoji/charts/emoji-variants.html

[4]: https://unicode.org/emoji/charts-13.0/emoji-style.html
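As a rough illustration of the two presentations, here is a Python sketch that requests each style explicitly with a variation selector. U+2764 is my choice of example; how each string actually renders depends on the platform and font:

```python
import unicodedata

BASE = "\u2764"  # HEAVY BLACK HEART, a character with both presentations
VS15 = "\uFE0E"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: request emoji presentation

text_style = BASE + VS15
emoji_style = BASE + VS16

for s in (BASE, text_style, emoji_style):
    print(s, [unicodedata.name(c) for c in s])
```

Without a selector, the renderer falls back to the character's default style.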

yuntei5y ago

tomduncalf5y ago

zimpenfish5y ago

Apropos of nothing, macOS 11.3 beta and iOS 14.5 do support the combined emojis near the bottom - instead of <heart><fire>, I actually get <flaming heart> as expected.
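For reference, a sketch of what that combined emoji looks like as code points, assuming the "heart on fire" sequence is heart + U+FE0F + ZERO WIDTH JOINER + fire. Systems that don't know the sequence fall back to showing the two parts separately:

```python
ZWJ = "\u200D"          # ZERO WIDTH JOINER
heart = "\u2764\uFE0F"  # heart with emoji presentation selector
fire = "\U0001F525"     # FIRE

heart_on_fire = heart + ZWJ + fire
code_points = [f"U+{ord(c):04X}" for c in heart_on_fire]

print(heart_on_fire, code_points)
```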

z3t45y ago

Related: implementing Emoji support in a text editor: https://xn--zta-qla.com/en/blog/editor10.htm

avipars5y ago

itsmeamario5y ago

Great quality post. I'd like to see more things like this on HN. Interesting and I learnt a lot about emojis and UTF.

mshenfield5y ago

It's a post about emojis, but I feel like I understand Unicode better now?

remux5y ago

Great post!
