I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)
Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.
I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.
Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes I have considered alternatives. But it's fun, so I'm doing it anyways.
Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!
- Counting, rendering and collapsing grapheme clusters (like the flag emoji)
- Converting between legacy encodings (Shift JIS, KOI8, etc.) and UTF-8 / UTF-16
- Canonicalization
If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.
IIRC the rust standard library doesn't bother supporting any of the hard parts in unicode. The only real unicode support in std is utf8 validation for strings. All the complex aspects of unicode are delegated to 3rd party crates.
By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.
Rust's core library gives char methods such as is_numeric which asks whether this Unicode codepoint is in one of Unicode's numeric classes such as the letter-like-numerics and various digits. (Rust does provide char with is_ascii_digit and is_ascii_hexdigit if that's all you actually cared about)
So yes, the Rust standard library is carrying around the entire Unicode standard class rule list, among other things. Of course, Rust's library isn't linked wholesale into one vast binary, so if you never use these features your binary doesn't get that code.
I think the trap Unicode got into is that technically they can have infinite emoji, so they just don’t ever have a way to say no to new proposals.
I always feel like those emoji were added on purpose in order to force implementations to fix their unicode support. Before emoji were added, most software had completely broken support for anything beyond the BMP (case study: MySQL's so-called "UTF8" encoding). The introduction of emoji, and their immediate popularity, forced many systems to better support astral planes (that is officially acknowledged: https://unicode.org/faq/emoji_dingbats.html#EO1)
Progressively, emoji using more advanced features got introduced, which force systems (and developers) to fix their unicode-handling, or at least improve it somewhat e.g. skintones for combining codepoints, etc....
> I think the trap Unicode got into is that technically they can have infinite emoji, so they just don’t ever have a way to say no to new proposals.
You should try to follow a new character through the process, because that's absolutely not what happens and shepherding a new emoji through to standardisation is not an easy task. The unicode consortium absolutely does say no, and has many reasons to do so. There's an entire page on just proposal guidelines (https://unicode.org/emoji/proposals.html), and following it does not in any way ensure it'll be accepted.
The problem with Unicode is simply that it’s trying to solve a very hard problem.
Yes, this adds a lot of complexity, but it's really a question of whether that complexity is justified in order to support all of the world's languages. And I think many would argue that it is.
Isn't it about time that we have some common language that every other language builds on?
That language is C. It is debatable whether it was a good choice, but at least this is how it turned out.
You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.
It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.
Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real world string processing problems. Allow me to let you into a secret:
there's never really such a thing as a 'character limit'
There might be a 'printable character width' limit; or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'
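To make the point about competing "lengths" concrete, here is a minimal Python sketch. The helper name `measures` and the sample string are made up for illustration; the only library calls are `str.encode` and `unicodedata.normalize`. It does not attempt display width, which depends on the font and renderer.

```python
import unicodedata


def measures(text):
    """Return several different 'lengths' of the same string (hypothetical helper)."""
    return {
        # What Python's len() counts: Unicode code points.
        "code_points": len(text),
        # Storage or transmission size in UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Size in UTF-16 code units (two bytes each, no BOM).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Code points after canonical composition (NFC).
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
    }


# "a", "n" + COMBINING TILDE, "o", a space, then two regional indicators (E, S).
sample = "an\u0303o \U0001F1EA\U0001F1F8"
print(measures(sample))
```

None of the four numbers is "the" character count; each answers a different question.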
Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need to have a clear sense of what will happen if the user moves a cursor using a left or right arrow, and for exactly what will be deleted when a user hits backspace, or copied/cut and pasted when they operate on a selection. The ij ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all unless you're trying to decide whether to let a user put a cursor in the middle of it or not.
And next to that, I just feel that arguing that there is such a thing as a 'correct' way to reverse "Rijndæl" according to a strict reading of Unicode glyph composability rules is a supremely silly thing to try to do.
I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense, you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.
It took me a bit, but I think I have an answer. It's about 15 years ago. I didn't actually do the original design, but I perpetuated it and didn't remove it. We reversed domain name strings (which, given that they are a subset of ASCII, actually is a well-defined operation) so that the DB we're using, which supported efficient prefix lookups but not suffix lookups, could be used to efficiently query for all subdomains of a given domain, by reversing the domain and using that as the prefix.
I mean this as strong support for your point, not a contradictory "gotcha". I'm a big believer in not doing lots of work to save effort or make correct something you do less than once a decade, e.g., http://www.jerf.org/iri/post/2954 . And it's not even a gotcha anyhow, because we aren't reversing a general string; we were reversing a string very tightly constrained to a subset of ASCII where the operation was fully well-defined. I can't think of when I ever reversed a general string.
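For anyone curious what the reversed-domain trick looks like, here is a rough Python sketch. A sorted list stands in for a store that only supports prefix lookups; the function names and sample domains are invented for illustration and are not taken from the system described above.

```python
import bisect


def reversed_key(domain):
    """Reverse an ASCII domain name character by character."""
    if not domain.isascii():
        raise ValueError("reversal is only well-defined here for ASCII names")
    return domain.lower()[::-1]


def subdomains(sorted_keys, domain):
    """Find all subdomains of `domain` with a single prefix range scan."""
    prefix = reversed_key(domain) + "."
    start = bisect.bisect_left(sorted_keys, prefix)
    found = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: nothing later can match
        found.append(key[::-1])
    return found


domains = ["example.com", "www.example.com", "mail.example.com",
           "example.org", "notexample.com"]
index = sorted(reversed_key(d) for d in domains)
print(subdomains(index, "example.com"))
```

The trailing "." in the prefix is what keeps "notexample.com" out of the result.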
What is “3 >= 2", reversed?
What is “Rijksmuseum”, reversed? (https://en.wikipedia.org/wiki/IJ_(digraph); capitalization isn’t simple here, either: https://en.wikipedia.org/wiki/IJ_(digraph)#Capitalisation)
What is “Schroeder”, reversed? (https://en.wikipedia.org/wiki/Diaeresis_(diacritic)#Printing...)
There is a solution to this which is to compute the list of grapheme clusters, and reverse that.
I really highly doubt it.
How do you reverse this?: مرحبًا ، هذه سلسلة.
Can you do it without any knowledge about whether what looks like one character is actually a special case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN textbox due to an apparent RTL issue?
It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no User Story ever starts "as a visitor to this website I want to be able to see this string in opposite order, no not just that all the bytes are reversed, but you know what I mean."
In Norwegian, “æ” is a letter, so I believe (as a non-speaker) that they would reverse “blåbærene” to “eneræbålb”; but in English, it’s a ligature representing the diphthong “ae”, and if asked to reverse “æsthetic” I would certainly write “citehtsea” and consider “citehtsæ” to be wrong. (And I enjoy writing the ligature; I fairly consistently write and type æsthetic rather than aesthetic, though I only write encyclopædia instead of encyclopaedia when I’m in a particular sort of mood.)
In Dutch, the digraph “ij” is sometimes considered a ligature and sometimes a letter; as a non-speaker, I don’t know whether natives would say that it should be treated as an atom in reversing or not.
And not all languages will have the concept of reversing even letters, let alone other things. Face it: in English we have the concept of reversing things, but it just doesn’t work the same way in other languages. Sure, UAX #29 defines something that happens to be a good heuristic for reversing, but it doesn’t define reversing, and in the grand scheme of things reversing grapheme-wise is still Wrong. “Reversing a string” is just not a globally meaningful concept.
Another person here has cited Cherokee transliteration, where one extended grapheme cluster turns into multiple English letters. You can apply this to translation in general, but also even keep it inside English and ask: what are we reversing? Letters? Phonemes? Syllables? Words? There are plenty of possibilities which are used in different contexts (and it’s mostly in puzzles, frankly, not general day-to-day life).
The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.
And more importantly: What is the use case for a reversed string?
It depends on your point of view. From a strict point of view, it does exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.
It also depends on the version of Unicode you are using, and oh by the way, unicode strings do not come annotated with the version they are in. Since it's supposed to be backwards compatible hopefully the latest works, but I'd be unsurprised if someone can name something whose correct reversal depends on the version of Unicode. And, if not now, then in some later not-yet-existing pair of Unicode standards.
I don't understand why in maths finding one single counter-example is enough to disprove a theorem yet in programming people seem to be happy with 99.x % of success rate. To me, "It may not work perfectly in 100% of the cases" exactly means "no longer possible" as "possible" used to imply that it would work consistently, 100% of the time.
If you really wanted to, you could write a string reversal algorithm that treated two-character emojis as an indivisible element of the string and preserved its order (just as you'd need to preserve the order of the bytes in a single multi-byte UTF-8 character). You'd just need to carefully specify what you mean by the terms "string", "character" and "reverse" in a way that includes ordered, multi-character sequences like flag emojis.
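A minimal sketch of that idea in Python, assuming "two-character emoji" means a pair of regional indicator symbols (U+1F1E6 to U+1F1FF). The function names are hypothetical, and this deliberately ignores every other kind of multi-codepoint sequence (skin tones, ZWJ families, combining marks).

```python
def is_regional_indicator(ch):
    """True for the 26 regional indicator symbols used to build flags."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def reverse_keeping_flags(text):
    """Reverse code points, but keep each regional indicator pair in order."""
    units = []
    i = 0
    while i < len(text):
        if (is_regional_indicator(text[i])
                and i + 1 < len(text)
                and is_regional_indicator(text[i + 1])):
            units.append(text[i:i + 2])  # one flag, treated as indivisible
            i += 2
        else:
            units.append(text[i])
            i += 1
    return "".join(reversed(units))


spain = "\U0001F1EA\U0001F1F8"  # regional indicators E + S
assert reverse_keeping_flags("ab" + spain) == spain + "ba"
# Plain slicing swaps the indicators instead, giving S + E.
assert ("ab" + spain)[::-1] == "\U0001F1F8\U0001F1EA" + "ba"
```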
This is a bit of a hobby horse, but imagine if every time you read an article in English on your phone some of the letters were replaced with "equivalent" Greek or Cyrillic one and you can get an idea of the annoyance. Yeah, you can still read it with a bit of thought, but who wants to read that way?
The subset of equivalent letters, or different ones? If they looked the same, it wouldn't bother me if the letters in the center were a single codepoint between European languages:
https://upload.wikimedia.org/wikipedia/commons/8/84/Venn_dia...
If someone challenges you to reverse an image, what do you do? Do you invert the colors? Mirror horizontally? Mirror vertically? Just reverse the byte order?
Case in point: a "struct" in languages like C and Rust is literally a specification of how to treat segments of a "string" of contiguous bytes.
But these languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality all the language supports are byte arrays, with some syntactical sugar so you can pretend they’re strings.
Newer languages, like Go and Python 3, that were created in the world of Unicode provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools to make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode, because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretences of true string manipulation vanish at the same time.
This is not to say that C can’t handle Unicode etc. The language just doesn’t provide true primitives to manipulate strings and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Just as baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.
Is it a PASCAL string (length byte followed by data) or a C string (arbitrary run of bytes terminated by a null character)?
Tragically, in C, a string is just barely a data structure, because it must have \0 at the end.
If it were the complete absence of a data structure, we would need some way to get at the length of it, and could treat a slice of it as the same sort of thing as the thing itself.
Edit: Turns out my browser wasn't rendering the flags.
The reason was because they didn't want to be caught up in any arguments about what flag to render for a country during any dispute, as with, e.g. the flag for Afghanistan after the Taliban took control.
[Microsoft had this same issue with the timezone map in Windows. The early versions were cool and had country borders, but then I think it was India/Pakistan threw a fit and it was simplified to take the borders out]
Except it has been deleted from the ISO 3166-2 registry, so not having it is perfectly valid (arguably more so than having it).
Flags have another issue here in that they can change even when the country stays the same - a recent example here being Afghanistan, but also France who recently changed the official shades of the colors in their flag. Ideally you'd want a new Unicode representation for any changed flags in order to not retroactively change the meaning in old documents.
It would have been difficult to get the CN delegation to sign off on a list that contained TW, although there are probably others.
As a result, two-letter ISO codes are useless for many potential applications, such as, for example, recording which country a book was published in, unless you supplement them with a reference to a particular version of the standard.
Is there a way of getting the Czechoslovakian flag as an emoji? And did Serbia and Montenegro get round to making a flag?
But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
Flags are not that hard, they're a very specific block combining in very predictable way. They're little more than ligatures. Family emoji are much harder.
And this is not "post-Unicode" in any way.
If such re-purposing continues, it might be easier to go straight to utf-32 for some use cases.
What's more, it's really not that difficult to start at the end of a valid UTF-8 string and get the characters in reverse order. UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
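Here is a small Python sketch of walking a UTF-8 buffer backwards. It assumes the input is already valid UTF-8 and relies only on the fact that continuation bytes have the bit pattern 10xxxxxx; the generator name is made up for illustration.

```python
def code_points_reversed(data):
    """Yield the code points of valid UTF-8 `data`, last one first."""
    end = len(data)
    while end > 0:
        start = end - 1
        # Continuation bytes look like 0b10xxxxxx; step back past them
        # until we reach the lead byte of this code point.
        while start > 0 and (data[start] & 0xC0) == 0x80:
            start -= 1
        yield data[start:end].decode("utf-8")
        end = start


# 1-byte, 2-byte, 3-byte and 4-byte code points.
text = "a\u00f1\u20ac\U0001F600"
assert "".join(code_points_reversed(text.encode("utf-8"))) == text[::-1]
```

This only recovers code points in reverse order; it says nothing about grapheme clusters.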
So python behaves as expected: the 2 character string, when reversed, becomes "SU". Similar stuff happens with the other "flag" strings.
I'm sure emojis in my phone are outdated. I'm not sure how that affects whether I see a flag or letters.
To update chrome, I'd have to give it permission to access my contacts. That ain't happening. (Phone OS is too old for per-app permissions)
https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.
Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?
My impression is most modern languages that bother with unicode (swift, rust, nim) are using utf-8, and doing linear time operations to handle unicode. I think that's the right approach, as I don't recall ever needing random access on a unicode string.
[0]: http://www.unicode.org/reports/tr29/#Table_Combining_Char_Se...
> A separate mechanism (emoji tag sequences) is used for regional flags, such as England, Scotland, Wales, Texas or California. It uses U+1F3F4 WAVING BLACK FLAG and formatting tag characters instead of regional indicator symbols. It is based on ISO 3166-2 regions with hyphen removed and lowercase, e.g. GB-ENG → gbeng, terminating with U+E007F CANCEL TAG. Flag of England is therefore represented by a sequence U+1F3F4, U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F.
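The construction in that quote can be sketched in a few lines of Python. The function name `subdivision_flag` is hypothetical; the only facts it relies on are the ones quoted above plus the tag characters sitting at U+E0000 plus the ASCII code. Whether a given sequence actually renders as a flag is up to the font.

```python
TAG_BASE = 0xE0000              # tag characters mirror ASCII at this offset
WAVING_BLACK_FLAG = "\U0001F3F4"
CANCEL_TAG = "\U000E007F"


def subdivision_flag(region_code):
    """Build an emoji tag sequence from an ISO 3166-2 style code like 'GB-ENG'."""
    tag_text = region_code.replace("-", "").lower()
    if not (tag_text.isascii() and tag_text.isalnum()):
        raise ValueError("expected ASCII letters and digits, e.g. 'GB-ENG'")
    tags = "".join(chr(TAG_BASE + ord(c)) for c in tag_text)
    return WAVING_BLACK_FLAG + tags + CANCEL_TAG


england = subdivision_flag("GB-ENG")
print(" ".join(f"U+{ord(c):04X}" for c in england))
# U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F
```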
https://bugs.chromium.org/p/chromium/issues/detail?id=127243...
Previously discussed:
https://news.ycombinator.com/item?id=20914184
https://news.ycombinator.com/item?id=26591373
As for this article & Python - as usual it is biasing towards convenience and implicit behavior rather than properly handling all edge cases.
Compare with Rust where you can't "reverse" a string - that is not a defined operation. But you can either break it into a sequence of characters or graphemes and then reverse that, with expected results: https://play.rust-lang.org/?version=stable&mode=debug&editio...
(Sadly the grapheme segmentation is not part of standard library, at least yet)
Seeing as grapheme segmentation is a moving target that only makes sense.
Basically: A single, unpaired RIS counts as a single grapheme. Similarly, a pair of RIS counts as a single grapheme. Now imagine if your cursor position is after an RIS, and you arrow backwards (assuming LTR text, imagine your cursor is to the right of an RIS, and you press the left arrow.) Your textbox should now move the cursor to the left by one grapheme. How do you figure out where this is, in code units? You basically have to scan backwards until you find the first non-RIS codepoint, and then you have to match them up into pairs to figure out if your left-arrow movement should correspond to a movement of one or two codepoints.
This is a longstanding source of bugs, and if you're bored you can play around with pasting a huge sequence of flags into a textfield and then trying to navigate around it with the arrow keys/mouse. There are some broken implementations out there.
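The backward scan described above can be sketched like this in Python. It is an illustration only: positions are code point indices rather than code units, `cursor_left` is a made-up name, and every grapheme rule other than regional indicator pairing is ignored.

```python
def is_regional_indicator(ch):
    """True for the regional indicator symbols U+1F1E6..U+1F1FF."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def cursor_left(text, pos):
    """Return the code point index one left-arrow press before `pos`."""
    if pos <= 0:
        return 0
    if not is_regional_indicator(text[pos - 1]):
        return pos - 1
    # Count the whole run of regional indicators ending at the cursor.
    run = 0
    while pos - run > 0 and is_regional_indicator(text[pos - run - 1]):
        run += 1
    # Pairs are formed from the start of the run, so an odd-length run
    # ends in a lone indicator and an even-length run ends in a full pair.
    return pos - 1 if run % 2 == 1 else pos - 2


# "x", then E S, F R, and one unpaired D: five indicators in a row.
text = "x\U0001F1EA\U0001F1F8\U0001F1EB\U0001F1F7\U0001F1E9"
positions = [len(text)]
while positions[-1] > 0:
    positions.append(cursor_left(text, positions[-1]))
print(positions)  # [6, 5, 3, 1, 0]
```

Note that the cost of one key press grows with the length of the run, which is exactly why pasting a huge sequence of flags is a good stress test.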
edit: while I'm thinking about this I will point out that an alternative design, which would have solved this problem (and which was first pointed out to me by @raphlinus) would have been to have two separate sets of RI symbols, one for 'first position' and one for 'second position'; then you could always determine the appropriate cursor position without needing context. Isn't hindsight a wonderful thing?
Gladly, the creators of UTF-16 did have that foresight so at least we don't have this problem at the code unit -> code point level.
So are there any good libraries that can deal with code points that are merged together into a single pictographic and reverse them "as expected"?
An interesting exercise would be emoji detection during string reversal to preserve the original emoji. I thought something like that would be the crux of the article.
Am I wrong about single character emojis?
Flag emojis and others are displayed in double the size on Windows 10 using Firefox Nightly https://bugzilla.mozilla.org/show_bug.cgi?id=1746795
Edit: Actually Firefox ships a copy of twemoji for fallback purposes, so flags will still render.
I definitely thought it'd be something like [I am a Flag] and [The flag ID between 0 and 65535]. And reversing it would be [Flag ID] + [I am a Flag] which would not be a defined "component" and instead rendered as the individual two nonsense characters.
Whether "reversing flag emojis" causes such transformations will depend on what is meant by "reversing", which is kind of the whole point here: there are a number of possible interpretations of "reverse".
'(Spanish flag)'[::-1]
basically ''.join([chr(127466), chr(127480)]) vs. ''.join([chr(127466), chr(127480)])[::-1]
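Spelled out a bit more, as a sketch: the two numbers above are the regional indicator symbols for E and S, so the trick works for any pair of country codes that are mirror images of each other. The helper names `flag` and `country_code` are invented here.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(code):
    """Map a two-letter code such as 'ES' to its regional indicator pair."""
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A"))
                   for c in code.upper())


def country_code(flag_text):
    """Map regional indicator symbols back to ASCII letters."""
    return "".join(chr(ord(c) - REGIONAL_INDICATOR_A + ord("A"))
                   for c in flag_text)


spain = flag("ES")
assert [ord(c) for c in spain] == [127466, 127480]
print(country_code(spain[::-1]))  # SE
```

Whether the reversed pair shows up as another flag or as two letters depends on whether the reversed code is one your font knows about.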
I'll add this to my collection of party tricks and show myself out.
Cool article!
(1) https://en.wikipedia.org/wiki/Emoji#Joining
edit: forgot HN doesn't render emojis. Better read it directly on Wikipedia i guess.
As to the content, for all the deep dive, a simple link to https://unicode.org/reports/tr51/#Flags and what an emoji is, would have saved so much exposition. I also wish he'd touched on normalization. With the amount of time he's demanding from readers he could have mentioned this important subject. Because then he could discuss why (starting from his emoji example) a-grave (à) might or might not be reversible, depending how the character is composed.
Also wish he'd pointed to some libraries that can do such reversals.
[1] ``Trojan Source: Invisible Vulnerabilities'': https://trojansource.codes/trojan-source.pdf
I once had the very unpleasant experience of debugging a case where data saved with R on windows and loaded on macOS ended up with individually double-encoded codepoints.
Not fun.
> Reverses a string. Technically, this function reverses the codepoints in a string and its main utility is for reversed-order string processing [...]. See also [...] `graphemes` from module Unicode to operate on user-visible "characters" (graphemes) rather than codepoints.
Properly reversing a string of flags (or any other grapheme clusters) is just a `using Unicode: graphemes` away.
I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
1. You are acting in byte space and it’s pretty unambiguous what should happen. We are not acknowledging the semantics of language and alphabets.
2. You’re acting in language space and these operations will behave the way you probably think they should (depending on your cultural expectations, probably)
Example: https://github.com/kennell/flagz/blob/master/flagz.py
I'm trying to find the best combination of UTS-46, UTS-51, UTS-39, and prior work on IDN resolution w/r/t confusables: https://adraffy.github.io/ens-normalize.js/test/report-confu...
Personally, I found the Unicode spec very messy. Critical information is all over the place. You can see the direct effect of this when you compare various packages across different languages and discover that every library disagrees in multiple places. Even JS String.normalize() isn't consistent in the latest version of most browsers: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht... (fails in Chrome, Safari)
The major difference between ENS and DNS is that emoji are front and center. ENS resolves by computing a hash of a name in a canonicalized form. Since resolution must happen decentralized, simply punting to punycode and relying on custom logic for Unicode-handling isn't possible. On-chain records are 1:1, so there's no fuzzy matching either. Additionally, ENS is actively registering names, so any improvement to the system must preserve as many names as possible.
At the moment, I'm attempting to improve upon the confusables in the Common/Greek/Latin/Cyrillic scripts, and will combine these new groupings with the mixed-script limitations similar to IDN handling in Chromium.
Interactive Demo: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Also this emoji report is pretty cool: https://adraffy.github.io/ens-normalize.js/test/report-emoji...
Yes it's a pain, but the way the standard library designed its types forces you to handle conversions correctly, for example when byte arrays are converted to UTF-8 strings and may contain invalid UTF-8 sequences.
Unicode defines grapheme clusters[1] that represent "user-perceived characters"; separating a string into those and reversing them seems like a pretty good way to go about it.
let v = "Flag: "
String(v.reversed()) // Output: :galF
v.count // Output: 7
You can't reverse the bytes in UTF-8 or UTF-16, because you'll scramble the encoding. But you could parse the string, codepoint-at-a-time, handling the specifics of UTF-8, or UTF-16 with its surrogate pairs, and reverse those. This sounds equivalent to reversing UTF-32, and I believe is what the original poster was imagining.
Except you can't do that, because Unicode has composing characters. Now, I'm American and too stupid to type anything other than ASCII, but I know about n+~ = ñ. If you have the pre-composed version of ñ, you can reverse the codepoint (it's one codepoint). If you don't have it, and you have n+dead ~, you can't reverse it, or in the word "año" you might put the ~ on the "o". (Even crazier things happen when you get to the ligatures in Arabic; IIRC one of those is about 20 codepoints.)
So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about the even more exotic stuff like Emoji + skin tone. It's necessary to be very careful.
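The "año" case is easy to reproduce in Python. This is a sketch, not a full grapheme segmenter: `reverse_keeping_marks` is a made-up helper that only glues characters with a nonzero canonical combining class onto the preceding character, which is far less than what UAX #29 specifies.

```python
import unicodedata


def reverse_keeping_marks(text):
    """Reverse text while keeping combining marks attached to their base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


decomposed = "an\u0303o"  # "n" followed by COMBINING TILDE

# Naive code point reversal moves the tilde onto the "o".
naive = unicodedata.normalize("NFC", decomposed[::-1])
assert naive == "\u00f5na"   # o-with-tilde, n, a

# Keeping the mark with its base gives the expected result.
careful = unicodedata.normalize("NFC", reverse_keeping_marks(decomposed))
assert careful == "o\u00f1a"  # o, n-with-tilde, a
```

With the precomposed form (U+00F1) both approaches agree, which is why the bug tends to hide until decomposed input shows up.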
Now, the old fart in me says that ASCII never had this problem. But the old fart in me knows about CRLF in text protocols, and that's never LFCR; and that if you want to make a ñ in ASCII you must send n ^H ~. I guess you can reverse that, but if you want to do more exotic things it becomes less obvious.
(IIRC UCS-2 is the deadname, now we call it UTF-16 to remind us to always handle surrogate pairs correctly, which we don't.)
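For reference, here is what a surrogate pair actually is, as a short Python sketch. The function name is hypothetical; the arithmetic is the standard UTF-16 encoding of code points above U+FFFF, checked against Python's own `utf-16-be` encoder.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need a pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# Regional indicator E, the first half of the Spanish flag.
high, low = to_surrogates(0x1F1EA)
assert (high, low) == (0xD83C, 0xDDEA)
assert "\U0001F1EA".encode("utf-16-be") == bytes([0xD8, 0x3C, 0xDD, 0xEA])
```

High and low surrogates come from disjoint ranges, so a code unit always tells you which half of a pair it is; code that treats UTF-16 as UCS-2 just never looks.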
TLDR: Strings are hard.
There's another word that comes to mind when thinking about those two: metastasis.