Why can't you reverse a string with a flag emoji?

(davidamos.dev)

189 points · da12 · 4y ago · 239 comments

161 comments · 58 top-level

coreyp_14y ago· 15 in thread

If you think the Unicode flag emoji take a lot of bytes, then consider the family emoji! (https://unicode.org/emoji/charts/full-emoji-list.html#family)
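To put rough numbers on that comparison, here is a minimal Python sketch. The helper name `sizes` is made up for illustration, and the family sequence used is the man, woman, girl, boy one joined with zero-width joiners; other family variants will give different counts.

```python
# Flag: two regional indicator symbols (U and S).
flag = "\U0001F1FA\U0001F1F8"
# Family: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def sizes(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for s."""
    return (
        len(s),
        len(s.encode("utf-8")),
        len(s.encode("utf-16-le")) // 2,
    )


print(sizes(flag))    # (2, 8, 4)
print(sizes(family))  # (7, 25, 11)
```

Both render as a single glyph on systems that support them, which is why none of these three counts matches what a reader would call "one character".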

I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)

Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.

I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.

Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes I have considered alternatives. But it's fun, so I'm doing it anyways.

Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!

josephg4y ago

Handling unicode can be fine, depending on what you're doing. The hard parts are:

- Counting, rendering and collapsing grapheme clusters (like the flag emoji)

- Converting between legacy encodings (Shift JIS, KOI8, etc.) and UTF-8 / UTF-16

- Canonicalization

If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.
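As a small illustration of why canonicalization is on that list, the sketch below uses Python's standard `unicodedata` module. The helper `same_text` is a hypothetical name, and NFC is just one of the normalization forms one might pick.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: canonically equivalent
```

A plain byte buffer comparison sees two different strings here, which is fine as long as nothing in the program needs to treat them as the same text.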

IIRC the rust standard library doesn't bother supporting any of the hard parts in unicode. The only real unicode support in std is utf8 validation for strings. All the complex aspects of unicode are delegated to 3rd party crates.

By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.

tialaramex4y ago

> The only real unicode support in std is utf8 validation for strings.

Rust's core library gives char methods such as is_numeric which asks whether this Unicode codepoint is in one of Unicode's numeric classes such as the letter-like-numerics and various digits. (Rust does provide char with is_ascii_digit and is_ascii_hexdigit if that's all you actually cared about)

So yes, the Rust standard library is carrying around the entire Unicode standard class rule list, among other things. Of course, Rust's library isn't built into one vast binary, so if you never use these features your binary doesn't get that code.

Gigachad4y ago

It always feels like the most amount of work goes to the least used emoji. So many revisions and additions to the family emoji and yet it’s one of the ones I don’t recall anyone ever using.

I think the trap Unicode got in to is technically they can have infinite emoji so they just don’t ever have a way to say no to new proposals.

masklinn4y ago

> It always feels like the most amount of work goes to the least used emoji.

I always feel like those emoji were added on purpose in order to force implementations to fix their unicode support. Before emoji were added, most software had completely broken support for anything beyond the BMP (case study: MySQL's so-called "UTF8" encoding). The introduction of emoji, and their immediate popularity, forced many systems to better support astral planes (that is officially acknowledged: https://unicode.org/faq/emoji_dingbats.html#EO1)
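A quick way to see what "beyond the BMP" means in practice, sketched in Python. The helper names are illustrative only; the point is that an astral code point needs four bytes in UTF-8 and a surrogate pair in UTF-16, which is exactly what a three-byte-maximum "UTF8" column cannot store.

```python
def is_astral(ch):
    """True if the code point lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def utf8_len(ch):
    """Number of bytes the code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


euro = "\u20ac"          # inside the BMP
grin = "\U0001F600"      # grinning face, outside the BMP

for ch in (euro, grin):
    print(hex(ord(ch)), is_astral(ch), utf8_len(ch), ch.encode("utf-16-be").hex())
```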

Progressively, emoji using more advanced features got introduced, which forced systems (and developers) to fix their Unicode handling, or at least improve it somewhat, e.g. skin tones for combining codepoints.

> I think the trap Unicode got in to is technically they can have infinite emoji so they just don’t ever have a way to say no to new proposals.

You should try to follow a new character through the process, because that's absolutely not what happens and shepherding a new emoji through to standardisation is not an easy task. The unicode consortium absolutely does say no, and has many reasons to do so. There's an entire page on just proposal guidelines (https://unicode.org/emoji/proposals.html), and following it does not in any way ensure it'll be accepted.

laumars4y ago

They do say no though. Frequently too.

The problem with Unicode is simply that it’s trying to solve a very hard problem.

jonas214y ago

This work wasn't done for emoji. They use the same zero-width joiner character [1] that exists to support Indic scripts like Devanagari, and any system that properly handles these languages will also properly handle the emoji.

Yes, this adds a lot of complexity, but it's really a question of whether that complexity is justified in order to support all of the world's languages. And I think many would argue that it is.

[1] https://en.wikipedia.org/wiki/Zero-width_joiner

Vindicis4y ago

I know how that feels. I wrote a little C++ program to fetch data in Unicode from a DB and then normalize it to ASCII to be used for analytic purposes. It's a lot faster to do it on ASCII than trying to handle all the fun cases of how many ways an e etc. can be input. ICU to the rescue! Took a couple weeks of getting up to speed, as ICU itself wasn't too bad to figure out. But you find out very quickly that to use it, you need to have a good understanding of a number of the Unicode technical reports to actually understand how to make use of it. Fun times indeed.

DecoPerson4y ago

Do you have a YouTube for people to subscribe to in anticipation of you releasing your YouTube series about your work? The development process of new languages is so intriguing.

coreyp_14y ago

I'll post about it here on HN after I have a few episodes up.

dagmx4y ago

It would actually be pretty interesting to see how you use Bison and Flex with utf-8. Most resources say to not bother due to lack of support for Unicode, but they're so ubiquitous

account424y ago

Do they need special support for UTF-8? One of the nice things about UTF-8 is that you can treat it as an 8-bit encoding in many cases if you only care about substrings and don't need to decode individual non-ASCII characters.
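A rough Python sketch of that property: searching and splitting on the raw UTF-8 bytes gives correct results without decoding anything, because no character's encoding can appear inside another's. The function name `find_utf8` is hypothetical.

```python
def find_utf8(haystack, needle):
    """Return the byte offset of needle in haystack, working on UTF-8 bytes only."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


text = "na\u00efve caf\u00e9 \U0001F1FA\U0001F1F8 flag"
raw = text.encode("utf-8")

print(find_utf8(text, "caf\u00e9"))  # 7 (a byte offset, not a character index)

# Splitting on an ASCII delimiter never cuts a multi-byte sequence in half,
# so every piece is still valid UTF-8.
pieces = [piece.decode("utf-8") for piece in raw.split(b" ")]
print(pieces)
```

This is the sense in which a byte-oriented lexer can cope with UTF-8 input: as long as the tokens it cares about are delimited by ASCII, the non-ASCII bytes can pass through as opaque data.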

amelius4y ago

Why is this stuff even reinvented for every programming language?

Isn't it about time that we have some common language that every other language builds on?

johndough4y ago

> Isn't it about time that we have some common language that every other language builds on?

That language is C. It is debatable whether it was a good choice, but at least this is how it turned out.

IncRnd4y ago

So said every writer of a standard immediately before writing another standard to replace all others.

lmm4y ago

ICU is out there with bindings in many languages. People who know what they're doing use ICU.

jerf4y ago· 15 in thread

So, in terms of acing interviews, increasingly one of the best answers to the question "Write some code that reverses a string" is that in a world of unicode, "reversing a string" is no longer possible or meaningful.

You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.

jameshart4y ago

I'd go further and argue that in general reversing a string isn't possible or meaningful.

It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.

Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real world string processing problems. Allow me to let you into a secret:

there's never really such a thing as a 'character limit'

There might be a 'printable character width' limit; or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'

Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need to have a clear sense of what will happen if the user moves a cursor using a left or right arrow, and for exactly what will be deleted when a user hits backspace, or copied/cut and pasted when they operate on a selection. The ĳ ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all unless you're trying to decide whether to let a user put a cursor in the middle of it or not.

And next to that, I just feel that arguing there is such a thing as a 'correct' way to reverse "Rĳndæl" according to a strict reading of Unicode glyph composability rules is a supremely silly thing to try to do.

I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense, you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.

jerf4y ago

Boy, that's implicitly a good question... when's the last time I "reversed" a string, on purpose, for something useful?

It took me a bit, but I think I have an answer. It's about 15 years ago. I didn't actually do the original design, but I perpetuated it and didn't remove it. We reversed domain name strings (which, given that they are a subset of ASCII, actually is a well-defined operation) so that the DB we're using, which supported efficient prefix lookups but not suffix lookups, could be used to efficiently query for all subdomains of a given domain, by reversing the domain and using that as the prefix.
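The trick described here can be sketched in a few lines of Python. This is not the original system: a sorted list with `bisect` stands in for a database that only supports prefix lookups, and the class and method names are invented for illustration.

```python
from bisect import bisect_left


def rev(domain):
    """Reverse an ASCII domain name character by character."""
    return domain[::-1]


class ReversedDomainIndex:
    """Stores reversed domains so a suffix query becomes a prefix scan."""

    def __init__(self, domains):
        self._keys = sorted(rev(d.lower()) for d in domains)

    def subdomains_of(self, domain):
        # "example.com" -> look for keys starting with "moc.elpmaxe."
        prefix = rev("." + domain.lower())
        i = bisect_left(self._keys, prefix)
        found = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            found.append(rev(self._keys[i]))
            i += 1
        return found


index = ReversedDomainIndex([
    "example.com", "www.example.com", "mail.example.com",
    "notexample.com", "example.org",
])
print(index.subdomains_of("example.com"))
```

Including the leading dot in the prefix keeps "notexample.com" out of the results; the domain itself is deliberately not returned.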

I mean this as strong support for your point, not a contradictory "gotcha". I'm a big believer in not doing lots of work to save effort or make correct something you do less than once a decade, e.g., http://www.jerf.org/iri/post/2954 . And it's not even a gotcha anyhow, because we aren't reversing a general string; we were reversing a string very tightly constrained to a subset of ASCII where the operation was fully well-defined. I can't think of when I ever reversed a general string.

Someone4y ago

Even ASCII can be argued to be problematic.

What is “3 >= 2", reversed?

What is “Rijksmuseum”, reversed? (https://en.wikipedia.org/wiki/IJ_(digraph); capitalization isn’t simple here, either (https://en.wikipedia.org/wiki/IJ_(digraph)#Capitalisation))

What is “Schroeder”, reversed? (https://en.wikipedia.org/wiki/Diaeresis_(diacritic)#Printing...)

Spivak4y ago

Reversing a string is still meaningful. Take a step back outside the implementation and imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.

There is a solution to this which is to compute the list of grapheme clusters, and reverse that.

https://unicode.org/reports/tr29/
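Python's standard library has no UAX #29 segmenter, so the sketch below is only a crude stand-in for "compute the clusters, then reverse them": it glues combining marks (non-zero canonical combining class) onto the preceding character and nothing more. It does not handle ZWJ sequences, flags, Hangul, or the other rules in the report; the function names are made up, and a real implementation would lean on ICU or a similar library.

```python
import unicodedata


def rough_clusters(s):
    """Group each base character with the combining marks that follow it.

    This is NOT a UAX #29 implementation, only an approximation for
    strings whose only multi-code-point units are base + combining marks.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def reverse_rough(s):
    """Reverse s cluster by cluster instead of code point by code point."""
    return "".join(reversed(rough_clusters(s)))


word = "cafe\u0301"  # "café" spelled with a combining acute accent
print(rough_clusters(word))
print(reverse_rough(word) == "e\u0301fac")  # True: the accent stays on the e
print(word[::-1] == "\u0301efac")           # True: naive reversal detaches it
```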

akersten4y ago

> imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.

I really highly doubt it.

How do you reverse this?: مرحبًا ، هذه سلسلة.

Can you do it without any knowledge about whether what looks like one character is actually a special case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN textbox due to an apparent RTL issue?

It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no User Story ever starts "as a visitor to this website I want to be able to see this string in opposite order, no not just that all the bytes are reversed, but you know what I mean."

chrismorgan4y ago

UAX #29 is insufficient: at the very least, you must depend on collation too.

In Norwegian, “æ” is a letter, so I believe (as a non-speaker) that they would reverse “blåbærene” to “eneræbålb”; but in English, it’s a ligature representing the diphthong “ae”, and if asked to reverse “æsthetic” I would certainly write “citehtsea” and consider “citehtsæ” to be wrong. (And I enjoy writing the ligature; I fairly consistently write and type æsthetic rather than aesthetic, though I only write encyclopædia instead of encyclopaedia when I’m in a particular sort of mood.)

In Dutch, the digraph “ij” is sometimes considered a ligature and sometimes a letter; as a non-speaker, I don’t know whether natives would say that it should be treated as an atom in reversing or not.

And not all languages will have the concept of reversing even letters, let alone other things. Face it: in English we have the concept of reversing things, but it just doesn’t work the same way in other languages. Sure, UAX #29 defines something that happens to be a good heuristic for reversing, but it doesn’t define reversing, and in the grand scheme of things reversing grapheme-wise is still Wrong. “Reversing a string” is just not a globally meaningful concept.

Another person here has cited Cherokee transliteration, where one extended grapheme cluster turns into multiple English letters. You can apply this to translation in general, but also even keep it inside English and ask: what are we reversing? Letters? Phonemes? Syllables? Words? There are plenty of possibilities which are used in different contexts (and it’s mostly in puzzles, frankly, not general day-to-day life).

The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.

lloeki4y ago

Should it reverse a BOM as well or keep it first?

viktorcode4y ago

You certainly can. `print(String(flag.reversed()))` in Swift reverses emojis correctly.

account424y ago

How does it handle the ASCII examples in https://news.ycombinator.com/item?id=30108184

And more importantly: What is the use case for a reversed string?

paxys4y ago

UTF-8 reverse string has been a thing for a long time in most/all programming languages. It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.

jerf4y ago

"It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible."

It depends on your point of view. From a strict point of view, it does exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.

It also depends on the version of Unicode you are using, and oh by the way, unicode strings do not come annotated with the version they are in. Since it's supposed to be backwards compatible hopefully the latest works, but I'd be unsurprised if someone can name something whose correct reversal depends on the version of Unicode. And, if not now, then in some later not-yet-existing pair of Unicode standards.

jcelerier4y ago

> It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.

I don't understand why in maths finding one single counter-example is enough to disprove a theorem yet in programming people seem to be happy with 99.x % of success rate. To me, "It may not work perfectly in 100% of the cases" exactly means "no longer possible" as "possible" used to imply that it would work consistently, 100% of the time.

greenyoda4y ago

> "reversing a string" is no longer possible or meaningful.

If you really wanted to, you could write a string reversal algorithm that treated two-character emojis as an indivisible element of the string and preserved its order (just as you'd need to preserve the order of the bytes in a single multi-byte UTF-8 character). You'd just need to carefully specify what you mean by the terms "string", "character" and "reverse" in a way that includes ordered, multi-character sequences like flag emojis.
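One possible reading of that specification, sketched in Python: treat each pair of regional indicator symbols as a single unit and every other code point as its own unit. The names are hypothetical, and this covers flags only, not skin tones, ZWJ sequences, or combining marks.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # regional indicator symbols A..Z


def is_regional_indicator(ch):
    return RI_FIRST <= ord(ch) <= RI_LAST


def reverse_keeping_flags(s):
    """Reverse s by code point, but keep regional indicator pairs in order."""
    units = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if len(pair) == 2 and all(is_regional_indicator(c) for c in pair):
            units.append(pair)  # a flag: keep both halves together
            i += 2
        else:
            units.append(s[i])
            i += 1
    return "".join(reversed(units))


us = "\U0001F1FA\U0001F1F8"
print(reverse_keeping_flags("ab" + us) == us + "ba")  # True
print(("ab" + us)[::-1] == us[::-1] + "ba")           # True: naive version flips the pair
```

Pairing from the left within a run of indicators is a choice made here for simplicity; it happens to match how runs of regional indicators are usually segmented.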

happytoexplain4y ago

I would argue that it is possible and meaningful. AFAIK extended grapheme clusters are well defined by the standard, and are very well suited to the default meaning of when somebody says "character", so, given no other information, it's reasonable to reverse a string based on them. I guess the issue is "reverse a string" lacks details, but I think that's different from "not meaningful".

spicybright4y ago

Sure it is, just render the same string in right to left!

emodendroket4y ago· 10 in thread

What I'd like to know is, given the explosion of the character set for emoji, does the rationale for Han unification still make sense? The case for not allowing national variants seems less and less compelling with every emoji they add.

This is a bit of a hobby horse, but imagine if every time you read an article in English on your phone some of the letters were replaced with "equivalent" Greek or Cyrillic one and you can get an idea of the annoyance. Yeah, you can still read it with a bit of thought, but who wants to read that way?

AlanYx4y ago

I agree that Han unification was an unfortunate design decision, but I'd argue that the consortium is following a consistent approach to the Han unification with emoji. For example, they treat "regional" vendor variations in emoji as a font issue. If you get a message with the gun emoji, unless you have out-of-band information regarding which vendor variant is intended, there's no way in software to know if it should be displayed as a water gun (Apple "regional" variant) or a weapon (other vendor variants). Which is not that different from a common problem stemming from Han unification.

emodendroket4y ago

I don't disagree, but my point is more that their concern was about having "too many characters" in Unicode, which no longer seems to be a real concern, so what would be the harm of adding national variants?

account424y ago

Is having skin tone variants (which is something Unicode chose to add, rather than something added because of existing use) consistent with not having distinct variants for glyphs from different languages?

fomine34y ago

Han unification was an attempt to fit CJK characters into the 16-bit BMP. In the end the BMP proved too small, so the unification is now pointless, but reverting it would also cause huge compatibility issues.

emodendroket4y ago

Of course, the old characters must be left alone. But I'm not seeing what stops them from introducing new ones.

digisign4y ago

> were replaced with "equivalent" Greek or Cyrillic one

The subset of equivalent letters, or different ones? If they looked the same, it wouldn't bother me if the letters in the center were a single codepoint between European languages:

https://upload.wikimedia.org/wikipedia/commons/8/84/Venn_dia...

account424y ago

I am disappointed that that diagram omits ꙮ [0]

[0] https://en.wikipedia.org/wiki/Multiocular_O

emodendroket4y ago

The problem is they don't look the same. So imagine, for instance, Я instead of "R" or И instead of "N" (I don't think the sounds are actually equivalent but let's run with it for the sake of example). Not insurmountable. One could still read a text with these substitutions. But it'd be distracting, and extra detrimental for people who don't speak English as their first language.

shalmanese4y ago

It doesn't make sense, but there's also no way to fix it now. Once the Han characters were unified, there's no trivial way to ununify them.

emodendroket4y ago

To an extent that's true, but introducing national variant characters in addition to the unified ones would at least allow careful writers to avoid the problem.

treesknees4y ago· 8 in thread

But you can, and did, reverse a string. It seems you would need more details, such as a request to reverse the meaning or interpretation of the string, which is what the author is getting at.

If someone challenges you to reverse an image, what do you do? Do you invert the colors? Mirror horizontally? Mirror vertically? Just reverse the byte order?

wahern4y ago

There's a specification problem here. I like to say that a "string" isn't a data structure, it's the absence of one. Discussing "strings" is pointless. It follows that comparing programming languages by their "string" handling is likewise pointless.

Case in point: a "struct" in languages like C and Rust is literally a specification of how to treat segments of a "string" of contiguous bytes.

avianlyric4y ago

In languages like C “string” isn’t a proper data structure, it’s a `char` array, which itself is little more than an `int` array or `byte` array.

But these languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality all the language supports are byte arrays, with some syntactical sugar so you can pretend they’re strings.

Newer languages, like Go and Python 3, that were created in the world of Unicode provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools to make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretences of true string manipulation vanish at the same time.

This is not to say that C can’t handle Unicode etc.; it’s just that the language doesn’t provide true primitives to manipulate strings and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Just as baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.

shadowgovt4y ago

Even the most basic ASCII string is still a data structure.

Is it a PASCAL string (length byte followed by data) or a C string (arbitrary run of bytes terminated by a null character)?

jameshart4y ago

Yep, it's as meaningful a programming task as 'reverse this double-precision float'.

samatman4y ago

We would all be better off if this were actually true.

Tragically, in C, a string is just barely a data structure, because it must have \0 at the end.

If it were the complete absence of a data structure, we would need some way to get at the length of it, and could treat a slice of it as the same sort of thing as the thing itself.

egypturnash4y ago

Galaxy brain image reversal: completely redraw it from scratch, with a viewpoint 180º from the original.

McBeige4y ago

If the FoV is less than 180deg then any image would be a realistic solution as long as it doesn't depict anything from the original.

ravi-delia4y ago

New computer vision challenge

yoyohello134y ago· 6 in thread

Maybe I'm missing some prerequisite knowledge here, but why would I assume `flag="us"` is an emoji? Looking at that first block of code, there is no reason for me to think "us" is a single character.

Edit: Turns out my browser wasn't rendering the flags.

happytoexplain4y ago

In Windows Chrome, it doesn't render the emoji for me. In Android Chrome, it renders a flag emoji - not the raw region indicators (which look like the letters "u" and "s").

greenyoda4y ago

In my browser (Firefox on Windows), the thing between the quotes in the first block of code looks like a picture of the US flag cropped to a circle, not like the characters "us".

yoyohello134y ago

Ah I see, I just opened it in firefox. It looks like some JS library is not getting loaded in Edge. The author was talking about "us", "so", etc. looking like one character and I thought I was going crazy, lol.

ljm4y ago

If it's Windows, it doesn't actually use flags for those emojis, it renders a country code instead. If it wasn't supported you would just see the glyph for an unknown character.

The reason was that they didn't want to be caught up in any arguments about what flag to render for a country during any dispute, as with, e.g., the flag for Afghanistan after the Taliban took control.

kingcharles4y ago

Do you have a citation for that? I suspected it was because of the political issues, so I tried hunting down the reason one day and came up blank.

[Microsoft had this same issue with the timezone map in Windows. The early versions were cool and had country borders, but then I think it was India/Pakistan threw a fit and it was simplified to take the borders out]

Benlights4y ago

I had the same issue when I read the article, I kept on getting stuck and asking myself what I was missing.

otagekki4y ago· 6 in thread

If flag emojis are really a combination of 2 special characters, the reversal of the U.S. flag should result in having the Soviet Union flag.

masklinn4y ago

> the reversal of the U.S. flag should result in having the Soviet Union flag.

Except it has been deleted from the ISO 3166-1 registry, so not having it is perfectly valid (arguably more so than having it).

account424y ago

No, that only shows that the ISO 3166-1 registry is a bad basis for Unicode flags, since having things lose meaning over time should not be acceptable for a text encoding.

Flags have another issue here in that they can change even when the country stays the same - a recent example here being Afghanistan, but also France who recently changed the official shades of the colors in their flag. Ideally you'd want a new Unicode representation for any changed flags in order to not retroactively change the meaning in old documents.

brewmarche4y ago

Just tried reversing a Spanish flag with Python and indeed I got Sweden back
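For anyone who wants to repeat the experiment, a minimal Python sketch follows. The helpers `flag` and `letters` are made-up names; they rely only on the fact that regional indicator symbols A to Z occupy U+1F1E6 to U+1F1FF in order.

```python
RI_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(code):
    """Turn a two-letter ASCII code such as "ES" into regional indicators."""
    return "".join(chr(RI_A + ord(c) - ord("A")) for c in code.upper())


def letters(flag_str):
    """Inverse of flag(): recover the ASCII letters from regional indicators."""
    return "".join(chr(ord(c) - RI_A + ord("A")) for c in flag_str)


spain = flag("ES")
print(letters(spain[::-1]))      # SE
print(spain[::-1] == flag("SE"))  # True
```

Whether the reversed pair shows up as a Swedish flag or as two boxed letters depends on the font and platform, not on the string.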

kingcharles4y ago

No-one expects the Swedish flag!

TonyTrapp4y ago

It's up to the installed fonts really. I don't know if the combination of S + U is standardized as a Soviet Union flag emoji, but even if it is, your locally installed fonts may not contain every single flag emoji, so the browser would still fall back to rendering the two letters instead.

jameshart4y ago

I was so disappointed that didn't turn out to be the case.

happytoexplain4y ago· 6 in thread

I guessed that it would become the USSR flag (US -> SU), but apparently Unicode doesn't define that one! I wonder why. That would have been humorous.

ts4z4y ago

IIRC Unicode doesn't define country codes. It was a workaround for a political issue of which countries recognize which other countries.

It would have been difficult to get the CN delegation to sign off on a list that contained TW, although there are probably others.

andylynch4y ago

There are many more than I realised - Wikipedia has a decent list https://en.m.wikipedia.org/wiki/List_of_states_with_limited_...

bloak4y ago

As I understand it, there is no two-letter ISO code for the USSR because when they update the standard they remove countries that no longer exist. In at least one case they have reused a code: CS has been both "Czechoslovakia" and "Serbia and Montenegro", neither of which currently exists.

As a result, two-letter ISO codes are useless for many potential applications, such as, for example, recording which country a book was published in, unless you supplement them with a reference to a particular version of the standard.

Is there a way of getting the Czechoslovakian flag as an emoji? And did Serbia and Montenegro get round to making a flag?

happytoexplain4y ago

Ah, I didn't realize they reused codes from ISO 3166-3. I figured, because they keep these regions around in their own set, that was some implication that the codes would not be reused.

chungy4y ago

Unicode doesn't define any flags, really. That's up to the font rendering on systems/libraries.

happytoexplain4y ago

True, but Unicode explicitly defines "SU" as a deprecated combination, regardless of flags. Seems like they omit everything from the list of "no longer used" country codes, with some exceptions. I would think they would have no reason not to allow historical regions.

kevin_thibedeau4y ago· 5 in thread

This misses the real problem with flag emoji in that they are composed of codepoints that can be in any order. With other emoji you get a base codepoint with potential combining characters. Using a table of combining character ranges you can skip over them and isolate the logical glyph sequences. You don't need surrounding context to parse them out like flags need.

jug4y ago

I think that somewhere in this answer lies a reason why Windows still doesn't support flag emoji. I don't count Microsoft Edge as "Windows" in this case, but as Chromium. Windows doesn't support flag emoji in its native text boxes, but it does support even colorized emoji.

But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.

kingcharles4y ago

Flags are political. Microsoft has removed country borders from its products for political reasons, a post above says the flags rendering was excluded for the same reason.

masklinn4y ago

> But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.

Flags are not that hard, they're a very specific block combining in a very predictable way. They're little more than ligatures. Family emoji are much harder.

And this is not "post-Unicode" in any way.

uniqueuid4y ago

Thanks for that interesting detail!

If such re-purposing continues, it might be easier to go straight to utf-32 for some use cases.

dhosek4y ago

Nope, because the repurposing is independent of how the Unicode is represented. There's absolutely no advantage to having a string in UTF-32 over UTF-8, since you'll still need to examine every character, and the added overhead of decoding UTF-8 bytes into 32-bit code points is far outweighed by the huge memory increase necessary to store UTF-32.

What's more, it's really not that difficult to start at the end of a valid UTF-8 string and get the characters in reverse order. UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
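A minimal Python sketch of walking a UTF-8 buffer from the end, assuming the input is valid UTF-8. It relies on the fact that continuation bytes always have the bit pattern `10xxxxxx`, so stepping back until a byte does not match finds the start of each code point. The function name is illustrative.

```python
def code_points_reversed(data):
    """Yield the code points of a valid UTF-8 byte string, last one first."""
    end = len(data)
    while end > 0:
        start = end - 1
        # Skip backwards over continuation bytes (0b10xxxxxx).
        while start > 0 and (data[start] & 0xC0) == 0x80:
            start -= 1
        yield data[start:end].decode("utf-8")
        end = start


raw = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1-, 2-, 3- and 4-byte sequences
print(list(code_points_reversed(raw)))
```

This yields code points, not grapheme clusters, so a flag would still come out as two separate regional indicators in swapped order.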

Beldin4y ago· 4 in thread

Interestingly, on my phone the so-called flag is not a flag at all, but "US" in outline.

So python behaves as expected: the 2 character string, when reversed, becomes "SU". Similar stuff happens with the other "flag" strings.

I'm sure emojis in my phone are outdated. I'm not sure how that affects whether I see a flag or letters.

pilsetnieks4y ago

Thankfully, there isn't an assigned ISO 3166-1 2-letter country code for SU currently; people may have interesting reactions seeing what happens when reversing a US flag emoji if there were.

easrng4y ago

If this was 1990 (and we somehow had the current emoji standard) SU would be the USSR flag.

kingcharles4y ago

Out of interest, what phone and browser? The only platform I've seen that doesn't render the flags is Windows.

Beldin4y ago

An Android phone from 2014, with a year-out-of-date Chrome.

To update chrome, I'd have to give it permission to access my contacts. That ain't happening. (Phone OS is too old for per-app permissions)

WA9ACE4y ago· 3 in thread

I feel like I'm obligated to share this almost 20 year old Spolsky post that gave me my understanding of characters.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

xmprt4y ago

In that same vein, here's my introduction to Unicode about 10 years ago from Tom Scott.

https://www.youtube.com/watch?v=MijmeoH9LT4

ciupicri4y ago

That's more about the UTF-8 encoding than Unicode itself.

zerox7felf4y ago

poor man gave me and many others something like half of our introduction to computer science, but has gotten far more fame as the "emoji guy" for his repeated bouts with this particular part of unicode :)

xmprt4y ago· 3 in thread

This is a cool article about Unicode encoding; however, I still feel like it should be possible to reverse strings with flag emojis. I don't see why computers can't handle multi-rune symbols in the same way that they handle multi-byte runes. We could combine all the runes that should be a single symbol and make sure that we're maintaining the ordering of those runes in the reversed string. Of course that means that naive string reversing doesn't work anymore, but naive string reversing wouldn't work in the world of UTF-8 if we just went byte by byte.

happytoexplain4y ago

Swift, for example, does what you're saying. I thought that the reason many languages don't do it that way is that part of the definition of an array (or at least expected-by-convention) is constant-time operations. If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there is an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API, instead of literally using an array.

I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.

Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?

nitely4y ago

CPython 3 does use UTF-32 under the hood for strings (there is bytes for plain sequence of bytes). As you say, it's the worst of both worlds. High memory usage, and not really useful if you are dealing with unicode characters (grapheme clusters).

My impression is most modern languages that bother with unicode (swift, rust, nim) are using utf-8, and doing linear time operations to handle unicode. I think that's the right approach, as I don't recall ever needing random access on a unicode string.

nitely4y ago

Of course it's possible. The Unicode standard even has a table[0] you can use to build a DFA (deterministic finite automaton) to break up a string into grapheme clusters. You can reverse the DFA to match and yield the graphemes backwards as well, which will give you the reversed unicode string.

[0]: http://www.unicode.org/reports/tr29/#Table_Combining_Char_Se...

sltkr4y ago· 3 in thread

So what was the deal with the Scottish flag?

gsnedders4y ago

From Wikipedia:

> A separate mechanism (emoji tag sequences) is used for regional flags, such as England 󠁧󠁢󠁥󠁮󠁧󠁿, Scotland 󠁧󠁢󠁳󠁣󠁴󠁿, Wales 󠁧󠁢󠁷󠁬󠁳󠁿, Texas 󠁵󠁳󠁴󠁸󠁿 or California 󠁵󠁳󠁣󠁡󠁿. It uses U+1F3F4 WAVING BLACK FLAG and formatting tag characters instead of regional indicator symbols. It is based on ISO 3166-2 regions with hyphen removed and lowercase, e.g. GB-ENG → gbeng, terminating with U+E007F CANCEL TAG. Flag of England is therefore represented by a sequence U+1F3F4, U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F.

ghostly_s4y ago

This was the only part that was surprising to me, and as it turns out my surprise mostly stems from still not really understanding how the United Kingdom works.

dhosek4y ago

Most flags use the ISO 2-character country code to access their values. However, some flags don't map to 2-character country codes (Scotland being one example). In this case it uses the sequence black flag, GBSCT (for Great Britain-Scotland, represented using the tag latin small letter codes for the letters) then cancel tag to end the sequence. Changing the middle five to be GBENG gives the English flag and GBWLS gives the Welsh flag.
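The black flag + tag letters + cancel tag scheme described here can be sketched in a few lines. The function name `subdivision_flag` is hypothetical; the code point values are the ones quoted from Wikipedia in this thread (tag characters sit at U+E0000 plus the ASCII value). Whether the result actually renders as a flag depends on the font and platform.

```python
BLACK_FLAG = "\U0001F3F4"  # U+1F3F4 WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at U+E0000 + code


def subdivision_flag(region: str) -> str:
    """Build an emoji tag sequence from an ISO 3166-2 code such as 'GB-SCT'.

    Hypothetical helper; it does not check that the region has a flag.
    """
    letters = region.replace("-", "").lower()
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in letters)
    return BLACK_FLAG + tags + CANCEL_TAG


england = subdivision_flag("GB-ENG")
scotland = subdivision_flag("GB-SCT")
wales = subdivision_flag("GB-WLS")
```

Each of these is seven code points long, so a code point reversal puts the cancel tag first and the black flag last, which no renderer will treat as a flag.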

dhosek4y ago· 2 in thread

On the challenge front, there are things like á which might be a single code point or two code points (a+´). Then there are the really challenging things like ᾷ where if the components are individual characters, the order of ͺ and ῀ are not guaranteed to be consistent.

happytoexplain4y ago

Which is why these APIs should always make normalization available: https://unicode.org/reports/tr15/
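For what it's worth, Python exposes the TR15 normalization forms through the standard `unicodedata` module. A small sketch of the two cases mentioned above (á as one or two code points, and ᾷ with its marks in either order):

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # a + COMBINING ACUTE ACCENT

# Different code point sequences, but equal once normalized.
same_before = precomposed == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# ᾷ spelled out with its two combining marks in either order.
one_order = "\u03b1\u0342\u0345"    # alpha, perispomeni, ypogegrammeni
other_order = "\u03b1\u0345\u0342"  # alpha, ypogegrammeni, perispomeni
orders_match = (
    unicodedata.normalize("NFC", one_order)
    == unicodedata.normalize("NFC", other_order)
)
```

Normalization reorders combining marks into a canonical order before composing, which is why the two spellings of ᾷ compare equal afterwards.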

saltminer4y ago

Then you have stuff like zalgo text (http://eeemo.net/) which takes pride in abusing code points

codezero4y ago· 2 in thread

You also can't URL Encode a string (In JS at least) if you truncate an emoji at the beginning or end of it.

account424y ago

URL Encoding works on bytes and does not concern itself with the character encoding of those bytes (except assuming that it is an ASCII superset) so this is only a limitation of the JS implementation.
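To illustrate the "works on bytes" point in Python rather than JS: `urllib.parse.quote` accepts a bytes object, so a buffer cut off in the middle of a code point can still be percent-encoded. This is just a sketch of that behaviour, not a claim about what any particular JS engine does.

```python
from urllib.parse import quote

flag = "\U0001F1FA\U0001F1F8"   # regional indicators U + S
encoded = flag.encode("utf-8")  # 8 bytes, 4 per code point
truncated = encoded[:-1]        # cut in the middle of the last code point

# Percent-encoding is defined on bytes, so the truncated buffer still encodes.
percent = quote(truncated)
print(percent)
```

The result is no longer valid UTF-8 once decoded, but that is a problem for whoever interprets the bytes, not for the percent-encoding step.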

codezero4y ago

JS isn’t the only language that does this.

qwerty4561274y ago· 2 in thread

If the US flag is 2 special symbols saying US, why doesn't reversing it just produce the flag of the Soviet Union?

NullPrefix4y ago

Same reason why there is no Nazi Germany flag - they are not included in Unicode.

qwerty4561274y ago

Quite unfortunate - the US/SU case would make a nice Unicode+political pun.

smegsicle4y ago· 2 in thread

did they think all those skintone emojis are individual codepoints?

daveslash4y ago

When I first realized that the skin tone emojis were a code-point + a color code-point modifier, I tried to see what other colors there were and if I could apply those to other emojis. The immature child in me looked to see if there was a red color code point and if so, could I use it to make a "blood poop" emoji. Turns out.... no.

advisedwang4y ago

They might have thought that `reverse()` had some kind of unicode-aware handling. I believe `upper()`/`lower()` do.
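A quick sketch of the difference in Python: case mapping does follow Unicode rules (it can even change the length of the string), while slicing with `[::-1]` only reverses code points.

```python
word = "stra\u00dfe"     # "straße"

upper = word.upper()     # ß has no single uppercase code point here
reversed_word = word[::-1]
```

So `upper()` turns six code points into seven, but the reversal is purely positional and knows nothing about clusters.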

zarzavat4y ago· 1 in thread

This reminds me of an interesting bug I saw where I was seeing a strange flag in some Arabic text. However when I copied the string and pasted it into a text editor, the flag of Saudi Arabia appeared instead (which made much more sense). After some vexillologic research on Wikipedia I identified the original flag as American Samoa and it suddenly all made sense. Turns out some broken RTL support was flipping the SA into AS at presentation.

zarzavat4y ago

After writing this comment I did some more research. Apparently this is actually a bug in Chrome itself (!).

https://bugs.chromium.org/p/chromium/issues/detail?id=127243...

tl4y ago· 1 in thread

This is a nice dive into limitations in Python's unicode handling and at the end, how to work around some problems. But you could use languages with proper unicode support like Swift or Elixir (weirdly HN is fighting flags in comment code which makes examples harder to demonstrate).

anamexis4y ago

HN doesn't allow any emoji.

progbits4y ago· 1 in thread

Semi-related (about length of emoji "characters", not reversing): https://hsivonen.fi/string-length/

Previously discussed:

https://news.ycombinator.com/item?id=20914184

https://news.ycombinator.com/item?id=26591373

As for this article & Python - as usual it is biasing towards convenience and implicit behavior rather than properly handling all edge cases.

Compare with Rust where you can't "reverse" a string - that is not a defined operation. But you can either break it into a sequence of characters or graphemes and then reverse that, with expected results: https://play.rust-lang.org/?version=stable&mode=debug&editio...

(Sadly the grapheme segmentation is not part of standard library, at least yet)

account424y ago

> Sadly the grapheme segmentation is not part of standard library, at least yet

Seeing as grapheme segmentation is a moving target that only makes sense.

cmyr4y ago· 1 in thread

Something I haven't seen mentioned yet is one of the most annoying things about regional indicator symbols, which is that interpreting them correctly requires arbitrary backtracking, and handling this correctly is very annoying for things like text fields.

Basically: A single, unpaired RIS counts as a single grapheme. Similarly, a pair of RIS count as a single grapheme. Now imagine your cursor is after an RIS and you arrow backwards (assuming LTR text: your cursor is to the right of an RIS, and you press the left arrow). Your textbox should now move the cursor to the left by one grapheme. How do you figure out where this is, in code units? You basically have to scan backwards until you find the first non-RIS codepoint, and then you have to match them up into pairs to figure out if your left-arrow movement should correspond to a movement of one or two codepoints.
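A minimal sketch of that backward scan, working in code points rather than code units to keep it short. The names `is_ri` and `cursor_left` are made up for this example, and it only deals with regional indicators; a real text field has to handle every other kind of cluster too.

```python
def is_ri(ch: str) -> bool:
    """True for REGIONAL INDICATOR SYMBOL LETTER A..Z (U+1F1E6..U+1F1FF)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def cursor_left(text: str, pos: int) -> int:
    """Return the cursor index (in code points) after one left-arrow press.

    Hypothetical helper: only regional indicators get special treatment,
    everything else moves by a single code point.
    """
    if pos <= 0:
        return 0
    if not is_ri(text[pos - 1]):
        return pos - 1
    # Scan back to the start of the run of regional indicators.
    run_start = pos - 1
    while run_start > 0 and is_ri(text[run_start - 1]):
        run_start -= 1
    # Indicators pair up from the start of the run, so the parity of the
    # count before the cursor says whether we are just past a full flag.
    count = pos - run_start
    return pos - 2 if count % 2 == 0 else pos - 1


# "x", then US, then CA, then one unpaired indicator D
sample = "x\U0001F1FA\U0001F1F8\U0001F1E8\U0001F1E6\U0001F1E9"
```

The cost of the scan grows with the length of the flag run, which is the arbitrary backtracking being complained about.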

This is a longstanding source of bugs, and if you're bored you can play around with pasting a huge sequence of flags into a textfield and then trying to navigate around it with the arrow keys/mouse. There are some broken implementations out there.

edit: while I'm thinking about this I will point out that an alternative design, which would have solved this problem (and which was first pointed out to me by @raphlinus) would have been to have two separate sets of RI symbols, one for 'first position' and one for 'second position'; then you could always determine the appropriate cursor position without needing context. Isn't hindsight a wonderful thing?

account424y ago

> Isn't hindsight a wonderful thing?

Gladly, the creators of UTF-8 did have that foresight so at least we don't have this problem at the code unit -> code point level.

qqii4y ago· 1 in thread

> Challenge: How would you go about writing a function that reverses a string while leaving symbols encoded as sequences of code points intact? Can you do it from scratch? Is there a package available in your language that can do it for you? How did that package solve the problem?

So are there any good libraries that can deal with code points that are merged together into a single pictographic and reverse them "as expected"?

da12OP4y ago

If you're using Python, check out grapheme: https://github.com/alvinlindstam/grapheme

ineedasername4y ago· 1 in thread

It's an emoji... Are there any emojis with only one character? My assumption going in would be that any emoji is > 1 character. Admittedly, despite lots of string processing, I never have to deal with emojis so I guess I'm not sure.

An interesting exercise would be emoji detection during string reversal to preserve the original emoji. I thought something like that would be the crux of the article.

Am I wrong about single character emojis?

easrng4y ago

It depends what you mean by character, there are lots of single codepoint emojis though.

faebi4y ago· 1 in thread

Why reverse them if one barely can implement, display and edit them correctly. I never could make them work perfectly in VIM. Also I had to open a bug in Firefox recently:

Flag emojis and others are displayed in double the size on Windows 10 using Firefox Nightly https://bugzilla.mozilla.org/show_bug.cgi?id=1746795

easrng4y ago

Windows doesn't even have flag emojis, they just show up as the country code.

Edit: Actually Firefox ships a copy of twemoji for fallback purposes, so flags will still render.

jug4y ago· 1 in thread

I'm not surprised the flag had two components, but I _was_ surprised the US flag was made by literally U and S, haha!

I definitely thought it'd be something like [I am a Flag] and [The flag ID between 0 and 65535]. And reversing it would be [Flag ID] + [I am a Flag] which would not be a defined "component" and instead rendered as the individual two nonsense characters.

andylynch4y ago

You might also have noticed this is partly a very well thought out hack to make Unicode less sensitive to disagreements and changes in consensus on which flags are encoded, or even the names of the countries concerned!

bandyaboot4y ago· 1 in thread

Would be interesting to see the list of flag emojis that, when reversed, become a different flag emoji.

jfk134y ago

There are plenty of country codes that when reversed become a different, valid country code: e.g. Israel (IL) when reversed is Liechtenstein (LI); Australia (AU) becomes Ukraine (UA).

Whether "reversing flag emojis" causes such transformations will depend on what is meant by "reversing", which is kind of the whole point here: there are a number of possible interpretations of "reverse".

demetrius4y ago· 1 in thread

It's sad that Unicode doesn't include flags for dissolved countries. If it did, reversing a US flag would make a Soviet Union flag (code SU). This would make the text much more fun.

kragen4y ago

The whole reason for handling the flag emojis that way was so that the Unicode Consortium wouldn't have to decide which countries should or should not be recognized. It is totally valid for you to configure your computer to display SU as a Soviet flag.

architectdrone4y ago· 1 in thread

humorously, on my local machine, I only see the string "us", and was rather confused when he was asserting that it was a single character :D

kingcharles4y ago

You're on Windows? Windows doesn't render flag emojis as flags.

chrismorgan4y ago

UTF-8 does not represent Unicode code points, but rather Unicode scalar values. The difference between the two is surrogates, the way that UTF-16 ruined Unicode: code points are 0₁₆ to 10FFFF₁₆, scalar values are 0₁₆ to D7FF₁₆ and E000₁₆ to 10FFFF₁₆. Yes, the author quoted Wikipedia, but Wikipedia is wrong on this point; surprisingly comprehensively wrong: the UTF-8 page completely ignores the distinction, and even the page on code points doesn’t mention scalar values! This error propagates to other places, too: for example, “and there are a total of 1,112,064 possible code points”: no, that’s how many scalar values there are; code points also include the 2,048 surrogates, so there are 1,114,112 code points.
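The arithmetic behind those numbers is easy to check, and Python happens to make the distinction visible: a `str` can hold a lone surrogate (a code point that is not a scalar value), but the strict UTF-8 codec refuses to encode it. A small sketch:

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)  # D800..DFFF inclusive

code_points = MAX_CODE_POINT + 1
scalar_values = code_points - len(SURROGATES)

# A lone surrogate is a code point but not a scalar value, so strict
# UTF-8 encoding rejects it.
try:
    "\ud800".encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```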

Mesopropithecus4y ago

Unfortunately the HN text input won't let me do this, but a funny starter for the article would have been this:

'(Spanish flag)'[::-1]

basically ''.join([chr(127466), chr(127480)]) vs. ''.join([chr(127466), chr(127480)])[::-1]

I'll add this to my collection of party tricks and show myself out.

Cool article!
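For anyone who wants to check the party trick without a flag-capable font, here is a small sketch that maps regional indicators back to plain letters. `flag_letters` is a made-up helper name; the two `chr()` values are the ones from the comment above.

```python
RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_letters(flag: str) -> str:
    """Map each regional indicator in `flag` back to its ASCII letter.

    Hypothetical helper; assumes every character is a regional indicator.
    """
    return "".join(chr(ord(ch) - RI_BASE + ord("A")) for ch in flag)


spain = "".join([chr(127466), chr(127480)])
forwards = flag_letters(spain)
backwards = flag_letters(spain[::-1])
```

Reversing the two code points turns ES into SE, which is the country code for Sweden.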

Crazyontap4y ago

This section on the linked Wikipedia article(1) is quite amazing on how the family emoji is rendered using a zero-width joiner

(1) https://en.wikipedia.org/wiki/Emoji#Joining

edit: forgot HN doesn't render emojis. Better read it directly on Wikipedia i guess.

jiveturkey4y ago

Interesting article. Written for beginners, conversationally. Has excessive amounts of whitespace, for "readability" I guess. But at the same time it dives quite deep, and I don't think this "style" of presentation matches the amount of time a more novice reader is going to devote to a single long-form article.

As to the content, for all the deep dive, a simple link to https://unicode.org/reports/tr51/#Flags and what an emoji is, would have saved so much exposition. I also wish he'd touched on normalization. With the amount of time he's demanding from readers he could have mentioned this important subject. Because then he could discuss why (starting from his emoji example) a-grave (à) might or might not be reversible, depending how the character is composed.

Also wish he'd pointed to some libraries that can do such reversals.

utopcell4y ago

There are unicode characters that reverse parsing order themselves. This has been the basis of a code injection attack, analyzed in [1].

[1] ``Trojan Source: Invisible Vulnerabilities'': https://trojansource.codes/trojan-source.pdf

uniqueuid4y ago

Upper and lower codepoints are really way too obscure and can create issues you didn't even know you had.

I once had the very unpleasant experience of debugging a case where data saved with R on windows and loaded on macOS ended up with individually double-encoded codepoints.

Not fun.

techwiz1374y ago

It's pretty funny that reversing the American flag yields Soviet Union(SU).

sundarurfriend4y ago

Julia docs do a (surprisingly) good job of being clear and explicit about this: the docstring for `reverse(AbstractString)` says:

> Reverses a string. Technically, this function reverses the codepoints in a string and its main utility is for reversed-order string processing [...]. See also [...] `graphemes` from module Unicode to operate on user-visible "characters" (graphemes) rather than codepoints.

Properly reversing a string of flags (or any other grapheme clusters) is just a `using Unicode: graphemes` away.

mappu4y ago

If you like this, you may also like why len(emoji) is still not 1 in Python 3 despite all the unicode breakage: https://storytime.ivysaur.me/posts/grapheme-clusters/

I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.

a_c4y ago

Understanding unicode would make the question more obvious

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

Waterluvian4y ago

I wish languages did a far better job clearly distinguishing between their operations:

1. You are acting in byte space and it’s pretty unambiguous what should happen. We are not acknowledging the semantics of language and alphabets.

2. You’re acting in language space and these operations will behave the way you probably think they should (depending on your cultural expectations, probably)

codingkev4y ago

Yes, this allows for easy building of flag emojis as long as you know the ISO 3166 two-letter country code.

Example: https://github.com/kennell/flagz/blob/master/flagz.py
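A minimal sketch of the idea (not taken from the linked repository, and `flag_emoji` is a hypothetical name): each ASCII letter is shifted into the regional indicator block, which starts at U+1F1E6 for "A". It does not check that the code is actually assigned in ISO 3166.

```python
RI_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_emoji(country_code: str) -> str:
    """Turn a two-letter country code into a pair of regional indicators.

    Hypothetical helper; any two ASCII letters are accepted, valid code or not.
    """
    if len(country_code) != 2 or not (country_code.isascii() and country_code.isalpha()):
        raise ValueError("expected two ASCII letters, e.g. 'US'")
    return "".join(chr(RI_A + ord(ch) - ord("A")) for ch in country_code.upper())


us = flag_emoji("us")
```

Whether the pair shows up as a flag or as two boxed letters is up to the font and platform, as several comments here point out.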

michaelsbradley4y ago

See chapter 7 in Hacking the Planet (with Notcurses) for a short treatment of encodings, extended grapheme clusters, etc.

https://nick-black.com/htp-notcurses.pdf#page53

raffy4y ago

Kinda related: I am developing a library for ENS (Ethereum Name Service) name normalization: https://github.com/adraffy/ens-normalize.js

I'm trying to find the best combination of UTS-46, UTS-51, UTS-39, and prior work on IDN resolution w/r/t confusables: https://adraffy.github.io/ens-normalize.js/test/report-confu...

Personally, I found the Unicode spec very messy. Critical information is all over the place. You can see the direct effect of this when you compare various packages across different languages and discover that every library disagrees in multiple places. Even JS String.normalize() isn't consistent in the latest version of most browsers: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht... (fails in Chrome, Safari)

The major difference between ENS and DNS is emoji are front and center. ENS resolves by computing a hash of a name in a canonicalized form. Since resolution must happen decentralized, simply punting to punycode and relying on custom logic for Unicode-handling isn't possible. On-chain records are 1:1, so there's no fuzzy matching either. Additionally, ENS is actively registering names, so any improvement to the system must preserve as many names as possible.

At the moment, I'm attempting to improve upon the confusables in the Common/Greek/Latin/Cyrillic scripts, and will combine these new groupings with the mixed-script limitations similar to IDN handling in Chromium.

Interactive Demo: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...

Also this emoji report is pretty cool: https://adraffy.github.io/ens-normalize.js/test/report-emoji...

jupp0r4y ago

With all the criticism I normally have for Rust, I must say that its type safe handling of UTF-8 and its unambiguous distinction between byte strings and UTF-8 strings are extremely helpful in handling situations mentioned in the article correctly (and also efficiently).

Yes it's a pain, but the way the standard library designed its types forces you to handle conversions correctly, for example when byte arrays are converted to UTF-8 strings and may contain invalid UTF-8 sequences.

aidenn04y ago

> The answer is: it depends. There isn't a canonical way to reverse a string, at least that I'm aware of.

Unicode defines grapheme clusters[1] that represent "user-perceived characters" separating a string into those and reversing seems like a pretty good way to go about it.

1: http://www.unicode.org/reports/tr29/

nextstep4y ago

Compare all of this nonsense to how it’s done in Swift. String APIs in Swift are great: intuitive and do what you expect.

alfredxing4y ago

Related — I did a deep dive a couple years ago on emoji codepoints and how they're encoded in the Apple emoji font file, with the end goal of extracting the embedded images — https://github.com/alfredxing/emoji

zwerdlds4y ago

In normal conditions you can check for a ZWJ, but with regional coding chars, you would have to consider the regional chars block as a single char in the reversal. Given that it isn't necessarily locale dependent but presentation layer dependent, there might not be enough info to decide how to act.

zanzibar7354y ago

Of course you can reverse a string with a flag emoji. You just need to treat a "string" as a collection of Extended Grapheme Clusters, and then you reverse the order of the EGCs. So if the string is `a<flag unicode bytes>b`, the output should be `b<flag unicode bytes>a`.
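Python's standard library has no grapheme segmentation, so the following is only a rough approximation of that approach and NOT an implementation of the UAX #29 rules. It assumes that pairing regional indicators and gluing on combining marks, ZWJ sequences, variation selectors, skin tone modifiers and tag characters is good enough for a demo; `rough_clusters` and `reverse_clusters` are hypothetical names. For real use, a library that tracks the Unicode data files is the safer choice.

```python
import unicodedata

ZWJ = "\u200d"


def _is_ri(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF  # regional indicators


def _extends(ch: str) -> bool:
    """Rough stand-in for 'this code point attaches to the previous one'."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks
        or ch == ZWJ
        or 0xFE00 <= cp <= 0xFE0F      # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF    # emoji skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F    # tag characters (subdivision flags)
    )


def rough_clusters(text: str) -> list:
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    i = 0
    while i < len(text):
        j = i + 1
        if _is_ri(text[i]) and j < len(text) and _is_ri(text[j]):
            j += 1  # two regional indicators form one flag
        while j < len(text):
            if _extends(text[j]) or text[j - 1] == ZWJ:
                j += 1  # attach marks, and whatever follows a ZWJ
            else:
                break
        clusters.append(text[i:j])
        i = j
    return clusters


def reverse_clusters(text: str) -> str:
    """Reverse the order of clusters, keeping each cluster intact."""
    return "".join(reversed(rough_clusters(text)))


us_flag = "\U0001F1FA\U0001F1F8"
result = reverse_clusters("a" + us_flag + "b")
```

With this, `a<flag>b` comes back as `b<flag>a`, and a decomposed "año" keeps its tilde on the n. Cases such as CR LF, Hangul jamo and Indic conjuncts are deliberately left out.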

heystefan4y ago

Oooh I know this one, I've read it here last year: https://tonsky.me/blog/emoji/

ezfe4y ago

Works in Swift, which is the benefit of Swift having the most painful String API possible:

    let v = "Flag: "
    String(v.reversed()) // Output: :galF
    v.count // Output: 7

nitely4y ago

You can, but you need to break the string into graphemes first.

kart234y ago

Google also interprets emojis funny. Google the Estonian and South Sudan flag (f"{chr(127466)*2+chr(127480)*2}") and you get results for Spain.

randpx4y ago

Try reversing the Canadian flag (CA) and you get the Ascension Island Flag (AC). Great article, but completely misses the point.

ts4z4y ago

Let me cheat a bit and say Unicode comes in three flavors: UTF-8, UCS-2 aka UTF-16, and UTF-32. UTF-8 is byte-oriented, UTF-16 is double-byte oriented, and UTF-32 nobody uses because you waste half the word almost all of the time.

You can't reverse the bytes in UTF-8 or UTF-16, because you'll scramble the encoding. But you could parse the string, codepoint-at-a-time, handling the specifics of UTF-8, or UTF-16 with its surrogate pairs, and reverse those. This sounds equivalent to reversing UTF-32, and I believe is what the original poster was imagining.

Except you can't do that, because Unicode has composing characters. Now, I'm American and too stupid to type anything other than ASCII, but I know about n+~ = ñ. If you have the pre-composed version of ñ, you can reverse the codepoint (it's one codepoint). If you don't have it, and you have n+dead ~, you can't reverse it, or in the word "año" you might put the ~ on the "o". (Even crazier things happen when you get to the ligatures in Arabic; IIRC one of those is about 20 codepoints.)

So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about the even more exotic stuff like Emoji + skin tone. It's necessary to be very careful.
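The "año" case is easy to reproduce in Python. A small sketch, using the combining tilde U+0303 as the "dead ~":

```python
import unicodedata

decomposed = "an\u0303o"    # "año" written as a, n, COMBINING TILDE, o
naive = decomposed[::-1]    # code point reversal: o, tilde, n, a

# Composing first (NFC) turns n + tilde into the single code point ñ,
# which survives a code point reversal. This only helps when a
# precomposed form exists.
composed = unicodedata.normalize("NFC", decomposed)
composed_reversed = composed[::-1]
```

In `naive` the tilde now follows the "o", so it renders on the wrong letter; normalizing first avoids that here, but it is not a general fix.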

Now, the old fart in me says that ASCII never had this problem. But the old fart in me knows about CRLF in text protocols, and that's never LFCR; and that if you want to make a ñ in ASCII you must send n ^H ~. I guess you can reverse that, but if you want to do more exotic things it becomes less obvious.

(IIRC UCS-2 is the deadname, now we call it UTF-16 to remind us to always handle surrogate pairs correctly, which we don't.)

TLDR: Strings are hard.

mlindner4y ago

The person tries to define character when there isn't actually any definition of what that even means. Character is a term limited to languages that actually use them and not all text is made up of characters.

nottorp4y ago

So basically unicode along with c++ are great job security if you do bother to learn them.

There's another word that comes to mind when thinking about those two: metastasis.

hougaard4y ago

In other news, water is wet :)

exdsq4y ago

Am I missing something or is this Day 1 of a programming course in C?

midjji4y ago

And this is why char should have been byte from the start.

coreyp_14y ago· 15 in thread

If you think the Unicode flag emoji take a lot of bytes, then consider the family emoji! (https://unicode.org/emoji/charts/full-emoji-list.html#family)

Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!

josephg4y ago

Handling unicode can be fine, depending on what you're doing. The hard parts are:

- Counting, rendering and collapsing grapheme clusters (like the flag emoji)

- Converting between legacy encodings (shiftjis, ko8, etc) and UTF-8 / UTF-16

- Canonicalization

If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.

By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.

tialaramex4y ago

> The only real unicode support in std is utf8 validation for strings.

Gigachad4y ago

It always feels like the most amount of work goes to the least used emoji. So many revisions and additions to the family emoji and yet it’s one of the ones I don’t recall anyone ever using.

I think the trap Unicode got in to is technically they can have infinite emoji so they just don’t ever have a way to say no to new proposals.

masklinn4y ago

> It always feels like the most amount of work goes to the least used emoji.

> I think the trap Unicode got in to is technically they can have infinite emoji so they just don’t ever have a way to say no to new proposals.

laumars4y ago

They do say no though. Frequently too.

The problem with Unicode is simply that it’s trying to solve a very hard problem.

jonas214y ago

Yes, this adds a lot of complexity, but it's really a question of whether that complexity is justified in order to support all of the world's languages. And I think many would argue that it is.

[1] https://en.wikipedia.org/wiki/Zero-width_joiner

Vindicis4y ago

DecoPerson4y ago

Do you have a YouTube for people to subscribe to in anticipation of you releasing your YouTube series about your work? The development processes of new languages is so intriguing.

coreyp_14y ago

I'll post about it here on HN after I have a few episodes up.

dagmx4y ago

It would actually be pretty interesting to see how you use Bison and Flex with utf-8. Most resources say to not bother due to lack of support for Unicode, but they're so ubiquitous

account424y ago

amelius4y ago

Why is this stuff even reinvented for every programming language?

Isn't it about time that we have some common language that every other language builds on?

johndough4y ago

> Isn't it about time that we have some common language that every other language builds on?

That language is C. It is debatable whether it was a good choice, but at least this is how it turned.

IncRnd4y ago

So said every writer of a standard immediately before writing another standard to replace all others.

lmm4y ago

ICU is out there with bindings in many languages. People who know what they're doing use ICU.

jerf4y ago· 15 in thread

You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.

jameshart4y ago

I'd go further and argue that in general reversing a string isn't possible or meaningful.

It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.

there's never really such a thing as a 'character limit'

jerf4y ago

Boy, that's implicitly a good question... when's the last time I "reversed" a string, on purpose, for something useful?

Someone4y ago

Even ASCII can be argued to be problematic.

What is “3 >= 2", reversed?

What is “Rijksmuseum”, reversed? (https://en.wikipedia.org/wiki/IJ_(digraph); capitalization isn’t simple here, either (https://en.wikipedia.org/wiki/IJ_(digraph)#Capitalisation)

What is “Schroeder”, reversed? (https://en.wikipedia.org/wiki/Diaeresis_(diacritic)#Printing...)

Spivak4y ago

There is a solution to this which is to compute the list of grapheme clusters, and reverse that.

https://unicode.org/reports/tr29/

akersten4y ago

> imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.

I really highly doubt it.

How do you reverse this?: مرحبًا ، هذه سلسلة.

chrismorgan4y ago

UAX #29 is insufficient: at the very least, you must depend on collation too.

The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.

lloeki4y ago

Should it reverse a BOM as well or keep it first?

viktorcode4y ago

You certainly can. `print(String(flag.reversed()))` in Swift reverses emojis correctly.

account424y ago

How does it handle the ASCII examples in https://news.ycombinator.com/item?id=30108184

And more importantly: What is the use case for a reversed string?

paxys4y ago

jerf4y ago

"It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible."

It depends on your point of view. From a strict point of view, it does exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.

jcelerier4y ago

> It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.

greenyoda4y ago

> "reversing a string" is no longer possible or meaningful.

happytoexplain4y ago

spicybright4y ago

Sure it is, just render the same string in right to left!

emodendroket4y ago· 10 in thread

AlanYx4y ago

emodendroket4y ago

account424y ago

Is having skin tone variants (which is something Unicode chose to add, rather than something added because of existing use) consistent with not having distinct variants for glyphs from different languages?

fomine34y ago

Han unification was an attempt to fit CJK characters into the 16-bit BMP. In the end the BMP proved too small, so the unification is now pointless, but reverting it would also cause huge compatibility issues.

emodendroket4y ago

Of course, the old characters must be left alone. But I'm not seeing what stops them from introducing new ones.

digisign4y ago

> were replaced with "equivalent" Greek or Cyrillic one

The subset of equivalent letters, or different ones? If they looked the same, it wouldn't bother me if the letters in the center were a single codepoint between European languages:

https://upload.wikimedia.org/wikipedia/commons/8/84/Venn_dia...

account424y ago

I am disappointed that that diagram omits ꙮ [0]

[0] https://en.wikipedia.org/wiki/Multiocular_O

emodendroket4y ago

shalmanese4y ago

It doesn't make sense but there's also no way to fix it now. Once the Han characters were unified, there's no trivial way to ununify them.

emodendroket4y ago

To an extent that's true, but introducing national variant characters in addition to the unified ones would at least allow careful writers to avoid the problem.

treesknees4y ago· 8 in thread

But you can, and did, reverse a string. It seems you would need more details, such as a request to reverse the meaning or interpretation of the string, which is what the author is getting at.

If someone challenges you to reverse an image, what do you do? Do you invert the colors? Mirror horizontally? Mirror vertically? Just reverse the byte order?

wahern4y ago

Case in point: a "struct" in languages like C and Rust is literally a specification of how to treat segments of a "string" of contiguous bytes.

avianlyric4y ago

In languages like C, “string” isn’t a proper data structure; it’s a `char` array, which itself is little more than an `int` array or `byte` array.

shadowgovt4y ago

Even the most basic ASCII string is still a data structure.

Is it a PASCAL string (length byte followed by data) or a C string (arbitrary run of bytes terminated by a null character)?

jameshart4y ago

Yep, it's as meaningful a programming task as 'reverse this double-precision float'.

samatman4y ago

We would all be better off if this were actually true.

Tragically, in C, a string is just barely a data structure, because it must have \0 at the end.

If it were the complete absence of a data structure, we would need some way to get at the length of it, and could treat a slice of it as the same sort of thing as the thing itself.

egypturnash4y ago

Galaxy brain image reversal: completely redraw it from scratch, with a viewpoint 180º from the original.

McBeige4y ago

If the FoV is less than 180deg then any image would be a realistic solution as long as it doesn't depict anything from the original.

ravi-delia4y ago

New computer vision challenge

yoyohello134y ago· 6 in thread

Maybe I'm missing some prerequisite knowledge here, but why would I assume `flag="us"` is an emoji? Looking at that first block of code, there is no reason for me to think "us" is a single character.

Edit: Turns out my browser wasn't rendering the flags.

happytoexplain4y ago

In Windows Chrome, it doesn't render the emoji for me. In Android Chrome, it renders a flag emoji - not the raw region indicators (which look like the letters "u" and "s").

greenyoda4y ago

In my browser (Firefox on Windows), the thing between the quotes in the first block of code looks like a picture of the US flag cropped to a circle, not like the characters "us".

yoyohello134y ago

ljm4y ago

If it's Windows, it doesn't actually use flags for those emojis, it renders a country code instead. If it wasn't supported you would just see the glyph for an unknown character.

kingcharles4y ago

Do you have a citation for that? I suspected it was because of the political issues, so I tried hunting down the reason one day and came up blank.

Benlights4y ago

I had the same issue when I read the article, I kept on getting stuck and asking myself what I was missing.

otagekki4y ago· 6 in thread

If flag emojis are really a combination of 2 special characters, the reversal of the U.S. flag should result in having the Soviet Union flag.

masklinn4y ago

> the reversal of the U.S. flag should result in having the Soviet Union flag.

Except it has been deleted from the ISO 3166-1 registry, so not having it is perfectly valid (arguably more so than having it).

account424y ago

No, that only shows that the ISO 3166-1 registry is a bad basis for Unicode flags, since having things lose meaning over time should not be acceptable for a text encoding.

brewmarche4y ago

Just tried reversing a Spanish flag with Python and indeed I got Sweden back
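For anyone who wants to reproduce this, here is a minimal sketch. The variable names are mine; the only assumption is that the flag is stored as the two regional indicator code points for E and S.

```python
import unicodedata

# The Spanish flag is the pair of regional indicators E + S.
spain = "\U0001F1EA\U0001F1F8"
# The Swedish flag is the same two code points in the opposite order.
sweden = "\U0001F1F8\U0001F1EA"

print(unicodedata.name(spain[0]))  # REGIONAL INDICATOR SYMBOL LETTER E
print(len(spain))                  # 2
print(spain[::-1] == sweden)       # True
```

Slicing with `[::-1]` reverses code points, not rendered glyphs, which is why one flag turns into the other.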

kingcharles4y ago

No-one expects the Swedish flag!

TonyTrapp4y ago

jameshart4y ago

I was so disappointed that didn't turn out to be the case.

happytoexplain4y ago· 6 in thread

I guessed that it would become the USSR flag (US -> SU), but apparently Unicode doesn't define that one! I wonder why. That would have been humorous.

ts4z4y ago

IIRC Unicode doesn't define country codes. It was a workaround for a political issue of which countries recognize which other countries.

It would have been difficult to get the CN delegation to sign off on a list that contained TW, although there are probably others.

andylynch4y ago

There are many more than I realised - Wikipedia has a decent list https://en.m.wikipedia.org/wiki/List_of_states_with_limited_...

bloak4y ago

Is there a way of getting the Czechoslovakian flag as an emoji? And did Serbia and Montenegro get round to making a flag?

happytoexplain4y ago

Ah, I didn't realize they reused codes from ISO 3166-3. I figured, because they keep these regions around in their own set, that was some implication that the codes would not be reused.

chungy4y ago

Unicode doesn't define any flags, really. That's up to the font rendering on systems/libraries.

happytoexplain4y ago

kevin_thibedeau4y ago· 5 in thread

jug4y ago

But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.

kingcharles4y ago

Flags are political. Microsoft has removed country borders from its products for political reasons, a post above says the flags rendering was excluded for the same reason.

masklinn4y ago

> But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.

Flags are not that hard, they're a very specific block combining in very predictable way. They're little more than ligatures. Family emoji are much harder.

And this is not "post-Unicode" in any way.

uniqueuid4y ago

Thanks for that interesting detail!

If such re-purposing continues, it might be easier to go straight to utf-32 for some use cases.

dhosek4y ago

Beldin4y ago· 4 in thread

Interestingly, on my phone the so-called flag is not a flag at all, but "US" in outline.

So python behaves as expected: the 2 character string, when reversed, becomes "SU". Similar stuff happens with the other "flag" strings.

I'm sure emojis in my phone are outdated. I'm not sure how that affects whether I see a flag or letters.

pilsetnieks4y ago

Thankfully, there isn't an assigned ISO 3166-1 2-letter country code for SU currently; people may have interesting reactions seeing what happens when reversing a US flag emoji if there were.

easrng4y ago

If this was 1990 (and we somehow had the current emoji standard) SU would be the USSR flag.

kingcharles4y ago

Out of interest, what phone and browser? The only platform I've seen that doesn't render the flags is Windows.

Beldin4y ago

An android phone from 2014, with a year out of date chrome.

To update chrome, I'd have to give it permission to access my contacts. That ain't happening. (Phone OS is too old for per-app permissions)

WA9ACE4y ago· 3 in thread

I feel like I'm obligated to share this almost 20 year old Spolsky post that gave me my understanding of characters.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

xmprt4y ago

In that same vein, here's my introduction to Unicode about 10 years ago from Tom Scott.

https://www.youtube.com/watch?v=MijmeoH9LT4

ciupicri4y ago

That's more about the UTF-8 encoding than Unicode itself.

zerox7felf4y ago

xmprt4y ago· 3 in thread

happytoexplain4y ago

Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?

nitely4y ago

nitely4y ago

[0]: http://www.unicode.org/reports/tr29/#Table_Combining_Char_Se...

sltkr4y ago· 3 in thread

So what was the deal with the Scottish flag?

gsnedders4y ago

From Wikipedia:

ghostly_s4y ago

This was the only part that was surprising to me, and as it turns out my surprise mostly stems from still not really understanding how the United Kingdom works.

dhosek4y ago

dhosek4y ago· 2 in thread

happytoexplain4y ago

Which is why these APIs should always make normalization available: https://unicode.org/reports/tr15/
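As a rough illustration of what normalization buys you, here is a sketch using Python's standard `unicodedata` module. The sample strings are my own choice, not from the thread.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The two render identically but compare unequal as raw code points.
print(composed == decomposed)  # False

# NFC composes where possible, NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalizing both sides to the same form before comparing or hashing avoids this class of mismatch.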

saltminer4y ago

Then you have stuff like zalgo text (http://eeemo.net/) which takes pride in abusing code points

codezero4y ago· 2 in thread

You also can't URL Encode a string (In JS at least) if you truncate an emoji at the beginning or end of it.

account424y ago

codezero4y ago

JS isn’t the only language that does this.

qwerty4561274y ago· 2 in thread

If the US flag is 2 special symbols saying US, why doesn't reversing it just produce the flag of the Soviet Union?

NullPrefix4y ago

Same reason why there is no Nazi Germany flag - they are not included in Unicode.

qwerty4561274y ago

Quite unfortunate - the US/SU case would make a nice Unicode+political pun.

smegsicle4y ago· 2 in thread

did they think all those skintone emojis are individual codepoints?

daveslash4y ago

advisedwang4y ago

They might have thought that `reverse()` had some kind of unicode-aware handling. I believe `upper()`/`lower()` do.
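A quick check of that belief, as a sketch: Python's case mapping does follow Unicode rules, while slicing with `[::-1]` just reverses code points. (Python strings have no `reverse()` method, so the slice is the usual idiom.)

```python
# Case mapping is Unicode-aware: the German sharp s uppercases to two letters.
word = "stra\u00dfe"  # "straße"
print(word.upper())   # STRASSE

# Non-ASCII letters are lowercased too.
print("\u00c9COLE".lower() == "\u00e9cole")  # True

# Reversal is not Unicode-aware: the combining accent ends up in front of its base letter.
accented = "e\u0301"
print(accented[::-1] == "\u0301e")  # True
```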

zarzavat4y ago· 1 in thread

zarzavat4y ago

After writing this comment I did some more research. Apparently this is actually a bug in Chrome itself (!).

https://bugs.chromium.org/p/chromium/issues/detail?id=127243...

tl4y ago· 1 in thread

anamexis4y ago

HN doesn't allow any emoji.

progbits4y ago· 1 in thread

Semi-related (about length of emoji "characters", not reversing): https://hsivonen.fi/string-length/

Previously discussed:

https://news.ycombinator.com/item?id=20914184

https://news.ycombinator.com/item?id=26591373

As for this article & Python - as usual it is biasing towards convenience and implicit behavior rather than properly handling all edge cases.

(Sadly the grapheme segmentation is not part of standard library, at least yet)

account424y ago

> Sadly the grapheme segmentation is not part of standard library, at least yet

Seeing as grapheme segmentation is a moving target that only makes sense.

cmyr4y ago· 1 in thread

account424y ago

> Isn't hindsight a wonderful thing?

Gladly, the creators of UTF-8 did have that foresight so at least we don't have this problem at the code unit -> code point level.

qqii4y ago· 1 in thread

So are there any good libraries that can deal with code points that are merged together into a single pictographic and reverse them "as expected"?

da12OP4y ago

If you're using Python, check out grapheme: https://github.com/alvinlindstam/grapheme

ineedasername4y ago· 1 in thread

An interesting exercise would be emoji detection during string reversal to preserve the original emoji. I thought something like that would be the crux of the article.

Am I wrong about single character emojis?

easrng4y ago

It depends what you mean by character, there are lots of single codepoint emojis though.

faebi4y ago· 1 in thread

Why reverse them if one can barely implement, display and edit them correctly? I never could make them work perfectly in VIM. Also, I had to open a bug in Firefox recently:

Flag emojis and others are displayed in double the size on Windows 10 using Firefox Nightly https://bugzilla.mozilla.org/show_bug.cgi?id=1746795

easrng4y ago

Windows doesn't even have flag emojis, they just show up as the country code.

Edit: Actually Firefox ships a copy of twemoji for fallback purposes, so flags will still render.

jug4y ago· 1 in thread

I'm not surprised the flag had two components, but I _was_ surprised the US flag was made by literally U and S, haha!

andylynch4y ago

bandyaboot4y ago· 1 in thread

Would be interesting to see the list of flag emojis that, when reversed, become a different flag emoji.

jfk134y ago

There are plenty of country codes that when reversed become a different, valid country code: e.g. Israel (IL) when reversed is Liechtenstein (LI); Australia (AU) becomes Ukraine (UA).
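Since a flag is just its two-letter code spelled in regional indicators, finding such pairs comes down to reversing the codes. A small sketch, where `SAMPLE_CODES` is a hand-picked sample I made up for illustration and not the full ISO 3166-1 list:

```python
# Hypothetical sample of two-letter codes mentioned in this thread; not exhaustive.
SAMPLE_CODES = {"IL", "LI", "AU", "UA", "ES", "SE", "CA", "AC", "US"}

# Keep each pair once by only emitting it from its alphabetically smaller member.
reversible_pairs = sorted(
    (code, code[::-1])
    for code in SAMPLE_CODES
    if code[::-1] in SAMPLE_CODES and code < code[::-1]
)

print(reversible_pairs)
# [('AC', 'CA'), ('AU', 'UA'), ('ES', 'SE'), ('IL', 'LI')]
```

Swapping in a complete code list would answer the question for every flag; "US" drops out of this sample because "SU" is not in it.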

demetrius4y ago· 1 in thread

It's sad that Unicode doesn't include flags for dissolved countries. If it did, reversing a US flag would make a Soviet Union flag (code SU). This would make the text much more fun.

kragen4y ago

architectdrone4y ago· 1 in thread

humorously, on my local machine, I only see the string "us", and was rather confused when he was asserting that it was a single character :D

kingcharles4y ago

You're on Windows? Windows doesn't render flag emojis as flags.

chrismorgan4y ago

Mesopropithecus4y ago

Unfortunately the HN text input won't let me do this, but a funny starter for the article would have been this:

'(Spanish flag)'[::-1]

basically ''.join([chr(127466), chr(127480)]) vs. ''.join([chr(127466), chr(127480)])[::-1]

I'll add this to my collection of party tricks and show myself out.

Cool article!

Crazyontap4y ago

This section on the linked Wikipedia article(1) is quite amazing on how the family emoji is rendered using a zero-width joiner

(1) https://en.wikipedia.org/wiki/Emoji#Joining

edit: forgot HN doesn't render emojis. Better read it directly on Wikipedia i guess.

jiveturkey4y ago

Also wish he'd pointed to some libraries that can do such reversals.

utopcell4y ago

There are unicode characters that reverse parsing order themselves. This has been the basis of a code injection attack, analyzed in [1].

[1] ``Trojan Source: Invisible Vulnerabilities'': https://trojansource.codes/trojan-source.pdf

uniqueuid4y ago

Upper and lower codepoints are really way too obscure and can create issues you didn't even know you had.

I once had the very unpleasant experience of debugging a case where data saved with R on windows and loaded on macOS ended up with individually double-encoded codepoints.

Not fun.

techwiz1374y ago

It's pretty funny that reversing the American flag yields Soviet Union(SU).

sundarurfriend4y ago

Julia docs do a (surprisingly) good job of being clear and explicit about this: the docstring for `reverse(AbstractString)` says:

Properly reversing a string of flags (or any other grapheme clusters) is just a `using Unicode: graphemes` away.

mappu4y ago

If you like this, you may also like why len(emoji) is still not 1 in Python 3 despite all the unicode breakage: https://storytime.ivysaur.me/posts/grapheme-clusters/
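To make the `len` point concrete, here is a sketch using a family emoji written out as escapes. The particular man, woman, girl sequence is my choice of example.

```python
# Man + ZWJ + woman + ZWJ + girl: drawn as one family glyph where supported.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# len() counts code points, not user-perceived characters.
print(len(family))                  # 5
print(len(family.encode("utf-8")))  # 18

print([f"U+{ord(ch):04X}" for ch in family])
# ['U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467']
```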

I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.

a_c4y ago

Understanding unicode would make the question more obvious

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

Waterluvian4y ago

I wish languages did a far better job clearly distinguishing between their operations:

1. You are acting in byte space and it’s pretty unambiguous what should happen. We are not acknowledging the semantics of language and alphabets.

2. You’re acting in language space and these operations will behave the way you probably think they should (depending on your cultural expectations, probably)

codingkev4y ago

Yes, this allows for easy building of flag emojis as long as you know the ISO 3166 two-letter country code.

Example: https://github.com/kennell/flagz/blob/master/flagz.py
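The idea can be sketched in a few lines. This is my own illustration, not the code from the linked repository, and the function name `flag` is made up.

```python
# U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; B through Z follow in order.
REGIONAL_INDICATOR_A = 0x1F1E6


def flag(country_code: str) -> str:
    """Spell a two-letter code such as "US" with regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= letter <= "Z" for letter in code):
        raise ValueError(f"expected two ASCII letters, got {country_code!r}")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A")) for letter in code
    )


print(flag("US") == "\U0001F1FA\U0001F1F8")  # True
print(flag("es") == "\U0001F1EA\U0001F1F8")  # True
```

Whether the result is drawn as a flag, or as two letters, is up to the font and platform; the function does not check that the code is actually assigned.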

michaelsbradley4y ago

See chapter 7 in Hacking the Planet (with Notcurses) for a short treatment of encodings, extended grapheme clusters, etc.

https://nick-black.com/htp-notcurses.pdf#page53

raffy4y ago

Kinda related: I am developing a library for ENS (Ethereum Name Service) name normalization: https://github.com/adraffy/ens-normalize.js

I'm trying to find the best combination of UTS-46, UTS-51, UTS-39, and prior work on IDN resolution w/r/t confusables: https://adraffy.github.io/ens-normalize.js/test/report-confu...

Interactive Demo: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...

Also this emoji report is pretty cool: https://adraffy.github.io/ens-normalize.js/test/report-emoji...

jupp0r4y ago

aidenn04y ago

> The answer is: it depends. There isn't a canonical way to reverse a string, at least that I'm aware of.

Unicode defines grapheme clusters[1] that represent "user-perceived characters"; separating a string into those and reversing them seems like a pretty good way to go about it.

1: http://www.unicode.org/reports/tr29/
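Here is a deliberately simplified, standard-library-only sketch of that approach. It is not a full UAX #29 implementation: it only keeps together regional indicator pairs, combining marks, variation selectors, skin tone modifiers and zero-width-joiner sequences, and all function names are hypothetical. For real use, a library that tracks the Unicode data files is the safer choice.

```python
import unicodedata

ZWJ = "\u200d"


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch: str) -> bool:
    """Code points that this sketch attaches to the preceding cluster."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks
        or ch == ZWJ
        or 0xFE00 <= ord(ch) <= 0xFE0F                   # variation selectors
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                 # skin tone modifiers
    )


def clusters(text: str) -> list[str]:
    """Split text into approximate grapheme clusters (simplified rules)."""
    out: list[str] = []
    for ch in text:
        if out:
            prev = out[-1]
            # Extenders, and whatever follows a ZWJ, stay with the previous cluster.
            if is_extender(ch) or prev.endswith(ZWJ):
                out[-1] += ch
                continue
            # Regional indicators pair up two at a time.
            if is_regional_indicator(ch) and len(prev) == 1 and is_regional_indicator(prev):
                out[-1] += ch
                continue
        out.append(ch)
    return out


def reverse_graphemes(text: str) -> str:
    return "".join(reversed(clusters(text)))


spain = "\U0001F1EA\U0001F1F8"
print(reverse_graphemes("Flag: " + spain) == spain + " :galF")  # True
print(len(clusters("Flag: " + spain)))                          # 7
```

Because the rules for what counts as a cluster change between Unicode versions, any hand-rolled table like this one will drift out of date.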

nextstep4y ago

Compare all of this nonsense to how it’s done in Swift. String APIs in Swift are great: intuitive and do what you expect.

alfredxing4y ago

zwerdlds4y ago

zanzibar7354y ago

heystefan4y ago

Oooh I know this one, I've read it here last year: https://tonsky.me/blog/emoji/

ezfe4y ago

Works in Swift, which is the benefit of Swift having the most painful String API possible:

let v = "Flag: (flag emoji)"
String(v.reversed()) // Output: (flag emoji) :galF
v.count // Output: 7

nitely4y ago

You can, but you need to break the string into graphemes first.

kart234y ago

Google also interprets emojis funny. Google the Estonian and South Sudan flag (f"{chr(127466)*2+chr(127480)*2}") and you get results for Spain.
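A possible explanation, sketched below: the four regional indicators read EE + SS when paired from the left, but the middle two spell ES. The helper `to_letters` is a name I made up, and what Google actually does with the query is a guess.

```python
def to_letters(indicators: str) -> str:
    """Translate regional indicator symbols back to plain ASCII letters."""
    return "".join(chr(ord(ch) - 0x1F1E6 + ord("A")) for ch in indicators)


query = chr(127466) * 2 + chr(127480) * 2

# Pairing from the start of the string gives Estonia and South Sudan.
pairs = [to_letters(query[i:i + 2]) for i in range(0, len(query), 2)]
print(pairs)  # ['EE', 'SS']

# Starting one code point later finds Spain in the middle.
print(to_letters(query[1:3]))  # ES
```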

randpx4y ago

Try reversing the Canadian flag (CA) and you get the Ascension Island Flag (AC). Great article, but completely misses the point.

ts4z4y ago

So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about the even more exotic stuff like Emoji + skin tone. It's necessary to be very careful.

(IIRC UCS-2 is the deadname, now we call it UTF-16 to remind us to always handle surrogate pairs correctly, which we don't.)
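To show what goes wrong when surrogate pairs are ignored, here is a sketch that reverses UTF-16 code units instead of code points. The sample text is my own choice.

```python
import struct

text = "a\U0001F600"  # "a" followed by U+1F600, which lies outside the BMP

# Split the UTF-16 encoding into 16-bit code units.
data = text.encode("utf-16-le")
units = list(struct.unpack(f"<{len(data) // 2}H", data))
print([hex(u) for u in units])  # ['0x61', '0xd83d', '0xde00']

# Reversing code units puts the low surrogate before the high one,
# which is no longer well-formed UTF-16.
reversed_data = struct.pack(f"<{len(units)}H", *reversed(units))
try:
    reversed_data.decode("utf-16-le")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)  # False
```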

TLDR: Strings are hard.

mlindner4y ago

nottorp4y ago

So basically unicode along with c++ are great job security if you do bother to learn them.

There's another word that comes to mind when thinking about those two: metastasis.

hougaard4y ago

In other news, water is wet :)

exdsq4y ago

Am I missing something or is this Day 1 of a programming course in C?

midjji4y ago

And this is why char should have been byte from the start.
