But what about accidentally working-as-intended?
Sure it's a little trickier to read, but it's certainly not a "bug" that will cause any damage / danger / instability / etc.
Even the most strict definition of bug doesn't imply it has to "cause any damage / danger / instability / etc." to be one.
And I wouldn't call it "working as intended" when the purpose of this feature is to provide an answer for a human to read, and it failed at that.
An expression in French for this: "Tomber en marche" (literally: falling into walking). When something breaks we say it "tombe en panne" (falls into being out of service), when something works we say it "marche" (walks). So this expression is like "falling into a working state".
I wonder about the ratio of unknown bugs vs features that accidentally work, in the wild. Such features are time bombs waiting to explode during the next refactoring.
For example this shows an @: https://duckduckgo.com/?q=u0040&ia=answer
Because the text looked very odd to me I highlighted the nonsensical text "noitatneserper lausiv" and context-menu searched it on Google. To my surprise it googled for "visual representation", and while retrying because I thought that maybe Google's engine auto-"corrected" the text, I noticed that even the text in the context-menu stated that it would google for "visual representation".
Then seeing that it was "noitatneserper lausiv" in reverse, maybe also in combination from the first hit "U+202E RIGHT-TO-LEFT OVERRIDE - Unicode Explorer", it felt like the browser had done something it should not do by actually applying the reversion to the info box.
When inspecting the HTML tag of the info box it displays the string "‮ U+202E RIGHT-TO-LEFT OVERRIDE, decimal...", but whenever I try to do something with it, it gets either reversed or messed up.
Another bug: When I select the entire text in the info box, I get " U+202E RIGHT-TO-LEFT OVERRIDE, decimal: 8238, HTML: No visual representation, UTF-8: 0xE2 0x80 0xAE, block: General Punctuation" <-- (btw, this was NOT what I had first entered into the textfield before this edit)
And trying to append a double quote to the text above, it inserts it at the beginning of the line, actually after the E202+U. When I expand the textbox so that the entire paragraph is in one line, E202+U moves to the end.
All this is creepy and I bet that it won't be long until an exploit with this uncontrollable Unicode character will hit the first vulnerable servers and browsers. This feels like Unicode is playing with fire.
Edit: From https://unicode-explorer.com/c/202E
> The Right-To-Left Override character can be used to force a right-to-left direction within a text. This is often abused by hackers to disguise file extensions: when using it in the file name my-text.'U+202E'cod.exe, the file name is actually displayed as my-text.exe.doc - so it seems to be a .doc file while in reality it is an .exe file. There's even an xkcd comic for this character!
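A minimal Python sketch of the disguise described in that quote, assuming the file name is available as a Unicode string. `BIDI_CONTROLS` and `find_bidi_controls` are names made up for this illustration, not part of any library, and the set only lists the explicit bidi formatting characters.

```python
import unicodedata

# Explicit bidi formatting characters: embeddings, overrides, isolates, marks.
BIDI_CONTROLS = set(
    "\u202a\u202b\u202c\u202d\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"         # LRI, RLI, FSI, PDI
    "\u200e\u200f\u061c"               # LRM, RLM, ALM
)

def find_bidi_controls(name):
    """Return (index, Unicode name) for every bidi control found in `name`."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(name) if ch in BIDI_CONTROLS]

# The file name from the quote: "my-text." + RLO + "cod.exe"
filename = "my-text.\u202ecod.exe"

# The stored (logical) order still ends in ".exe", whatever the display shows.
print(filename.endswith(".exe"))      # True
print(find_bidi_controls(filename))   # [(8, 'RIGHT-TO-LEFT OVERRIDE')]
```

Whether to reject, strip, or merely flag such a name is a policy decision this sketch does not make.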
So you might have something else going on if it shows up as "[object Object]" in your browser history.
Are there any good security guides/best practices for Unicode sanitization?
There are legitimate uses of BiDi control characters. My favorite one from my time on Android was the string "Google+", which would render as "+Google" in an RTL paragraph. The translators would usually "fix" this by just flipping the string so that it was "+Google", which would render correctly, but be incorrect when cut'n'pasted, read by a screen reader, etc. The correct solution is to use a left-to-right mark. The string "Google+\u{200e}" renders correctly in both LTR and RTL flow, because the mark leaves the "+" sitting between two left-to-right characters. And these "mark" characters are basically harmless: they cannot profoundly change the order, they just fix some of these ambiguous cases.
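A small Python sketch of why the "+" is one of these ambiguous cases, using only the standard `unicodedata` module. It prints the bidi class of a Latin letter, of the plus sign, and of the left-to-right mark; the constant name `LRM` is just a label chosen here.

```python
import unicodedata

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK, invisible when rendered

for ch in ("G", "+", LRM):
    # bidirectional() returns the character's Bidi_Class from the Unicode database
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```

The letter and the mark are both class `L` (strong left-to-right), while the plus sign is `ES`, a weak class whose final direction depends on its neighbours and on the paragraph direction. That is why an invisible strong character next to it can settle the question.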
Correct use of BiDi control characters is explained here: https://www.w3.org/International/questions/qa-bidi-unicode-c...
YES.
https://github.com/danielmiessler/SecLists/blob/master/Fuzzi...
# Human injection
#
# Strings which may cause human to reinterpret worldview
If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.
The report is divided into visual and non-visual security issues. Our old friend RTL override is covered, but mostly in the context of URLs.
Anyway, Unicode category Cf is probably what you are looking for, but blocking those characters outright is probably wrong, as they serve an important function.
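A short Python sketch of what finding category Cf looks like in practice, and of one way blanket blocking goes wrong. The helper names `format_chars` and `strip_format_chars` are hypothetical, and the emoji example is an illustration added here, not something from the thread.

```python
import unicodedata

def format_chars(text):
    """Return the Cf (format) characters found in `text`, in order."""
    return [ch for ch in text if unicodedata.category(ch) == "Cf"]

def strip_format_chars(text):
    """Naive sanitizer: drop every Cf character."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

rlo_sample = "\u202e U+202E RIGHT-TO-LEFT OVERRIDE"
# Man + ZWJ + woman + ZWJ + girl: one family emoji held together by U+200D (also Cf)
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"

print([f"U+{ord(c):04X}" for c in format_chars(rlo_sample)])  # ['U+202E']
print(len(family), len(strip_format_chars(family)))           # 5 3
```

Stripping removes the override, but it also splits the joined emoji into three separate ones, since the zero width joiner is in the same category.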
– It is unintentional for DuckDuckGo. The code for DuckDuckGo works correctly but no one who wrote that code thought about whether a reversal would happen.
– It is intentional for the browser. The code for the browser works correctly and someone who wrote that code actively thought about how to make a reversal happen.
I don’t think ‘accidental’ is the right word to use in either case because the outcome is what you would want.
So I think it's fair to say that it's not intentional in the sense of being a deliberately added easter egg. Of course, they might be aware of the behavior and decided to leave it that way.
None of these will be shown, but ddg will recognise them as control characters though. https://www.compart.com/en/unicode/category/Cc
So every programmer has to know about and support U+202E, but not filesystem programmers?
Note that U+202E is a control code that has effect on display, not the logical order of the text (much like, say, a bare CR), so I can’t say what the filesystem is doing wrong here (except maybe for not rejecting this outright, but see re smarts above, this probably needs to be done on a higher level). You don’t blame the filesystem for believing the filename "A\rB.txt" starts with A and not B, do you? Even though ls will say otherwise.
Bidi IRIs (which are at that higher level) are kind of horrendous, though.
Replace:
if(bytestring_ends_with(filename, ".exe")) execute_file(...);
By:
if(last_displayed_glyphs_equal(filename, ".exe")) execute_file(...);
if (!isascii(c)) panic("stupid user");
если (!кои(с)) авост(«тупой оператор»);
You wouldn’t want to live in that world, would you? I know I wouldn’t, and I have that as my native script and most of my filesystem in Latin. I’ve spent my childhood with a computer that ran a VGA-chargen-reprogramming hack at startup and later had to maintain a website stored in an encoding designed to preserve legibility after Latinization through amputation of the 8th bit (in case you’ve ever wondered where the illogical order of KOI-8 comes from). I do not want that world back, however fondly I remember my 286.
Specify the dominant direction of your user-input-containing elements, people, and/or enclose the input in U+2068 FSI ... U+2069 PDI (after balancing outstanding bidi controls inside).
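A rough Python sketch of the "balance, then enclose in FSI ... PDI" recipe. The function names `balance_bidi` and `isolate` are invented for this example, and the balancing rules are a simplified reading of how the Unicode Bidirectional Algorithm pairs these controls, not a verified implementation of it.

```python
FSI, PDI, PDF = "\u2068", "\u2069", "\u202c"
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO

def balance_bidi(text):
    """Drop unmatched closers and append closers for anything left open."""
    out = []
    stack = []  # closers still owed, innermost last
    for ch in text:
        if ch in ISOLATE_OPENERS:
            stack.append(PDI)
        elif ch in EMBED_OPENERS:
            stack.append(PDF)
        elif ch == PDI:
            if PDI not in stack:
                continue  # stray PDI would close the caller's isolate: drop it
            # A PDI also ends any embeddings opened inside the isolate it closes.
            while stack.pop() != PDI:
                pass
        elif ch == PDF:
            if not stack or stack[-1] != PDF:
                continue  # PDF with no embedding open at this level: drop it
            stack.pop()
        out.append(ch)
    out.extend(reversed(stack))  # close whatever the input left open
    return "".join(out)

def isolate(user_text):
    """Wrap user input so its bidi controls cannot affect surrounding text."""
    return FSI + balance_bidi(user_text) + PDI

wrapped = isolate("\u202ereversed")
print([f"U+{ord(c):04X}" for c in wrapped if ord(c) > 0x7F])
# ['U+2068', 'U+202E', 'U+202C', 'U+2069']
```

The balancing step matters mostly for isolates: an unclosed one inside the input would swallow the closing PDI, and a stray PDI would end the wrapper early.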
The problem is not with Arabic or Hebrew. The problem is that this modifier affects other languages and characters in a way the vast majority of people clearly wouldn't expect (otherwise the story wouldn't make it to the front page).
> Specify the dominant direction of your user-input-containing elements, people, and/or enclose the input in U+2068 FSI ... U+2069 PDI (after balancing outstanding bidi controls inside).
The level of arrogance packed in this sentence is just mind-boggling.
There are many other "Easter eggs" in various basic technologies. I can assure you that no matter how high of an opinion you have about yourself, if you write any production code at all, you are guaranteed to be using something that contains other Easter egg design decisions. You're not aware of them, you're not mitigating them and therefore whether they will explode on you is mostly just a matter of luck.
Minimizing "Easter egg" design decisions is the only long-term viable way to get complexity in our already complex environment under control.
> The level of arrogance packed in this sentence is just mind-boggling.
It’s not arrogance, really, it’s just that I’ve been reading on this exact thing for the last couple of months, and the relevant knowledge is rather unpleasantly smeared over multiple documents in several places (W3C and Unicode.org mostly), so I tried to condense the recipe into a single sentence and drop some terms an interested person could look up: I was attempting to pack information. I see now how that could come off as arrogant, but can’t think of appropriate circumlocutions that could ward that off without turning it into a full bidi-in-HTML tutorial (which I am not qualified to write, for one thing). I already write too many unsolicited tutorials in my comments, this is me trying not to :(
> There are many other "Easter eggs" in various basic technologies. I can assure you that no matter how high of an opinion you have about yourself, if you write any production code at all, you are guaranteed to be using something that contains other Easter egg design decisions. [...]
I’m aware I have limits! I know lots of those! I discover new ones every day!
(I dread the day I need to figure out how an 802.11 retransmission works and how to fight one, for one thing. I can’t do post-2010 JS frontend to save my life, and my database knowledge is somewhere around “there were those guys with the normal form, I think?”. Limits? I’ve got ’em.)
I also expect that once I know about a footgun, I have a responsibility to avoid it, and that people who have just encountered such generally want to hear how to avoid it as well. I’m not entirely competent at the communication part. Sorry.
As to the actual issue... I could say that if you’re handling multilingual text, then you should damn well know how multilingual text works, that it’s not peripheral to your problem.
But I don’t actually believe that, not completely: I think this bidi thing is needlessly hard and we should have directional-stack-balancing and directionality-isolating functions in our standard libraries the same way we have URL-escaping or HTML-quoting ones. Perhaps even have the templating handle most of these cases automatically. It’s like with SQL injection: I don’t have a right to complain people are writing vulnerable queries if we don’t have convenient tools to write correct ones. Unfortunately, in the bidi case, we don’t, so we’ll have to treat this like spun glass until someone makes them.
(That’s part of why I’ve been looking into this so much lately.)
[Previously]
> The problem is not with Arabic or Hebrew. The problem is that this modifier affects other languages and characters in a way the vast majority of people clearly wouldn't expect (otherwise the story wouldn't make it to the front page).
As far as I know, this is not solvable. Or rather, this specific thing is, and the right-to-left override (U+202E RLO) is kind of a screw-up due to this kind of nonlocal effect on surrounding text (it might even be a holdover from the IBM days?), but you can’t design RTL such that it can be ignored by unaware programmers, with or without directional controls. Last I checked (several years ago), a post in Hebrew would wreak considerable destruction on an LTR Facebook news feed, no controls required.
The problem is of distinguishing a white zebra with black stripes from a black zebra with white stripes: Are you looking at RTL text with LTR pieces inside or LTR text with RTL pieces inside? (If you don’t see why this would change the layout, the Unicode Bidirectional Algorithm spec has examples.) What if the pieces themselves include opposite-direction quotes? How do you know where the pieces end in the presence of characters with no intrinsic direction (punctuation, emoji)?
You can encode everything in LTR display order. Your RTL-script users, DBAs, search engine developers, etc. will hate you.
You can require explicit indicators. If this needs to work in plain text (and it does, if Arabic and Hebrew are to do plain text at all, because RTL text requires embedded LTR pieces fairly often), you’ll have to express that in format controls. But then if a user manages to drop a right-to-left switch into English text, which couldn’t care less about RTL, the text will get completely messed up and the user gets to complain why RTL influences English. You may try to completely disallow controls in markup that has alternative ways of expressing directionality, but then your input method, your clipboard, etc. need to know about every possible kind of markup, or every markup processor needs to generate equivalent controls. To at least limit the scope of the disaster, you declare that the effect of the controls ends at a paragraph boundary, but then you need to tell where that is, and the kind of “plain text” you inherited has no good way of distinguishing a mere hard line break from a paragraph terminator except by not-so-plain “protocol” conventions. So you’ll need to guess.
You can ditch explicit indicators and guess. Your processing algorithm will need to know which scripts have which direction, of course, but that’s not a problem. Given the presence of quotations and such in plain text, it’ll also need to learn about paired delimiters and which of them pair with which others, and try to recover when the pairs are wrong or unbalanced, because users are awful. Because of the aforementioned zebra problem, you’ll also need a way to guess which direction of a piece of text is the main one, which seems intractable without godlike NLP, so maybe just take the first character with a definite direction and tell people who start sentences with an opposite-direction fragment they lose? Overall, the whole guessing game becomes so complex it’s completely impossible to reliably embed an arbitrary fragment of user input inside your text unchanged (without inserting visible compensating delimiters, for example), so some kind of format controls that manipulate a stack of directions are called for.
The Unicode design does most of the above; it is complex and could undoubtedly be simpler—there’s like three generations of “no, that’s a bad idea, let’s try again” in there. But it seems like some indication from a programmer that they want to insert this inner thing, that should remain intact, into this outer thing, that shouldn’t get messed up in the process, would be required in any logical-order design at all; you won’t be able to just concatenate byte sequences. It’s acting on that indication that could stand to be easier.
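The "take the first character with a definite direction" heuristic mentioned above can be sketched in a few lines of Python. `first_strong_direction` is a name chosen for this example, and the sketch ignores details of the real algorithm such as skipping over isolated runs.

```python
import unicodedata

def first_strong_direction(text):
    """Guess the base direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):   # Hebrew-style and Arabic-style strong RTL
            return "rtl"
    return None  # only digits, punctuation, spaces, etc.: no opinion

print(first_strong_direction("Google+"))                    # ltr
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd world"))  # rtl
print(first_strong_direction("HTML \u05e9\u05dc\u05d5\u05dd"))   # ltr
print(first_strong_direction("+ 123 ?"))                    # None
```

The third call is the zebra problem in miniature: a sentence that is mostly Hebrew but happens to start with a Latin word is classified as left-to-right.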
Emojis made it into Unicode because Japanese had custom emoticons, that just had to be brought into Unicode. Then someone discovered them on iOS and they skyrocketed in popularity.
If you want everyone to use Unicode, you truly have to account for everyone's use cases. No exceptions. Even if it means including Emojis, Ancient Egyptian hieroglyphs[0], or something as irrelevant to every language using the Latin script as an "RTL override character".
[0] https://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_...
I'd say it's perfect design, with pretty good implementation too.
https://duckduckgo.com/?q=u1f4a9
(Yes, I have that one memorized)
(I think it's unintended though)
What is even more hilarious is if you copy/paste out of the developer tools, that is also backwards after pasting.
Also fun is enumerating all the characters in the Private Character section[2] to see what UI symbols are able to be inserted into unintended places.
[1] https://www.unicode.org/charts/PDF/U0300.pdf
[2] http://www.unicode.org/faq/private_use.html https://www.unicode.org/charts/PDF/UE000.pdf
A bit OT, but here is a classic example of that (the much upvoted stack overflow post on parsing html with regex):
DuckDuckGo shows information about the codepoint, and the codepoint itself, in a box between the search field and the actual results. In that box the text is rendered reversed (right to left), because that's what the codepoint tells the browser to do (and DDG doesn't yet have extra logic to either inject another "now render from left to right again" marker, or otherwise prevent it from messing up the info box).
What the DDG link illustrates is that when someone searches for information about that codepoint, DDG's autogenerated answer section accidentally _uses_ that control character (reversing the answer text) instead of just printing the codepoint.
zero_click_wrapper.innerText.codePointAt(0)
Evaluates to 32. And if you think 32 = 0x20 could mean the next one would be 0x2E, then no, codePointAt(1) is 0x55.
I'm on the side of this being an unintentional effect.
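For the `codePointAt` values quoted above, here is a Python sketch of what the first code points look like with and without a leading override. The two sample strings are assumptions based on the info box text quoted earlier in the thread, and `code_points` is a helper made up for this comparison.

```python
def code_points(text, n=3):
    """Hex code points of the first `n` characters of `text`."""
    return [f"0x{ord(ch):X}" for ch in text[:n]]

with_override = "\u202e U+202E RIGHT-TO-LEFT OVERRIDE"
without_override = " U+202E RIGHT-TO-LEFT OVERRIDE"

print(code_points(with_override))     # ['0x202E', '0x20', '0x55']
print(code_points(without_override))  # ['0x20', '0x55', '0x2B']
```

A first code point of 32 (0x20, a space) followed by 0x55 ("U") matches the second string, which would be consistent with the override being absent from the text that `innerText` returned.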
I'm too under the weather to dig into this, but this might be a mismatch between Firefox and the spec. I don't see in the spec [1][2] where this character could be removed since it shouldn't count as whitespace for whitespace processing.
It looks like in Chrome `innerText` contains the override. And the innerText spec is only 6 or so years old (!) so it wouldn't be too surprising if there were a lingering incompatibility.
[1] https://html.spec.whatwg.org/multipage/dom.html#the-innertex... [2] https://drafts.csswg.org/css-text/#white-space-processing
Back in the era of forums that didn't support unicode correctly (2005ish?), it was trollish fun to post messages containing \u202E and watch the UI and all subsequent messages and elements get messed up. (One stray \u202E would flip the entire page contents following it.) I never took it to a level of abuse since it was easy to remove and then ban offenders, but it was fun in a one-off thread, and it always had great reactions.
I patched my own software to handle it, but I don't recall anyone really abusing it in a widespread manner. (Contrast this with the era of prolific and widely abused AOL/AIM exploits that would kill your IM client with malformed messages.)
IIRC, a bunch of messaging clients also didn't (or still don't) handle \u202e termination and it sometimes bled into new messages and even the text input box. That was pretty horrible and unfixable without restarting.
Obligatory XKCD: https://xkcd.com/1137/
Some shenanigans in the wild:
https://www.reddit.com/r/Unicode/comments/hc1rxi/i_put_a_rig...
https://twitter.com/mkolsek/status/1237123571341803522
(These are way tamer than the effects used to be.)
(Also, HN filters it out. I tried to have some fun. :P)
data:text/html,<bdo>&%23x202E;reversed</bdo> not reversed
So I guess this should be used to wrap any user-supplied text that allows arbitrary Unicode.
Or using Unicode:
data:text/html,&%23x2068;&%23x202E;reversed&%23x2069; not reversed
[0] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/bd...
Punctuation General :block ,0xAE 0x80 0xE2 :8-UTF ,representation visual No :HTML ,8238 :decimal ,OVERRIDE LEFT-TO-RIGHT 202E+U
Love the demos :)
https://www.reddit.com/r/duckduckgo/comments/sp9e5r/backslas...
Developer: Cosimo Streppone
Developer: mintsoft"
damnit hn
I do not speak a word of Arabic. There is no circumstance in which my life will be materially improved by correct RTL text rendering. I might want proper display of individual characters so I can copy-paste them, but I have no use for RTL text.
On the other hand, RTL causes a lot of unpleasant problems like this. Why can't I simply coerce all foreign languages into LTR?