$ python
Python 3.7.0 (default, Jul 22 2018, 21:11:34)
[Clang 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata as ud
>>> ud.normalize('NFKD', '''It's a kinda ascii-art thing that 𝔩𝔢𝔱𝔰 𝔶𝔬𝔲 𝔞𝔫𝔰𝔴𝔢𝔯 𝔥𝔫 𝔠𝔬𝔪𝔪𝔢𝔫𝔱𝔰 𝔞𝔩𝔩 𝔣𝔞𝔫𝔠𝔶 𝔩𝔦𝔨𝔢 𝔱𝔥𝔦𝔰. 𝕆𝕣 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕨𝕒𝕟𝕥 𝕝𝕖𝕤𝕤 𝕘𝕠𝕥𝕙𝕚𝕔 𝕞𝕠𝕣𝕖 𝕠𝕦𝕥𝕝𝕚𝕟𝕖.
...
... 𝙸𝚜𝚗'𝚝 𝚞𝚗𝚒𝚌𝚘𝚍𝚎 𝚐𝚛𝚎𝚊𝚝?''')
"It's a kinda ascii-art thing that lets you answer hn comments all fancy like this. Or like this if you want less gothic more outline.\n\nIsn't unicode great?"
>>>
You'd hope a screen reader would have more effort put into it than a 3-second read of an HN thread? VoiceOver on iOS doesn't speak it, either.
Badly-implemented screen readers don't do very well with this. The Unicode Standard provides Normalization Forms based on Compatibility Decomposition (NFKD/NFKC) that screen readers definitely should adopt in their Unicode implementations [1].
This is called a homoglyph attack.
If you accept unicode for strings that should be "unique" (eg usernames), there are various normalization schemes that basically map equivalent-ish-looking characters onto a single canonical form.
I have no doubt spam filters use this.
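As a minimal sketch of that idea (the function name and the casefold policy are my own choices, not from any particular site), NFKC plus case folding collapses the styled variants onto one canonical key:

```python
import unicodedata

def canonical_username(name: str) -> str:
    """Map compatibility variants (styled math letters, fullwidth
    forms, ligatures) and case differences onto one canonical key."""
    return unicodedata.normalize("NFKC", name).casefold()

# Styled and fullwidth spellings collapse onto the plain one:
print(canonical_username("𝖆𝖑𝖎𝖈𝖊"))    # alice
print(canonical_username("ａｌｉｃｅ"))   # alice
print(canonical_username("Alice"))      # alice
```

Store the canonical key alongside the display name and enforce uniqueness on the key, not on what the user typed.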
The whole point of having Mathematical Alphanumeric Symbols as separate unicode code points, rather than just using normal latin characters with style markup, is so they can be used when the different letters have semantically different meanings -- in particular in maths when 𝐹 and 𝔉 can be in the same formula, representing different concepts. They're not a replacement for style markup.
In other words, they're different characters specifically so that screen readers can know to read them out loud differently!
Trying to 'fix' screen readers by having them read these characters as if they were normal latin characters, to accommodate people who like using the Mathematical Symbols block for fun in places which only allow plain text, would completely defeat their actual purpose.
𝙸𝚜𝚗'𝚝 𝚞𝚗𝚒𝚌𝚘𝚍𝚎 𝚐𝚛𝚎𝚊𝚝?
Am i supposed to normalize ALL untrusted user input or will that break normal text in some language i'm not familiar with? Or only normalize things that are supposed to be unique, like urls, usernames and other identifiers?
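For what it's worth, a quick stdlib experiment suggests why blanket NFKC over all user input is lossy: compatibility normalization rewrites ordinary typographic characters too, not just the styled math letters (a sketch, not a policy recommendation):

```python
import unicodedata

# NFKC changes more than styled math letters -- ligatures,
# superscripts, vulgar fractions, Roman numerals, and symbols
# all get rewritten:
for s in ["ﬁle", "x²", "½", "Ⅷ", "™"]:
    print(s, "->", unicodedata.normalize("NFKC", s))
# ﬁle -> file, x² -> x2, ½ -> 1⁄2, Ⅷ -> VIII, ™ -> TM
```

That argues for normalizing only identifiers that must be unique, and leaving free-form text alone.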
I discovered that by accident.
Very neat utility, though!
P.S.: It looks like you've reinvented the YayText[1] website ;-)
[0] http://adamvarga.com/strike/
[1] https://yaytext.com/strike/
I learned something new today, thanks!
𝔸 𝔹 ℂ 𝔻 𝔼 𝔽 𝔾 ℍ 𝕀 𝕁 𝕂 𝕃 𝕄 ℕ 𝕆 ℙ ℚ ℝ 𝕊 𝕋 𝕌 𝕍 𝕎 𝕏 𝕐 ℤ
𝔉𝔬𝔯𝔪𝔞𝔱 𝔱𝔢𝔵𝔱 𝔲𝔰𝔦𝔫𝔤 𝔲𝔫𝔦𝔠𝔬𝔡𝔢 𝔠𝔥𝔞𝔯𝔞𝔠𝔱𝔢𝔯𝔰. 𝔓𝔞𝔰𝔱𝔢 𝔞𝔫𝔶𝔴𝔥𝔢𝔯𝔢 𝔱𝔥𝔞𝔱 𝔞𝔠𝔠𝔢𝔭𝔱𝔰 𝔭𝔩𝔞𝔦𝔫 𝔱𝔢𝔵𝔱.
For example, if I use the tool to make a url italic, then pasting that url into Chrome's url bar gives me back a bunch of unicode rectangles.
But that's not what I wanted. I wanted Chrome's url bar to interpret those unicode code points as an italic version of the actual unicode code points I want. Chrome should add a check for edge cases like these and add branches to map the string to the corresponding non-styled code points automatically.
Someone needs to send lots of bug reports to all the relevant pieces of software that currently have this bug. Firefox, Chromium, probably Edge, Webkit. Those are just the browsers, but I'm sure there are more. I'm not actually sure about Firefox tbh, but maybe just send the bug report first and see if it gets accepted to find out.
Ooh, here's another one-- if you paste some unicode.style'd text into LibreOffice does it convert it to the "normal" code points and add the relevant styling? If not, it should, otherwise it's broken.
Actually, I just realized another issue. If I type something in the url bar that is styled with unicode.style, then there is no way for Chrome to know whether I want it displayed styled or not.
For example, maybe I'm pasting it there temporarily so that I can copy/paste it later in a Tweet. In that case I probably want to keep the current styling for the tweet.
So Chrome should map to the normalized unicode code points (just in case I'm typing a url or want to instantiate a search), but still display the styled version. Then when I copy it again, it should put the unicode.style version into the buffer. And the app which receives the pasted buffer should receive the unicode.style code points. And of course that app should also normalize it underneath while retaining the styled display for the same reasons.
To deal with this complexity, there should probably be a standardized way for all apps to deal with styled text.
Please help by testing every app and filing relevant bug reports.
No, it shouldn't. They are semantically different code points. The _whole point_ is that they are semantically different code points: they're from the Mathematical Alphanumeric Symbols block, whose purpose is for cases where e.g. a formula contains 𝐹 and 𝔉 as semantically different characters, and that difference needs to be preserved in copying & pasting, conveyed to screen readers, etc.
> Ooh, here's another one-- if you paste some unicode.style'd text into LibreOffice does it convert it to the "normal" code points and add the relevant styling? If not, it should, otherwise it's broken.
It really, really, really shouldn't, for the same reason as above.
Mathematical symbols are not a replacement for text styles and markup. Trying to make them that will destroy the thing they're actually useful for, the thing they can do that text styling can't do: preserve their semantics when transmitted in plain text (including for accessibility purposes).
This is not styling. It cannot be normalized away. They are semantically different characters.
"ABCD" and "๐๐๐๐" are the same thing, but they also aren't. Am I supposed to normalize everything on username creation to prevent people from making duplicates?
i18n is complicated because the diversity of language is itself complicated. I feel your pain, though, don't get me wrong.
You probably never want to allow unrestricted use of any character set for a username, even ASCII - otherwise I could take the username 'bjt2n3904 '.
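One conservative way to do that, sketched below with a hypothetical policy (the length limits and allowed character set are made up for illustration): skip normalization for identifiers entirely and restrict them to an explicit whitelist instead.

```python
import re

# Hypothetical username policy: ASCII letters, digits, '_' and '-',
# 3-20 characters, nothing else -- no styled look-alikes, no spaces.
USERNAME_RE = re.compile(r"[A-Za-z0-9_-]{3,20}")

def valid_username(name: str) -> bool:
    # fullmatch ensures the whole string conforms, not just a prefix.
    return USERNAME_RE.fullmatch(name) is not None

print(valid_username("bjt2n3904"))     # True
print(valid_username("bjt2n3904 "))    # False: trailing space
print(valid_username("𝖇𝖏𝖙𝟐𝖓𝟑𝟗𝟎𝟒"))    # False: styled homoglyphs
```

A whitelist rejects both the trailing-space trick and the styled homoglyphs in one check, at the cost of excluding legitimate non-ASCII names.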
Did Unicode ever have any business assigning separate code points for italic versions of latin glyphs? Am I ignorant as to their true purpose?
EDIT: SEMI just answered my second question. Math. Makes sense.
Here's the official answer to this question from the Unicode spec [1]:
"Mathematical notation requires a number of Latin and Greek alphabets that initially appear to be mere font variations of one another. For example, the letter H can appear as plain or upright (H), bold (๐), italic (๐ป), and script (โ). However, in any given document, these characters have distinct, and usually unrelated, mathematical semantics. For example, a normal H represents a different variable from a bold H, etc. If these attributes are dropped in plain text, the distinctions are lost and the meaning of the text is altered."
[1] https://www.unicode.org/versions/Unicode11.0.0/UnicodeStanda...
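The spec's point can be checked directly with Python's stdlib: NFKD folds all of these distinct H's down to the plain letter, which is exactly the information loss the passage warns about (a quick sketch):

```python
import unicodedata

# 𝐇 (bold), 𝐻 (italic), and ℋ (script) are distinct code points with
# distinct mathematical meanings:
for c in ["𝐇", "𝐻", "ℋ"]:
    print(hex(ord(c)), unicodedata.name(c))

# NFKD maps all three to the same plain "H", discarding exactly the
# distinction the spec says must be preserved:
assert all(unicodedata.normalize("NFKD", c) == "H" for c in ["𝐇", "𝐻", "ℋ"])
```

So normalization is fine as a search or spam-filtering heuristic, but applying it destructively to stored text erases meaning.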
Thanks! I started by taking the tool at face value as a serious and important feature for putting styling where UIs don't allow it, then did a quick stream-of-consciousness advocacy for fitting that feature into all the places where it isn't allowed.
The results are obviously bad, which is reassuring.