The worst part, which this article doesn't even touch on with normalizing and remapping characters, is when your login form doesn't normalize but your database does. Suddenly I can re-register an existing account by using a different set of codepoints: the login system thinks the username doesn't exist, but the auth system maps it to somebody else's record.
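For concreteness, a minimal Python sketch of that mismatch (the canonical_username helper is hypothetical; the point is that every path should share one normalization step):

```python
import unicodedata

# Two byte-wise different spellings of the same visible name.
composed = "Jos\u00e9"     # "José" with a single precomposed é
decomposed = "Jose\u0301"  # "José" as "e" + U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                       # False: raw comparison misses the match
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True after normalizing both sides

def canonical_username(raw: str) -> str:
    # Hypothetical helper: normalize once, and call it from every path
    # (registration, login, lookup) so they can't disagree.
    return unicodedata.normalize("NFC", raw)
```

If registration goes through canonical_username() but the login/auth path doesn't (or vice versa), the two paths disagree about which accounts already exist.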
Also known as Zalgo. But it seems most renderers nowadays overlay multiple combining marks over each other rather than stack them, which makes it look far less eldritch than it used to.
That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.
In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for building text search indexes where you want matches to be forgiving, but using it across most of the web would be obviously wrong, because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.
Of course, there are also purpose-specific algorithms for preparing text for search that would be even better.
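A small Python sketch of the difference, using the stdlib's unicodedata (these mappings are standard TR15 behavior):

```python
import unicodedata

samples = ["ﬁle", "x²", "①", "Ｈｅｌｌｏ"]  # ligature, superscript, circled digit, fullwidth

for s in samples:
    print(repr(s),
          "NFC:", repr(unicodedata.normalize("NFC", s)),
          "NFKC:", repr(unicodedata.normalize("NFKC", s)))

# NFC leaves all of these untouched; NFKC rewrites them to plainer code
# points ("file", "x2", "1", "Hello") -- useful in a search index, but it
# visibly changes what the user typed.
```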
That is a really bad and user-hostile thing to do. Many of those characters are perfectly valid characters in various non-Latin scripts. If you want to force everyone to use Latin script for identifiers, then own up to it and say so. But rejecting just some of them for being too similar to Latin characters just makes the behaviour inconsistent and confusing for users.
I initially thought that must surely be what they are doing and they just worded it very, very poorly. But then, of the 31 "disagreements", only one matters: the long s, which reads as either f or s. All the other disagreements map to visually similar symbols, like O and 0, which you should already treat as the same for this check.
Not to mention that this is mostly an issue for URL slugs, i.e. after NFKC normalization. In HTML this is more robustly solved by styling conventions. Even old BB-style forums will display admin and moderator user names in a different color or in bold to show their status. The modern flourish is to put a little icon next to these kinds of names, which also scales well to other identifiers.
Similarly, if you're going to create an identifier for yourself that is supposed to be usable in an international context, you'll have to use the lowest common denominator that is acceptable in that context - and that happens to be a-zA-Z0-9. Why the Latin alphabet and numerals and not, say, Arabic, you might ask? Because Chinese and Indian and Arabic speakers are far more likely to be familiar with the Latin alphabet than with each other's writing systems.
If you're saying that "José" should be accepted as an username, shouldn't "Борис" or "김" or "金" also be valid?
It makes sense to restrict the alphabet for things like usernames that should be unique, should be easy to read for security reasons and should be correctly handled by various types of backend software.
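A sketch of that kind of restriction; the exact allowlist and length bounds here are an assumed policy, not any standard:

```python
import re

# Hypothetical policy: ASCII letters, digits and underscore, 3-32 characters.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,32}")

def is_valid_username(name: str) -> bool:
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username("jose_95"))  # True
print(is_valid_username("José"))     # False: rejected by the ASCII allowlist
print(is_valid_username("аdmin"))    # False: the leading Cyrillic "а" is not ASCII
```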
I'm not from the US and my name isn't ASCII, but I wouldn't mind spelling it with the English alphabet, even in a name field.
I also don't understand how English has 26 letters, yet letters like "é" in "José" or "ï" in "naïve" are treated as normal letters, and if I wrote "Jose" instead, it would read as offensive. In my language, which uses Cyrillic, the letters of the alphabet are all the letters we use, period. It would just be wrong to borrow a letter from another alphabet, even one in the same script, just because someone's name includes it in their language. I have a friend from a neighboring country who changed one of his Cyrillic letters when he came to my country. I would do the same if I went to his country and they didn't have a letter we have.
And special features to mark Cyrillic or other characters that I'd consider dangerous.
The approach there should be what wongarsu describes below (imo): style the UI so official accounts are visually distinct (badges, colour, etc.) rather than policing the character set.
namespace-guard is deliberately opinionated for the slug/handle case where you've already decided the output should be ASCII-safe. If your use case is broader than that, confusables detection without rejection is the right call.
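For the detection-without-rejection side, a rough Python sketch. The first-word-of-the-character-name trick is only a heuristic of my own; a proper implementation should use the Script property from UTS #39 / Scripts.txt:

```python
import unicodedata

def rough_scripts(name: str) -> set:
    # Rough heuristic: take the first word of each letter's Unicode name
    # ("LATIN", "CYRILLIC", "GREEK", ...) as a stand-in for its script.
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in name if ch.isalpha()}

def mixed_script_warning(handle: str):
    # Flag for review / UI badging instead of rejecting outright.
    scripts = rough_scripts(handle)
    if len(scripts) > 1:
        return f"{handle!r} mixes scripts {sorted(scripts)}; possibly confusable"
    return None

print(mixed_script_warning("jose"))    # None
print(mixed_script_warning("Борис"))   # None: a single non-Latin script is fine
print(mixed_script_warning("pаypal"))  # warning: Latin plus a Cyrillic "а"
```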
- Canonical (NF)
- Compatible (NFK)
- Composed vs decomposed
- Confusable (confusables.txt)
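The first three of those are easy to see side by side with Python's unicodedata; confusables are separate data, as noted in the comments:

```python
import unicodedata

s = "ﬁancé"  # LATIN SMALL LIGATURE FI + "anc" + precomposed é (U+00E9)

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, [hex(ord(c)) for c in unicodedata.normalize(form, s)])

# NFC/NFD are the canonical forms: only é switches between composed (U+00E9)
# and decomposed (U+0065 U+0301); the ﬁ ligature survives.
# NFKC/NFKD are the compatibility forms: the ligature additionally becomes "fi".
# Confusables are a separate mapping (confusables.txt from UTS #39) and are
# not part of any normalization form; the stdlib doesn't ship that table.
```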
Does Unicode not define something like "fuzzy" equivalence? Like "confusable" but broader, for search bar logic? The most obvious differences would be case and diacritic insensitivity (e, é). Case is easy since any string/regex API supports case insensitivity, but diacritic insensitivity is not nearly as common, and there are other categories of fuzzy equivalence too (e.g. ø, o).
I guess it makes sense for Unicode to not be interested in defining something like this, since it relates neither to true semantics nor security, but it's an incredibly common pattern, and if they offered some standard, I imagine more APIs would implement it.
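There's no single Unicode "fuzzy" form, but the usual home-grown approximation is case-folding plus stripping combining marks; a sketch (proper search engines instead use collation, e.g. a UCA implementation like PyICU):

```python
import unicodedata

def fuzzy_key(s: str) -> str:
    # Case-fold, decompose, and drop combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(fuzzy_key("José") == fuzzy_key("jose"))    # True
print(fuzzy_key("naïve") == fuzzy_key("naive"))  # True
print(fuzzy_key("Bjørn") == fuzzy_key("bjorn"))  # False: ø has no decomposition,
                                                 # so it needs an extra mapping table
```

Which illustrates the point: the ø/o category isn't covered by any normalization form at all.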
"...the default confusables list is extremely buggy. It needs at least 7 manual exceptions for the ASCII range, 12 exceptions for Greek, and I didn’t check any others scripts. python and clang-tidy were very unsuccessful with this approach, compared to java, rust and cperl with the mixed-script approach." https://rurban.github.io/libu8ident/#confusables
In detail: https://rurban.github.io/libu8ident/doc/D2528R1.html, section 10 ("TR39 Mixed Scripts").
That’s not the case.