The worst part, which this article doesn't even touch on with normalizing and remapping characters, is when your login form doesn't normalize but your database does. Suddenly I can re-register an existing account by using a different set of codepoints: the login system thinks the username doesn't exist, but the auth system maps it to somebody else's record.
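For concreteness, a minimal Python sketch of that mismatch (the canonical_username helper is hypothetical; the point is that every path should share one normalization step):

```python
import unicodedata

# Two byte-wise different spellings of the same visible name.
composed = "Jos\u00e9"     # "José" with a single precomposed é
decomposed = "Jose\u0301"  # "José" as "e" + U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                       # False: raw comparison misses the match
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True after normalizing both sides

def canonical_username(raw: str) -> str:
    # Hypothetical helper: normalize once, and call it from every path
    # (registration, login, lookup) so they can't disagree.
    return unicodedata.normalize("NFC", raw)
```

If registration goes through canonical_username() but the login/auth path doesn't (or vice versa), the two paths disagree about which accounts already exist.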
Also known as Zalgo. But it seems most renderers nowadays overlay multiple combining marks over each other rather than stack them, which makes it look far less eldritch than it used to.
That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.
In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for building text search indexes where you want matches to be forgiving, but using it across most of the web would be obviously wrong, because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.
Of course, there are also purpose-specific algorithms for preparing text for search that would be even better.
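A small Python sketch of the difference, using the stdlib's unicodedata (these mappings are standard TR15 behavior):

```python
import unicodedata

samples = ["ﬁle", "x²", "①", "Ｈｅｌｌｏ"]  # ligature, superscript, circled digit, fullwidth

for s in samples:
    print(repr(s),
          "NFC:", repr(unicodedata.normalize("NFC", s)),
          "NFKC:", repr(unicodedata.normalize("NFKC", s)))

# NFC leaves all of these untouched; NFKC rewrites them to plainer code
# points ("file", "x2", "1", "Hello") -- useful in a search index, but it
# visibly changes what the user typed.
```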
That is a really bad and user-hostile thing to do. Many of those characters are perfectly valid characters in various non-Latin scripts. If you want to force everyone to use Latin script for identifiers, then own up to it and say so. But rejecting just some of them for being too similar to Latin characters just makes the behaviour inconsistent and confusing for users.
I initially thought that must surely be what they are doing and they just worded it very, very poorly. But then, of the 31 "disagreements", only one matters: the long s, which reads as either f or s. All the other disagreements map to visually similar symbols, like O and 0, which you should already treat as the same for this check.
Not to mention that this is mostly an issue for URL slugs, i.e. after NFKC normalization. In HTML this is more robustly solved by styling conventions. Even old BB-style forums will display admin and moderator user names in a different color or in bold to show their status. The modern flourish is to put a little icon next to these kinds of names, which also scales well to other identifiers.
Similarly, if you're going to create an identifier for yourself that is supposed to be usable in an international context, you'll have to use the lowest common denominator that is acceptable in that context - and that happens to be a-zA-Z0-9. Why the Latin alphabet and numerals and not, say, Arabic, you might ask? Because Chinese and Indian and Arabic speakers are far more likely to be familiar with the Latin alphabet than with each other's writing systems.
If you're saying that "José" should be accepted as an username, shouldn't "Борис" or "김" or "金" also be valid?
It makes sense to restrict the alphabet for things like usernames that should be unique, should be easy to read for security reasons and should be correctly handled by various types of backend software.
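A sketch of that kind of restriction; the exact allowlist and length bounds here are an assumed policy, not any standard:

```python
import re

# Hypothetical policy: ASCII letters, digits and underscore, 3-32 characters.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,32}")

def is_valid_username(name: str) -> bool:
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username("jose_95"))  # True
print(is_valid_username("José"))     # False: rejected by the ASCII allowlist
print(is_valid_username("аdmin"))    # False: the leading Cyrillic "а" is not ASCII
```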
I'm not from the US and my name isn't ASCII, but I wouldn't mind spelling it with the English alphabet, even in a name field.
I also don't understand how English has 26 letters, yet letters like "é" in "José" or "ï" in "naïve" are treated as normal letters, and if I wrote "Jose" instead, it would read as offensive. In my language, which uses Cyrillic, the letters of the alphabet are all the letters we use, period. It would just be wrong to borrow a letter from another alphabet, even one in the same script, just because someone's name includes it in their language. I have a friend from a neighboring country who changed one of his Cyrillic letters when he came to my country. I would do the same if I went to his country and they didn't have a letter we have.
And special features to mark Cyrillic or other characters that I'd consider dangerous.
The approach there should be what wongarsu describes below (imo): style the UI so official accounts are visually distinct (badges, colour, etc.) rather than policing the character set.
namespace-guard is deliberately opinionated for the slug/handle case where you've already decided the output should be ASCII-safe. If your use case is broader than that, confusables detection without rejection is the right call.
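For the detection-without-rejection side, a rough Python sketch. The first-word-of-the-character-name trick is only a heuristic of my own; a proper implementation should use the Script property from UTS #39 / Scripts.txt:

```python
import unicodedata

def rough_scripts(name: str) -> set:
    # Rough heuristic: take the first word of each letter's Unicode name
    # ("LATIN", "CYRILLIC", "GREEK", ...) as a stand-in for its script.
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in name if ch.isalpha()}

def mixed_script_warning(handle: str):
    # Flag for review / UI badging instead of rejecting outright.
    scripts = rough_scripts(handle)
    if len(scripts) > 1:
        return f"{handle!r} mixes scripts {sorted(scripts)}; possibly confusable"
    return None

print(mixed_script_warning("jose"))    # None
print(mixed_script_warning("Борис"))   # None: a single non-Latin script is fine
print(mixed_script_warning("pаypal"))  # warning: Latin plus a Cyrillic "а"
```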
- Canonical (NF)
- Compatible (NFK)
- Composed vs decomposed
- Confusable (confusables.txt)
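The first three of those are easy to see side by side with Python's unicodedata; confusables are separate data, as noted in the comments:

```python
import unicodedata

s = "ﬁancé"  # LATIN SMALL LIGATURE FI + "anc" + precomposed é (U+00E9)

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, [hex(ord(c)) for c in unicodedata.normalize(form, s)])

# NFC/NFD are the canonical forms: only é switches between composed (U+00E9)
# and decomposed (U+0065 U+0301); the ﬁ ligature survives.
# NFKC/NFKD are the compatibility forms: the ligature additionally becomes "fi".
# Confusables are a separate mapping (confusables.txt from UTS #39) and are
# not part of any normalization form; the stdlib doesn't ship that table.
```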
Does Unicode not define something like "fuzzy" equivalence? Like "confusable" but broader, for search bar logic? The most obvious differences would be case and diacritic insensitivity (e, é). Case is easy since any string/regex API supports case insensitivity, but diacritic insensitivity is not nearly as common, and there are other categories of fuzzy equivalence too (e.g. ø, o).
I guess it makes sense for Unicode to not be interested in defining something like this, since it relates neither to true semantics nor security, but it's an incredibly common pattern, and if they offered some standard, I imagine more APIs would implement it.
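There's no single Unicode "fuzzy" form, but the usual home-grown approximation is case-folding plus stripping combining marks; a sketch (proper search engines instead use collation, e.g. a UCA implementation like PyICU):

```python
import unicodedata

def fuzzy_key(s: str) -> str:
    # Case-fold, decompose, and drop combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(fuzzy_key("José") == fuzzy_key("jose"))    # True
print(fuzzy_key("naïve") == fuzzy_key("naive"))  # True
print(fuzzy_key("Bjørn") == fuzzy_key("bjorn"))  # False: ø has no decomposition,
                                                 # so it needs an extra mapping table
```

Which illustrates the point: the ø/o category isn't covered by any normalization form at all.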
"...the default confusables list is extremely buggy. It needs at least 7 manual exceptions for the ASCII range, 12 exceptions for Greek, and I didn’t check any others scripts. python and clang-tidy were very unsuccessful with this approach, compared to java, rust and cperl with the mixed-script approach." https://rurban.github.io/libu8ident/#confusables
In detail: https://rurban.github.io/libu8ident/doc/D2528R1.html, section 10 ("TR39 Mixed Scripts").
That’s not the case.