I have always wondered to what extent they identify stuff like this, and other potential trickery with UTF-8 or removing text layers from PDF files.
Just like VW's defeat device, the cheater would then need to create software that outputs something different when it detects that it is being rendered to an image file rather than to a display...
Given the whole "fake news" thing over the past couple of years, I expect that the first step will be taken by one of Google/Twitter/Facebook/etc, but I hope that they (or someone else) release a library (or worst case, an online API) that allows this sort of normalization for security verification. I get that having it open would make it easier for people to find loopholes by brute-force testing, but these sorts of loopholes could also be patched rather quickly as they came up, providing benefit to everyone (especially from a security perspective).
EDIT: Perhaps this could start out as a series of matches generated using ML classification? I don't know much about ML. Can anyone who does say whether this is a realistic starting point?
https://en.wikipedia.org/wiki/Unicode_equivalence https://www.casaba.com/products/UCAPI/
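To make the normalization idea a bit more concrete, here is a minimal Python sketch using only the standard library's unicodedata module. It suggests why Unicode equivalence on its own is probably not enough: NFKC folds compatibility forms (fullwidth letters, ligatures), but a Cyrillic look-alike is a distinct letter with no decomposition, so it passes through unchanged. The sample strings are made up for illustration.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# Illustrative inputs (assumptions, not taken from any real site):
fullwidth = "\uff26\uff41\uff4b\uff45"  # "Fake" in fullwidth Latin letters
ligature = "\ufb01le"                   # "file" written with the fi ligature U+FB01
cyrillic = "F\u0430ke"                  # "Fake" with Cyrillic U+0430 in place of "a"

for label, sample, target in [
    ("fullwidth", fullwidth, "Fake"),
    ("ligature", ligature, "file"),
    ("cyrillic", cyrillic, "Fake"),
]:
    # Report whether NFKC alone recovers the plain ASCII spelling.
    print(label, nfkc(sample) == target)
```

The first two cases normalize to plain ASCII; the Cyrillic one does not, which is why a separate table of look-alike characters would be needed on top of normalization.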
So the headline is a little misleading... It's just that there are a growing number of websites that simply plagiarize content to get views / ad revenue. Because their titles are obfuscated to prevent detection of the plagiarism, they have to target specific niche groups to drive views. So it's not some weird "fake Native American" scheme / scam / ploy... it's just that this site in particular seems to focus on "Native American topics".
So it's not "Fake Native Americans Are Using Russian Characters to Avoid Plagiarism Detectors", it's "Fake News Sites Plagiarize Articles by Using Cyrillic Character Replacement to Avoid Detection", subtitle: "One such site targets Native Americans!"
Would you write: "Site uses American characters to publish fake news." ?
Let me speculate on why they might be doing it. Google will de-rank pages that have content that's identical to others (e.g. identical paragraphs of text). Maybe Facebook is doing something similar?
Let's say one of your friends shares an article from BuzzFeed.com, then another friend shares an exact clone of this article from FakeBuzzFeed.com. Now, Facebook might not want to show two articles with the same title from two different websites on your timeline. And considering that BuzzFeed.com is a website with a higher ranking than FakeBuzzFeed.com, it will probably choose to display only the first one. If you do the Cyrillic trick to the article on FakeBuzzFeed.com, Facebook will think it's something completely different and present it to you, thus giving the clone a higher reach.
The same applies to the advertising part: if you're constantly submitting page ads with exactly the same titles as the ones that real users are sharing, it might get you banned.
This makes their titles harder to detect for Facebook, fact-checking websites, and DMCA/copyright bots. Nothing related to Russia here.
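A toy Python sketch of the speculation above. It assumes, purely for illustration, that a feed deduplicates shared articles by an exact hash of the title; nobody outside Facebook knows what they really do, and fingerprint and dedupe are hypothetical names.

```python
import hashlib

def fingerprint(title: str) -> str:
    """Naive exact-match key: hash of the case-folded, whitespace-collapsed title."""
    normalized = " ".join(title.casefold().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(shared: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first (site, title) pair seen for each fingerprint."""
    seen: set[str] = set()
    kept = []
    for site, title in shared:
        key = fingerprint(title)
        if key not in seen:
            seen.add(key)
            kept.append((site, title))
    return kept

title = "Ten Facts You Did Not Know"
# Swap Latin "a" and "o" for the Cyrillic look-alikes U+0430 and U+043E.
spoofed = title.replace("a", "\u0430").replace("o", "\u043e")

feed = [
    ("BuzzFeed.com", title),
    ("FakeBuzzFeed.com", title),    # exact clone: same fingerprint, dropped
    ("FakeBuzzFeed.com", spoofed),  # looks identical, different code points
]
for site, shown in dedupe(feed):
    print(site, shown)
```

The exact clone is filtered, but the spoofed title hashes differently and survives, even though a human reader would see the same headline twice.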
https://github.com/codebox/homoglyph
The list itself in sorted text format, each line a list of similar glyphs:
https://github.com/codebox/homoglyph/blob/master/raw_data/ch...
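A rough Python sketch of how a list in that shape could be used. It assumes each line is simply a run of look-alike characters with no separator and that lines starting with "#" are comments; the five groups below are a hand-written stand-in, not the real file contents, and build_map and skeleton are hypothetical helper names.

```python
# Hand-written stand-in for the list (assumption: one group per line,
# no separators, "#" starts a comment). Each line pairs a Latin letter
# with its Cyrillic look-alike.
SAMPLE_GROUPS = """\
# hypothetical excerpt
a\u0430
e\u0435
o\u043e
p\u0440
c\u0441
"""

def build_map(groups_text: str) -> dict[str, str]:
    """Map every character in a group to the group's first character."""
    mapping: dict[str, str] = {}
    for line in groups_text.splitlines():
        if not line or line.startswith("#"):
            continue
        canonical = line[0]
        for ch in line[1:]:
            # First group wins if a character appears in more than one.
            mapping.setdefault(ch, canonical)
    return mapping

def skeleton(text: str, mapping: dict[str, str]) -> str:
    """Replace each known look-alike with its canonical character."""
    return "".join(mapping.get(ch, ch) for ch in text)

homoglyphs = build_map(SAMPLE_GROUPS)
spoofed = "F\u0430k\u0435 N\u0435ws"  # Cyrillic U+0430 and U+0435 mixed in
print(skeleton(spoofed, homoglyphs) == skeleton("Fake News", homoglyphs))
```

Comparing skeletons rather than raw strings means two titles that look the same also compare the same, at the cost of false positives when a legitimate text in another script happens to collapse onto a Latin one.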
I put the site back online at http://nopy.progresso-ict.nl/ (the $10 of PayPal money was already given away years ago)
That, or some sort of fuzzy CV hashing, which is cool, but more intensive. That would also mitigate zero-width characters and invisible modifiers.
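For the zero-width and invisible characters specifically, a cheaper text-only pass might be enough before reaching for image hashing. A minimal Python sketch, assuming it is acceptable to drop every character in the Unicode "Cf" (format) category; that is a blunt rule, since zero-width joiners are legitimately needed in some scripts and in emoji sequences. strip_invisible is a hypothetical name.

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), e.g. zero-width
    space U+200B, zero-width joiner U+200D, soft hyphen U+00AD."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# "Fake News" with a zero-width space, a zero-width joiner and a soft
# hyphen hidden inside it (illustrative input).
padded = "Fa\u200bke\u200d N\u00adews"
print(len(padded), len(strip_invisible(padded)))
print(strip_invisible(padded) == "Fake News")
```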
Another version of this trick has been popular in Russia with corrupt government workers. By law, a lot of government purchase/service contracts must go through public calls for bids, usually posted on a website that you can search. However, if you write what you need with some of the Cyrillic characters replaced by Latin ones, an honest supplier looking for a government contract will never find your entry. A corrupt one that you have an arrangement with will, and will be the sole bidder on the contract, at a price agreed beforehand (which of course includes a juicy cut for the corrupt government official). Nobody is the wiser and all requirements of the law are fulfilled; who could be blamed that there's only one bidder?
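Listings doctored this way could in principle be flagged automatically, because a single word that mixes Cyrillic and Latin letters is rare in honest text. A small Python sketch of that check follows. Python's standard library has no script property, so this leans on the first word of each character's Unicode name as a heuristic; scripts_in and mixed_script_words are hypothetical names, and the sample listing is invented.

```python
import unicodedata

def scripts_in(word: str) -> set[str]:
    """Approximate the scripts used by a word's letters, taking the first
    word of each Unicode character name (e.g. LATIN, CYRILLIC)."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def mixed_script_words(text: str) -> list[str]:
    """Return the words that contain both Latin and Cyrillic letters."""
    return [w for w in text.split() if {"LATIN", "CYRILLIC"} <= scripts_in(w)]

# Invented listing: the first word is Russian for "supply" with its second
# letter swapped for a Latin "o"; the second word is untouched Cyrillic.
listing = (
    "\u043fo\u0441\u0442\u0430\u0432\u043a\u0430 "
    "\u0431\u0443\u043c\u0430\u0433\u0438"
)
for word in mixed_script_words(listing):
    print(word, sorted(scripts_in(word)))
```

A search index could either reject such entries or index them under the single-script spelling, so that an honest supplier's query still finds them.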