Plagiarized news sites are using Cyrillic characters to avoid detection

(hoax-alert.leadstories.com)

92 points · mschenk · 8y ago · 29 comments

filleokus · 8y ago

In Sweden (and probably other places), a service called URKUND[0] ("deed" in Swedish) is used for automatic detection of plagiarism for school work.

I have always wondered to what extent they identify stuff like this, and other potential trickery with UTF-8 or removing text layers from PDF files.

[0]: http://www.urkund.com/en/

calineczka · 8y ago

Students in Poland used invisible whitespace characters to avoid detection.
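For anyone wondering what a counter to this might look like: a minimal Python sketch, assuming the "invisible" characters are Unicode format characters (category Cf, which covers the zero-width space, zero-width joiners and the BOM). The helper name `strip_invisible` is made up for illustration, and this is not how any particular plagiarism checker is known to work.

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf) before comparing texts.

    Assumption: the padding uses characters such as U+200B, U+200C or U+FEFF.
    Cf also contains characters with legitimate uses (soft hyphen, joiners in
    emoji and some scripts), so only use the result as a comparison key.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

original = "plagiarism"
padded = "pla\u200bgia\u200crism\ufeff"  # looks identical when rendered

print(padded == original)                   # False
print(strip_invisible(padded) == original)  # True
```

Visible-width tricks (for example, white text on a white background) would need a different check entirely.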

netsharc · 8y ago

A neat trick would be to render the PDF into an image file, and then OCR that image file, and do the detection using the text file generated by the OCR process.

Just like VW's defeat device, the cheater would then need to create software that outputs something different when it detects that it is being rendered not to a display, but to an image file...

pbhjpbhj · 8y ago

So, to avoid detection, one needs to use a font that OCR finds hard to read. Would rubbish keming do the trick? You could abide by the imposed font requirements but change the kerning/leading.

taneq · 8y ago

Would it not be easier to just have a lookup table which maps Unicode code points to visually equivalent glyphs?
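A rough sketch of such a lookup table in Python, assuming a hand-picked handful of Cyrillic letters; a real table would need to be far larger, and the names `CYRILLIC_TO_LATIN` and `fold_lookalikes` are invented for this example.

```python
# Illustrative only: a few Cyrillic letters and the Latin letters they resemble.
CYRILLIC_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}
TABLE = str.maketrans(CYRILLIC_TO_LATIN)

def fold_lookalikes(text: str) -> str:
    """Map known look-alikes to Latin so disguised titles compare equal."""
    return text.translate(TABLE)

disguised = "Br\u0435\u0430king n\u0435ws"  # renders like "Breaking news"
print(disguised == "Breaking news")                   # False
print(fold_lookalikes(disguised) == "Breaking news")  # True
```

The folded string is only useful as a comparison key: applying it to genuine Cyrillic text would mangle it.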

haneefmubarak · 8y ago

I think at some point, a sort of visual-normalization that converts similar looking unicode to a single unique string sequence (ex: convert certain letters from Cyrillic and other language sets that are also present in Latin to just Latin) is just going to be necessary as a security precaution.

Given the whole "fake news" thing over the past couple of years, I expect that the first step will be taken by one of Google/Twitter/Facebook/etc, but I hope that they (or someone else) release a library (or worst case, an online API) that allows this sort of normalization for security verification. I get that having it open would make it easier for people to find loopholes by brute-force testing, but these sorts of loopholes could also be patched rather quickly as they came up, providing benefit to everyone (especially from a security perspective).

EDIT: Perhaps this could start out as a series of matches generated using ML classification? I don't know much about ML - does anyone who does think this is a realistic starting point?

sedev · 8y ago

Unicode normalization/equivalence is half of what you want and UCAPI is probably the other half.

https://en.wikipedia.org/wiki/Unicode_equivalence
https://www.casaba.com/products/UCAPI/
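To illustrate why normalization is only half of it, here is a small Python check using the standard `unicodedata` module. NFKC folds compatibility characters such as fullwidth letters, but Cyrillic letters are distinct characters with no decomposition to Latin, so they pass through unchanged. The sample strings are my own.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

fullwidth = "\uff47\uff4f\uff4f\uff47\uff4c\uff45"  # fullwidth "google"
cyrillic_mix = "g\u043e\u043egl\u0435"              # Cyrillic o, o, e inside Latin

print(nfkc(fullwidth) == "google")     # True: compatibility forms are folded
print(nfkc(cyrillic_mix) == "google")  # False: look-alikes are left as they are
```

Catching the second case needs a separate confusables mapping on top of normalization.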

haneefmubarak · 8y ago

Those are interesting, especially the second one. I've read a little here and there about Unicode normalization before, but UCAPI does look like what I really wanted. However, seeing as UCAPI isn't free and doesn't even list pricing plans, I get the strong feeling that this will not see much pickup (at least until someone makes a free one).

mfoy_ · 8y ago

>The site is part of a growing list of fake Native American pages run out of places like Macedonia, Kosovo or Vietnam.

So the headline is a little misleading... It's just that there are a growing number of websites that simply plagiarize content to get views / ad revenue. Because their titles are obfuscated to prevent detection of the plagiarism, they have to target specific niche groups to drive views. So it's not some weird "fake Native American" scheme / scam / ploy... it's just that this site in particular seems to focus on "Native American topics".

So it's not "Fake Native Americans Are Using Russian Characters to Avoid Plagiarism Detectors", it's "Fake News Sites Plagiarize Articles by Using Cyrillic Character Replacement to Avoid Detection", subtitle: "One such site targets Native Americans!"

dmix · 8y ago

This isn't even fake news. It's just blogspam for advertising dollars. Nothing fake about it... the articles they copy are real content.

This is just jumping on trendy words like Russians/bots/fake news for click bait.

coldcode · 8y ago

Can we get the title updated?

kozak · 8y ago

Let me nitpick a bit: these characters are not Russian, they are Cyrillic. There are some Cyrillic characters that are distinctly Russian (i.e. used only in the Russian language), but these characters can't impersonate Latin letters because they are too different from them.

https://en.wikipedia.org/wiki/Cyrillic_script

moufestaphio · 8y ago

Yeah that bothered me as well.

Would you write: "Site uses American characters to publish fake news." ?

dwighttk · 8y ago

you gotta loop the Russians in to drive page views

iluxonchik · 8y ago

I doubt websites hosted in Eastern Europe care about copyright legal threats. Even if they contact the hosting provider directly I doubt any action will be taken. Eastern Europe has plenty of cheap, shady hosting providers where you can host pretty much anything that you want. Unless the website is making a lot of money, nobody is going to spend significant resources to take those websites down.

Let me speculate on why they might be doing it. Google will de-rank pages that have content that's identical to others (e.g. identical paragraphs of text). Maybe Facebook is doing something similar?

Let's say one of your friends shares an article from BuzzFeed.com, then another friend shares an exact clone of this article from FakeBuzzFeed.com. Now, Facebook might not want to show two articles with the same title from two different websites on your timeline. And considering that BuzzFeed.com is a website with a higher ranking than FakeBuzzFeed.com, it will probably choose to display only the first one. If you do the Cyrillic trick to the article on FakeBuzzFeed.com, Facebook will think it's something completely different and present it to you, thus getting you a higher reach.

The same applies to the advertising part: if you're constantly submitting page ads with exactly the same titles as the ones that real users are sharing, it might get you banned.

RustGirl · 8y ago

I think you're right. Also news stories tend to have a currency. By the time a copyright owner complains, the news is old news. They don't want some automatic process to mark their content as "spammy" and not feature their links or content if someone pushes it.

Cynddl · 8y ago

A bit of extrapolation here. In short, a few websites dedicated to making easy money on Facebook by copying articles have started to use Unicode to obfuscate the title. They automatically replace Latin characters with similar-looking letters.

This makes their titles harder for Facebook, fact-checking websites, or DMCA/copyright bots to detect. Nothing related to Russia here.

mfoy_ · 8y ago

Or Native Americans.

goptimize · 8y ago

Maarten Schenk (resident expert on fake news) creates click-bait titles

dmix · 8y ago

It's not even fake news, and even plagiarism is a weak description. It's blogspam / content farming for cheap advertising dollars by Eastern Europeans. This has been going on for ages; much like SEO gaming, it's just using another hacky trick, and using Cyrillic characters is hardly a new trick either.

AndrewNCarr · 8y ago

Here is a project that maintains a list of homoglyphs and has some Java and JavaScript code for detecting them.

https://github.com/codebox/homoglyph

The list itself in sorted text format, each line a list of similar glyphs:

https://github.com/codebox/homoglyph/blob/master/raw_data/ch...
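A hedged Python sketch of how a list in that shape could be turned into a canonicalizer. I am assuming, without having checked the actual file, that each line is simply a run of look-alike characters and that the first character on the line can serve as the canonical one; the `SAMPLE` lines below are made up and are not taken from the repository.

```python
def build_canonical_map(lines):
    """Map every character on a line to the first character of that line."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        # Assumption for this sample: '#' starts a comment line.
        if not line or line.startswith("#"):
            continue
        canonical = line[0]
        for ch in line[1:]:
            mapping.setdefault(ch, canonical)  # keep the first assignment
    return mapping

def canonicalize(text, mapping):
    """Replace each known look-alike with its canonical character."""
    return "".join(mapping.get(ch, ch) for ch in text)

# Hypothetical sample in the assumed format.
SAMPLE = [
    "# look-alike groups, canonical character first",
    "a\u0430\u0251",  # Latin a, Cyrillic a, Latin alpha
    "o\u043e\u03bf",  # Latin o, Cyrillic o, Greek omicron
    "e\u0435",        # Latin e, Cyrillic ie
]

mapping = build_canonical_map(SAMPLE)
print(canonicalize("g\u043e\u03bfgl\u0435", mapping))  # google
```

Two titles could then be compared by their canonicalized forms rather than by their raw bytes.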

frits1993 · 8y ago

This reminds me of a project back in 2014, where a school-mate and I created an "uncopyable" font using the same idea.

I put the site back online at http://nopy.progresso-ict.nl/ ($10 PayPal money has already been given away years ago)

beager · 8y ago

Should be easy enough for networks to detect and remove these, by identifying content where character ranges in words routinely fall outside the charsets of languages.

That, or some sort of fuzzy CV hashing, which is cool, but more intensive. That would also mitigate zero-width and invisible modifiers.
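The first idea can be approximated in a few lines of Python. This is a heuristic sketch, not a production detector: the standard library does not expose the Unicode Script property, so the script is guessed from the first word of each character's Unicode name, and the function names are invented here.

```python
import re
import unicodedata

def scripts_in(word: str) -> set:
    """Guess the scripts in a word from the first word of each letter's name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }

def mixed_script_words(text: str) -> list:
    """Return the words whose letters come from more than one script."""
    return [w for w in re.findall(r"\w+", text) if len(scripts_in(w)) > 1]

# "leaders" has a Cyrillic "е"; the last two words are entirely Cyrillic.
sample = "Tribal l\u0435aders meet \u0432 \u041c\u043e\u0441\u043a\u0432\u0435"
print(mixed_script_words(sample))  # only the disguised word is flagged
```

Words written entirely in one script are not flagged, so ordinary Russian or English text passes. Languages that legitimately mix scripts within a word (Japanese, for instance) would need an allowlist.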

smsm42 · 8y ago

This is an old trick, successfully used for a while in domain names (does gооglе.com look suspicious to you? what if it had a valid SSL certificate?) but hopefully all browsers and registrars have smarted up by now.
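A quick way to see the difference, sketched in Python with the built-in `idna` codec (which implements the older IDNA 2003 rules). The spoofed string is written with escapes so the Cyrillic letters are explicit; which letters a real attacker would pick is my assumption.

```python
# Latin g, g, l with Cyrillic o, o, e: renders like "google.com".
spoofed = "g\u043e\u043egl\u0435.com"

print(spoofed == "google.com")  # False: different code points
print(spoofed.isascii())        # False: a cheap first check

# The ASCII-compatible form that DNS actually sees exposes the trick:
ascii_form = spoofed.encode("idna")
print(ascii_form)               # a label with the "xn--" Punycode prefix
```

Showing the `xn--` form instead of the Unicode form for suspicious labels is one defence a browser could apply.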

Another version of this trick has been popular in Russia with corrupt government workers: by law, a lot of government purchase/service contracts should be subject to public calls for bids, usually placed on a website which you can search. However, if you write what you need while replacing some of the Cyrillic characters with Latin ones, an honest supplier that is looking for a government contract will never find your entry. A corrupt one that you have arranged things with beforehand will, and will be the sole bidder on this contract, with a price that you have agreed on in advance (which of course includes a juicy cut for the corrupt government official). Nobody is the wiser and all requirements of the law are fulfilled; who could be blamed that there's only one bidder?

BanzaiTokyo · 8y ago

Substitution of the characters o/a/e (which look alike in the Latin and Cyrillic alphabets) has been used for years to get past automatic plagiarism detectors.
