Google doesn't recognise or penalise stolen content

(pi-datametrics.com)

98 points · ollieglass · 10y ago · 77 comments

45 comments · 13 top-level

bpodgursky · 10y ago · 10 in thread

I think this is pretty fair on Google's part. How could you possibly figure out who owned content?

What if I published a book, it was copy-pasted in blogs, and then later I put it somewhere crawlable by Google? You certainly can't just say "first time we saw it, that's the proper owner". It would either require a massive amount of manual QA to get right (and even then, there are going to be interminable copyright battles), or have a super high error rate.

I think Google's best value is letting proper content owners easily find violators via normal searches, and letting them deal with them via takedown notices or the court system -- which is where it should be done, not in a pseudo-court run by a Google that does not want that responsibility.

ChuckMcM · 10y ago

So back when Blekko was a consumer search engine, we could 100% figure out who owned content on sites we crawled often. And even when we didn't crawl a site often, we could guess correctly more often than not based on the domain registration dates (not to mention registry owners). That is because few people who rip off content rip off just one web site; they will rip off dozens of web sites, and they will all share the same AdSense IDs and the same domain registrar. This is easy stuff to spot when you crawl the web regularly.
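
As a rough illustration of the shared-ID heuristic described above (a sketch only, not Blekko's actual code), crawled sites can be grouped by ad publisher ID and registrar, and groups spanning many domains flagged for review. The record shape, the function names, and the `min_domains` threshold are all assumptions made for this example.

```python
from collections import defaultdict


def cluster_by_publisher_id(sites):
    """Group crawled domains by (publisher_id, registrar).

    `sites` is an iterable of (domain, publisher_id, registrar) tuples.
    This record shape is hypothetical, chosen only for illustration.
    """
    clusters = defaultdict(set)
    for domain, publisher_id, registrar in sites:
        clusters[(publisher_id, registrar)].add(domain)
    return dict(clusters)


def suspicious_clusters(sites, min_domains=3):
    """Return clusters where one publisher ID + registrar spans many domains.

    `min_domains` is an arbitrary cut-off, not a value from the discussion.
    """
    return {
        key: domains
        for key, domains in cluster_by_publisher_id(sites).items()
        if len(domains) >= min_domains
    }


# Made-up sample data.
sites = [
    ("original-blog.example", "pub-1111", "registrar-a"),
    ("copy-one.example", "pub-9999", "registrar-b"),
    ("copy-two.example", "pub-9999", "registrar-b"),
    ("copy-three.example", "pub-9999", "registrar-b"),
]
flagged = suspicious_clusters(sites)
```

A shared ID and registrar is only a signal: a legitimate publisher network would cluster the same way, so a flagged group would still need a content comparison before any conclusion is drawn.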

I suspect that Google simply doesn't care. They get Ad revenue regardless and in their laissez-faire editorial position it doesn't matter. What are you going to do, use another search engine?

sounds · 10y ago

Or, more likely, they can't get involved for legal reasons. If they took steps to block the easy stuff, an arms race would ensue, and the content providers would never be satisfied with the performance being provided for free by Google. The content providers would always demand stricter enforcement, and could threaten to sue for copyright infringement regardless of merit.

loceng · 10y ago

Do you know if any search engine is actively filtering for this?

alwaysdoit · 10y ago

Also all of this is assuming that any duplicated content is inherently stolen, when it could in fact be public domain, fair use, legitimately licensed, distributed under Creative Commons, etc.

blfr · 10y ago

True, but in these cases I would still want original/canonical/fastest/best source first while the others are probably only valuable as backups.

PaulHoule · 10y ago

People have been complaining about this problem for a LONG time.

Duplicate removal is essential for making a web search engine that works. For instance, together with a CS research group, I built a search engine for a major university library that had more than 80 web sites. We found huge amounts of duplicate content produced by various mechanisms (for instance, multiple people posted the same stuff to the web.) If your ranking is content-based, all of the duplicate documents are going to rank the same and form a "plug" that excludes other documents.

It has long (post 2006) been a common story that "I wrote a blog post but somebody else ranks for it." For instance, I made a blog post that got a huge amount of traffic in the day, but right now you search for it and you find a presentation from some fresher at Oracle that is based on those ideas.

There are many factors that make this hard to control, and these include: (1) for one "real" origin there are probably ten or a hundred fakes, so if you are picking at random you strike out -- you have to outrank not just one fake but all the fakes, (2) freshness... copies are fresher than the original, and they can also be updated years later, (3) the bad guys think a lot more seriously about indexation, PageRank, and other variables they control than do most content creators.

zaroth · 10y ago

Even if you don't care about trying to identify the original source of some piece of content, it seems like the content farm site which is plagiarizing is more likely to be a lower quality site than the original content producer.

The behavior does seem weird in any case, like there is a certain slot for a given piece of content, and Google is swapping different domains in and out to fill that slot. It seems like Google is actually trying to identify the original content, failing, and then actually inadvertently penalizing the original producer.

scholia · 10y ago

Well, Google has already indexed a new article x. When article y appears, and Google sees that y is an almost verbatim repeat of x, it shouldn't be that hard to figure out that article x is the original, should it? Especially if they both have time/date stamps....
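
A minimal sketch of that idea, under stated assumptions: word shingles plus Jaccard similarity stand in for whatever near-duplicate detection a real search engine uses, and the earlier `first_seen` timestamp is treated as the tie-breaker. The dict keys, the shingle size `k`, and the `threshold` are hypothetical choices for the example.

```python
from datetime import datetime


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def likely_original(doc_x, doc_y, threshold=0.8):
    """Return the earlier-seen document if the two are near-duplicates.

    Each doc is a dict with hypothetical keys 'url', 'text', and
    'first_seen' (a datetime). Returns None when the texts differ too much.
    """
    sim = jaccard(shingles(doc_x["text"]), shingles(doc_y["text"]))
    if sim < threshold:
        return None
    return min(doc_x, doc_y, key=lambda d: d["first_seen"])


# Made-up sample documents.
text = ("the quick brown fox jumps over the lazy dog "
        "near the quiet river bank today")
x = {"url": "x.example", "text": text,
     "first_seen": datetime(2015, 1, 1)}
y = {"url": "y.example", "text": text + " reposted",
     "first_seen": datetime(2015, 1, 3)}
winner = likely_original(x, y)
```

The weak point is the timestamp, as raised at the top of the thread: `first_seen` records when the crawler found a page, not when the text was written, so a copy that gets crawled first would win under this rule.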

est · 10y ago

One man's stolen content is another man's mirror. There are countless times when original content is region-blocked, behind a paywall, or expired, but accessible via "stolen" links.

elorant · 10y ago

While most of what you say is true, they could at the very least reject AdSense applicants based on how often they copy-paste content from established publishers. I’m sure they have the means to figure something like that out.

55555 · 10y ago · 4 in thread

I was huge into SEO for a few years. I try to stay out of it now, but it's worth noting that this is almost certainly due to the current algorithm's obsession with "freshness." The weaker site is ranking higher with the stolen content because their site was updated more recently. Steal some back and I bet they swap ranks again.

Also, the combination of the pagerank algorithm and normal user behavior typically helps Google to understand who was first and who deserves to rank higher. That is, most people don't plagiarize content, they quote it and then cite the source, which (thanks to pagerank) tends to rank the original better than sites which have plagiarized it.

scholia · 10y ago

> the current algorithm's obsession with "freshness."

Which is how Google makes blogspam such a good business to be in, even if your content is inferior to the post you used for "research".

55555 · 10y ago

Most of the spam I see in the wild these days is indeed (established) dropped domains which were picked up and then loaded with thousands of pages of "fresh" spun content, with an incestuous backlink profile if any. So indeed 'blogspam'. Everything old is new again; it feels just like twelve years ago. Soon people will be keyword stuffing in a font the same color as the background...

But Google certainly isn't intending to make blogspam a good business to be in, and I'd argue that they aren't; over the past four years Demand Media's stock price has fallen from $400/share to $4, and the general marketplace for commoditized SEO services has shrunk by a similar degree over the same period. 19 out of every 20 SEOs who were active five years ago have thrown in the towel... just check Alexa graphs for the top SEO forums.

The SERPs are clean these days. Google has done an amazing job every year for at least thirteen years now of improving them constantly. The new wave of spam is social. In practice this means Buzzfeed writers stealing user-produced content from AskReddit threads and it ending up polluting my Facebook feed to the point that I can't even find any good counterfeit Raybans.

userbinator · 10y ago

> the current algorithm's obsession with "freshness."

That explains why I've noticed some older sites which are still around, and have plenty of detailed technical information, seem to have disappeared from the search results. Somewhat sad that the "newer is better" mentality appears to have taken over completely... if I really wanted the newest things I'd look at Google News.

rspeer · 10y ago

I guess the problem is, there are some technical fields where old means useless. If I'm googling for JavaScript libraries, hardware recommendations, or a fix to a package conflict in Ubuntu, I don't want something from 2010.

tomschlick · 10y ago · 4 in thread

And they shouldn't. That's not their job.

JoshTriplett · 10y ago

Their job is, however, to direct people to the most relevant pages.

cwyers · 10y ago

What is their job? I thought it was as a search engine. So, let me ask -- when people steal blog content, what are their motives for doing so? Is it to deliver that content to you, the reader? Or is it to get Google hits? How well are they preserving links, illustrations, reader comments (which are a disaster a lot of places but not all of them), an archive of other work by the same author that may be of interest? How often are they slipping undesirable things (ads that lead to sites that offer malware, for instance) alongside the content they're stealing?

If Google wants to be the best search engine possible, returning the original result for an article relevant to the user's query is a better result than returning some second-hand copy littered with low-quality ad junk. And if that's not Google's job, then let me know whose job it is and I'll start using them instead.

scriptproof · 10y ago

Just as it is not the job of the street vendor to know where these Rolexes come from.

smt88 · 10y ago

I don't know if you're being sarcastic, but it is illegal to sell stolen or fake merchandise in the United States. Anyone selling fake Rolexes is committing a crime and could also be sued.

Saying that Google is "selling" stolen content isn't that clear, though. Yes, they're selling ads on search results, but wouldn't they get the same ad revenue regardless of where those links pointed?

It's easier to make the case with AdSense, where Google literally profits directly from stolen content.

sismoc · 10y ago · 4 in thread

There is no such thing as "Stolen" content.

smt88 · 10y ago

If that's true, there's also no such thing as "stealing" at all.

Consider a novelist who works for 10 years on her novel. A hacker steals the document from her computer and publishes it online under his own name. He makes $100M.

Is it wrong for the novelist to feel like someone stole from her? What word would you use instead?

AnthonyMouse · 10y ago

How are people still arguing about this?

Saying that someone is "stealing" when they infringe copyright is like saying someone is "killing you" when they present convincing arguments against your cause. It isn't literally stealing or killing, it's an exaggeration made for emphasis.

The reason there is so much contention is that a) the same language has been extremely common among hysterical content industry lobbyists who insist that it is literally stealing, and b) stealing and copyright infringement are both unlawful (and therefore more easily confused) even though there remains a meaningful distinction between stealing and copying.

But that distinction is very important in practice because we can't treat stealing and infringement the same. If you don't like someone's speech you can't be allowed to steal any of their webservers but you have to be allowed to copy some of their work in order to effectively criticize them.

jsizz · 10y ago

> What word would you use instead?

Infringing. (duh)

anonyfox · 10y ago

The book has not yet been published, so this is stealing. But once you put something up on the internet, it is officially available to everyone. Doing stuff with public information is fine IMO. Same as analyzing tweet data (tweets are public).

hlmencken · 10y ago · 3 in thread

This is not stealing, and even if it is illegal that is a bad way to put it. Also, google's service is primarily to the searcher so this isn't a huge issue for them.

smt88 · 10y ago

> google's service is primarily to the searcher

You are very wrong. For many years, nearly 100% of Google's revenue was from AdSense.

Suppose someone spends days writing an article and posts it on his blog. Then someone else copies and pastes it onto BuzzFeed, which becomes the top search result for that topic.

BuzzFeed is making money that the same blogger would have made from his own content. Now, also assume Google serves ads to BuzzFeed, but it does not serve ads to the blogger. Google has a financial interest in ignoring the provenance of the content in this case.

Is all of that ethically acceptable?

morgante · 10y ago

> You are very wrong. For many years, nearly 100% of Google's revenue was from AdSense.

That's incredibly untrue. A substantial portion of Google's revenue has always been and continues to be from first-party AdWords ads.

The fact that you're using BuzzFeed as an example, a firm which emphatically does not use display ads, shows how little you know about this.

magicalist · 10y ago

> For many years, nearly 100% of Google's revenue was from AdSense

Google has always gotten a very large majority of its advertising revenue from ads on its own sites, not on third party sites.

DarkLinkXXXX · 10y ago · 2 in thread

This may be pedantic, but is stolen the right word to use?

I think plagiarized is more accurate.

smt88 · 10y ago

Plagiarism can also apply to copying content without using the exact same wording. In my mind, "stolen" means copying verbatim.

Plagiarism comes from a word meaning "kidnapping" though, so the tone of both words is pretty similar.

johansch · 10y ago

Stolen implies that the original owner no longer has access to the data due to the actions of the perpetrator.

6stringmerc · 10y ago · 2 in thread

Is this any better/worse than Facebook actively trying to profit and win over users when people or organizations copy / upload / soak up views for material they did not create and don't have the rights to use? Because that's a hot-point of discussion in some creative circles as well.

Houshalter · 10y ago

Facebook's freebooting is pretty terrible. But this can destroy entire websites. The title is misleading. Google isn't just not punishing thieves, it's heavily punishing the originals. The original site dropped from the 20th result to 100+ because someone stole its content.

6stringmerc · 10y ago

Yikes! That is much worse, at least based on your note. Do you think this is an area where the EFF could litigate on behalf of the original creators in a fraud context? Just curious, and also grateful to not be dealing with such a horrible prospect.

nkozyra · 10y ago · 1 in thread

Recognizing "stolen" content autonomously can only pivot on knowing when something was first published or visible to Google, which is a pretty dubious measurement.

matt_morgan · 10y ago

But the weird thing is the stolen content, even when it's on a crappier site with no shares etc., knocking the original out of its slot. I.e., something (probably freshness) is causing the stolen content on the crappy site to be higher-ranked than the original content on the strong site. That seems avoidable (and in Google's best interest).

randyrand · 10y ago · 1 in thread

The title should more accurately say Google Search. Many parts of Google, for instance YouTube, definitely do penalize stolen content.

walshemj · 10y ago

I have had a site that had UGC spam promoting dodgy TV streams get hit with a penalty.

jeremy7600 · 10y ago · 1 in thread

Advertisement post?

mikkom · 10y ago

Yes, but a very interesting one.

Animats · 10y ago

This came up before on YC.[1] Google does have a system to detect provenance, but you have to report your changes to Google as an RSS feed.[2] Google hasn't updated that page since 2010, and it may no longer do anything.

[1] https://news.ycombinator.com/item?id=10103545

[2] https://pubsubhubbub.appspot.com/

Daneel_ · 10y ago

What a rubbish article. It doesn't fly for a second under copyright law. Google is entirely within their rights doing what they're doing. The onus isn't on Google to detect the infringing content.

For anyone interested in copyright and legal issues, I'd recommend checking out techdirt.com. They have a great starter section at https://www.techdirt.com/blog/?tag=techdirt+feature, and they cover legal, copyright, patent, surveillance and all sorts of related topics. High quality journalism.

jakeogh · 10y ago

It's downright tragic to ask for more rules.
