What if I published a book, it was copy-pasted in blogs, and then later I put it somewhere crawlable by Google? You certainly can't just say "first time we saw it, that's the proper owner". It would either require a massive amount of manual QA to get right (and even then, there are going to be interminable copyright battles), or have a super high error rate.
I think Google's best value is letting proper content owners easily find violators via normal searches, and letting them deal with them via takedown notices or the court system -- which is where it should be done, not in a pseudo-court run by a Google that does not want that responsibility.
I suspect that Google simply doesn't care. They get ad revenue regardless, and in their laissez-faire editorial position it doesn't matter. What are you going to do, use another search engine?
Duplicate removal is essential for making a web search engine that works. For instance, together with a CS research group, I built a search engine for a major university library that had more than 80 web sites. We found huge amounts of duplicate content produced by various mechanisms (for instance, multiple people posted the same stuff to the web). If your ranking is content-based, all of the duplicate documents are going to rank the same and form a "plug" that excludes other documents.
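To make the "plug" problem concrete, here is a minimal sketch of near-duplicate collapsing using word shingles and Jaccard similarity. This is not what we actually ran on the library sites; the function names, the shingle size of 3, and the 0.8 threshold are all assumptions picked for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def collapse_duplicates(docs, threshold=0.8):
    """Keep only the first document of each near-duplicate cluster.

    docs is a list of (doc_id, text) pairs, assumed to be in ranking order.
    The threshold is an illustrative guess, not a tuned value.
    """
    kept = []  # (doc_id, shingle set) for every document we decided to keep
    for doc_id, text in docs:
        s = shingles(text)
        if all(jaccard(s, kept_s) < threshold for _, kept_s in kept):
            kept.append((doc_id, s))
    return [doc_id for doc_id, _ in kept]

# Hypothetical result list: two copies of the same page plus one unrelated page.
results = [
    ("site1/hours", "Library hours are nine to five on weekdays and closed on Sundays"),
    ("site2/hours", "Library hours are nine to five on weekdays and closed on Sundays."),
    ("site3/loans", "Interlibrary loan requests take about two weeks to arrive"),
]
print(collapse_duplicates(results))
```

The pairwise comparison is quadratic, so at real scale you would swap in something like MinHash or SimHash, but the effect on the result list is the same: one slot per piece of content.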
It has long (post 2006) been a common story that "I wrote a blog post but somebody else ranks for it." For instance, I made a blog post that got a huge amount of traffic back in the day, but if you search for it right now you find a presentation from some fresher at Oracle that is based on those ideas.
There are many factors that make this hard to control: (1) for one "real" origin there are probably ten or a hundred fakes, so if you are picking at random you strike out; you have to outrank not just one fake but all of them, (2) freshness: copies are fresher than the original, and they can also be updated years later, (3) the bad guys think a lot more seriously about indexation, PageRank, and other variables they control than most content creators do.
The behavior does seem weird in any case, like there is a certain slot for a given piece of content, and Google is swapping different domains in and out to fill that slot. It seems like Google is actually trying to identify the original content, failing, and then actually inadvertently penalizing the original producer.
Also, the combination of the PageRank algorithm and normal user behavior typically helps Google to understand who was first and who deserves to rank higher. That is, most people don't plagiarize content; they quote it and then cite the source, which (thanks to PageRank) tends to rank the original better than sites which have plagiarized it.
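A toy illustration of that effect, using the textbook power-iteration form of PageRank on a made-up five-page graph. The page names, the damping factor of 0.85, and the iteration count are assumptions for the example; this says nothing about how Google's production ranking works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling pages (no outlinks) is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: three bloggers quote the article and cite the original;
# nobody links to the plagiarized copy.
links = {
    "blog_a": ["original"],
    "blog_b": ["original"],
    "blog_c": ["original"],
    "original": [],
    "copy": [],
}
for page, score in sorted(pagerank(links).items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The original ends up on top only because the citations point at it. If the copy attracts the links instead, the same arithmetic works against the author.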
Which is how Google makes blogspam such a good business to be in, even if your content is inferior to the post you used for "research".
But Google certainly isn't intending to make blogspam a good business to be in, and I'd argue that they aren't; over the past four years Demand Media's stock price has fallen from $400/share to $4, and the general marketplace for commoditized SEO services has shrunk by a similar degree over the same period. 19 out of every 20 SEOs who were active five years ago have thrown in the towel... just check the Alexa graphs for the top SEO forums.
The SERPs are clean these days. Google has done an amazing job every year for at least thirteen years now of improving them constantly. The new wave of spam is social. In practice this means Buzzfeed writers stealing user-produced content from AskReddit threads and it ending up polluting my Facebook feed to the point that I can't even find any good counterfeit Raybans.
That explains why I've noticed some older sites which are still around, and have plenty of detailed technical information, seem to have disappeared from the search results. Somewhat sad that the "newer is better" mentality appears to have taken over completely... if I really wanted the newest things I'd look at Google News.
If Google wants to be the best search engine possible, returning the original result for an article relevant to the user's query is a better result than returning some second-hand copy littered with low-quality ad junk. And if that's not Google's job, then let me know whose job it is and I'll start using them instead.
Saying that Google is "selling" stolen content isn't that clear, though. Yes, they're selling ads on search results, but wouldn't they get the same ad revenue regardless of where those links pointed?
It's easier to make the case with AdSense, where Google literally profits directly from stolen content.
Consider a novelist who works for 10 years on her novel. A hacker steals the document from her computer and publishes it online under his own name. He makes $100M.
Is it wrong for the novelist to feel like someone stole from her? What word would you use instead?
Saying that someone is "stealing" when they infringe copyright is like saying someone is "killing you" when they present convincing arguments against your cause. It isn't literally stealing or killing, it's an exaggeration made for emphasis.
The reason there is so much contention is that a) the same language has been extremely common among hysterical content industry lobbyists who insist that it is literally stealing, and b) stealing and copyright infringement are both unlawful (and therefore more easily confused) even though there remains a meaningful distinction between stealing and copying.
But that distinction is very important in practice because we can't treat stealing and infringement the same. If you don't like someone's speech, you can't be allowed to steal any of their webservers, but you have to be allowed to copy some of their work in order to effectively criticize them.
You are very wrong. For many years, nearly 100% of Google's revenue was from AdSense.
Suppose someone spends days writing an article and posts it on his blog. Then someone else copies and pastes it onto BuzzFeed, which becomes the top search result for that topic.
BuzzFeed is making money that the same blogger would have made from his own content. Now, also assume Google serves ads to BuzzFeed, but it does not serve ads to the blogger. Google has a financial interest in ignoring the provenance of the content in this case.
Is all of that ethically acceptable?
That's incredibly untrue. A substantial portion of Google's revenue has always been and continues to be from first-party AdWords ads.
The fact that you're using BuzzFeed as an example, a firm which emphatically does not use display ads, shows how little you know about this.
Google has always gotten a very large majority of its advertising revenue from ads on its own sites, not on third party sites.
I think plagiarized is more accurate.
Plagiarism comes from a word meaning "kidnapping" though, so the tone of both words is pretty similar.
[1] https://news.ycombinator.com/item?id=10103545
[2] https://pubsubhubbub.appspot.com/
For anyone interested in copyright and legal issues, I'd recommend checking out techdirt.com. They have a great starter section at https://www.techdirt.com/blog/?tag=techdirt+feature, and they cover legal, copyright, patent, surveillance and all sorts of related topics. High quality journalism.