Great article -- we're excited there's so much interest in the web as a dataset! I'm part of the team at Common Crawl and thought I'd clarify some points in the article.
The most important is that you can download all the data that Common Crawl provides completely for free, without the need to pay S3 transfer fees or process it only in an EC2 cluster. You don't even need to have an Amazon account! Our crawl archive blog posts give full details for downloading[1]. The main challenge then is storing it, as the full dataset is really quite large, but a number of universities have pulled down a significant portion onto their local clusters.
Also, we're performing the crawl once a month now. The monthly crawl archives are between 35 and 70 terabytes compressed. As such, we've actually crawled and stored over a quarter petabyte compressed, or 1.3 petabytes uncompressed, so far in 2014. (The archives go back to 2008.)
Comparing directly against the Internet Archive datasets is a bit like comparing apples to oranges. They store images and other types of binary content as well, whilst Common Crawl aims primarily for HTML, which compresses better. Also, the numbers used for Internet Archive were for all of the crawls they've done, and in our case the numbers were for a single month's crawl.
We're excited to see Martin use one of our crawl archives in his work -- seeing these experiments come to life is the best part of working at Common Crawl! I can confirm that optimizations will help you lower that EC2 figure. We can process a fairly intensive MR job over a standard crawl archive in an afternoon for about $30. Big data on a small budget is a top priority for us!
[1]: http://blog.commoncrawl.org/2014/11/october-2014-crawl-archi...
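For anyone who wants to script the no-account download route, here is a minimal Python sketch. It assumes a crawl ships a gzipped listing of relative file paths, one per line, and that the files are served over plain HTTPS under some base URL. The base URL and the sample paths below are placeholders I made up, not values from the blog post, so swap in the real ones from there.

```python
import gzip

# Placeholder host (assumption): use the base URL given in the crawl archive blog post.
BASE_URL = "https://example.org/"


def file_urls(paths_gz: bytes, base_url: str = BASE_URL) -> list[str]:
    """Turn a gzipped path listing (one relative path per line) into full URLs."""
    text = gzip.decompress(paths_gz).decode("utf-8")
    base = base_url.rstrip("/") + "/"
    # Skip blank lines and avoid doubled slashes when joining.
    return [base + line.strip().lstrip("/") for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Made-up listing standing in for a real one fetched over HTTP.
    sample = gzip.compress(
        b"crawl/segment-1/file-00000.warc.gz\ncrawl/segment-1/file-00001.warc.gz\n"
    )
    for url in file_urls(sample):
        print(url)
```

Each resulting URL can then be fetched with any ordinary HTTP client, a few files at a time, so local storage stays manageable.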
I often wanted the ability to filter and group when staring at pages of results from the Red Hat JBoss support forums while trying to fix a dead JBoss cluster. But I quit that job, so I haven't had the need recently.
Edit: the point being, it would be nice if someone came up with a service that implemented the ideas in this paper :)
As an example, I was trying a project to look at the top news sites, like BBC, CNN, Al Jazeera, etc., then matching articles about the same news topic on each site, before finally fact-checking for differences between the stories (e.g. 20,000 homes were without power vs. 50,000).
That kind of project requires a load of crawling, but I can't look for specific pages without processing the entire CC set first.
I love the project though, so thank you for doing it!
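The last step of that idea, spotting figures that differ between two stories, could look roughly like the sketch below. It assumes the two articles have already been matched to the same topic (the hard part, not shown), and it is deliberately naive: it compares every number it finds, with no notion of what each number refers to. All function names are made up for illustration.

```python
import re

# Matches integers and decimals with optional thousands separators, e.g. "20,000" or "3.5".
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Return every numeric figure in the text, with thousands separators removed."""
    return [float(match.replace(",", "")) for match in NUMBER.findall(text)]


def conflicting_figures(story_a: str, story_b: str) -> tuple[list[float], list[float]]:
    """Figures that appear in only one of the two stories, as (only_in_a, only_in_b)."""
    a = set(extract_numbers(story_a))
    b = set(extract_numbers(story_b))
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    only_a, only_b = conflicting_figures(
        "20,000 homes were without power",
        "50,000 homes were without power",
    )
    print("only in first story:", only_a)
    print("only in second story:", only_b)
```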
AWS optimizations: For that level of cost efficiency, you really need to use spot instances. A cluster of 100 m1.xlarge machines (1.5TB of RAM, 400 cores, and 168TB of magnetic disk storage) will only cost you $3 per hour using spot instances, rather than $30 on-demand. You should pay on-demand prices for the Hadoop master, however, since a spot instance can be reclaimed at any time.
You'll also want to roll your own Hadoop cluster as opposed to using Elastic MapReduce (EMR). EMR is amazing but the cost overhead when using it on spot instances is ~100%.
For the code itself, this is a situation where you'll want to stick with the programming language that has both the best performance and best ecosystem. I'm personally not a big fan of Java, but it really does win out here -- it's close to C or C++ for performance and has the advantage of the Hadoop ecosystem behind it. Other languages are certainly usable, but even if LanguageX only ran 4x slower than Java, the resulting job would be 4x more expensive due to paying by the hour.
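To make the arithmetic concrete, here is a small Python sketch of the cost reasoning above. The $3 and $30 hourly cluster rates, the ~100% EMR overhead, and the 4x slowdown come from the text; the 10-hour job length is only an assumed figure for illustration, and the on-demand master is left out.

```python
def job_cost(cluster_rate_per_hour: float, hours: float,
             slowdown: float = 1.0, overhead: float = 0.0) -> float:
    """Estimated cost in dollars for a cluster billed by the hour.

    slowdown: runtime multiplier relative to the baseline implementation.
    overhead: extra fraction charged on top of the instance price (1.0 means +100%).
    """
    return cluster_rate_per_hour * (1.0 + overhead) * hours * slowdown


if __name__ == "__main__":
    HOURS = 10.0  # assumed job length, for illustration only

    print("spot, Java baseline:  ", job_cost(3.0, HOURS))
    print("on-demand:            ", job_cost(30.0, HOURS))
    print("spot + ~100% overhead:", job_cost(3.0, HOURS, overhead=1.0))
    print("spot, 4x slower code: ", job_cost(3.0, HOURS, slowdown=4.0))
```

The point is simply that every factor multiplies: instance pricing, any per-instance surcharge, and runtime all scale the bill linearly.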
Other than that, it's really just a standard MapReduce job using Hadoop. You can see three examples, one for each of the three data formats we use, at: https://github.com/commoncrawl/cc-warc-examples/
I'm just curious as to how commoncrawl compares with kimonolabs and import.io as they seem to have the same goal of creating an internet as a dataset, or an API. I can't help but feel like it's just solving another 'semantic web' problem that nobody asked for.
It is funny that the most demanding customers of semantic web are also the ones who are willing to pay the least amount of time and money.
Regarding point two: "monopoly businesses who viciously protect their human uploaded content". I spend a lot of time scraping these monopoly businesses, and it seems to me they do a decent job of letting their users decide what data is exposed. Facebook, LinkedIn, and Google are all decent about letting me scrape their public info. That's all I have a right to -- private info should stay private, at the behest of the owner (the User in UGC).
You are correct regarding the third point, but I don't see that as a problem. This isn't a solution in search of a problem -- it's a problem without a solution at the moment.
Here's a toy example of something I'd like to do: calculate the positive / negative sentiment of commenters at particular baseball fan sites, so I can hide the content I don't like, and show that which I do. Having a common crawl of the site would be immensely useful (and is indeed a prereq) for this. I wouldn't need to republish it, just compute on it.
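A toy version of that in Python, just to show the shape of the computation. It uses a tiny hand-written word list as the sentiment lexicon, which is an assumption for illustration; a real attempt would use a proper lexicon or a trained classifier, and would need the crawled comments as input.

```python
import re

# Tiny illustrative lexicon (assumption): not a real sentiment resource.
POSITIVE = {"great", "love", "win", "awesome", "good"}
NEGATIVE = {"terrible", "hate", "lose", "awful", "bad"}


def sentiment(comment: str) -> int:
    """Positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def visible(comments: list[str], threshold: int = 0) -> list[str]:
    """Keep only comments whose sentiment score is at or above the threshold."""
    return [c for c in comments if sentiment(c) >= threshold]


if __name__ == "__main__":
    comments = [
        "Great win, love this team!",
        "Terrible call, I hate it",
        "Game starts at seven",
    ]
    for comment in visible(comments):
        print(comment)
```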
It would be so nice to go to xyz.com and actually find what I am looking for in under 1 second.
I think we need to optimize local search engines and aggregate (in some ways) all this into a global search engine.
https://www.google.com/work/search/products/gss.html
I used it several years ago to power this site search:
http://www.poetryfoundation.org/search/articles#qs=ginsberg
It works pretty well (disclaimer: I'm not sure this is still being used on that site)
Full-text indexing (what ES provides) has been around for almost forever, ES just does a way better job of productizing/delivering it.
However, Google is far more than a text index. Ranking is currently still very difficult and requires messing around with facet and weighting parameters.
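To illustrate what "weighting parameters" means here, a rough Python sketch of field-weighted term scoring. This is not how ES or Google score documents; the field names and weights are made up, and the score is just a weighted count of query-term occurrences.

```python
import re

# Assumed weights for illustration: a hit in the title counts three times a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def score(query: str, doc: dict[str, str], weights: dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted count of query-term occurrences across the document's fields."""
    terms = re.findall(r"\w+", query.lower())
    total = 0.0
    for field, weight in weights.items():
        words = re.findall(r"\w+", doc.get(field, "").lower())
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(query: str, docs: list[dict[str, str]],
         weights: dict[str, float] = FIELD_WEIGHTS) -> list[dict[str, str]]:
    """Documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(query, d, weights), reverse=True)


if __name__ == "__main__":
    docs = [
        {"title": "Poems", "body": "ginsberg wrote howl"},
        {"title": "Ginsberg", "body": "a poet"},
    ]
    for doc in rank("ginsberg", docs):
        print(score("ginsberg", doc), doc["title"])
```

Tuning is then a matter of adjusting the weights until the results for known queries look right, which is exactly the fiddly part.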
In my opinion, which means nothing, sites need to figure out how to power their own search. Using a third party isn't going to work for most. Maybe people need to focus on building custom architecture that indexes the data in a more structured way, rather than cobbling systems together that ultimately hinder search efforts when it's time to get the user what they want. I don't know the answer, but somebody eventually will. Maybe WordPress will create a powerful search for all those WordPress sites.
Why use humans? People can decide if your navigation is intuitive. They can decide if your page looks like crap. If 230,000 people are searching for "coconut oil" per month (actual numbers) then it's worth having an intern spend 15 minutes to make sure page 1 of "coconut oil" looks right.
Google can afford that. They need a human to decide if the "user experience" is actually good vs. disallowing the back button and forcing the browser to crash, which is how I suppose you could fake a "time on site" metric if this was just an algorithmic problem.
Google is now more like playing Zork. You type "Go North" like 10 million other people before you typed "Go North" and Google has already crafted the experience you'll find in the next room. (Which makes me wonder, do they score how boring you are based on predictability?) This is becoming more and more obvious over time as a search for "calculator" shows you an actual calculator that a human at Google created. That's not an algorithmic response.
Similarly, I see that human touch coming more into play with voice recognition, Google Glass, Siri, etc. Call that "AI" or whatever. You ask Google a question and Google has already sculpted a slick answer based on tons of testing. That's how I see Google as a search engine now. Part of the crawling is interesting (recognizing objects in photos?) but I think human reviews of all the important websites and SERPs, that's harder for a competitor to reproduce.
Google was forced into that by improved "search engine optimization". SEO used to be about things like keyword stuffing, but as Google made their search engine smarter, SEO companies made their search spamming smarter. There are now SEO operations using machine learning to reverse engineer Google's algorithms and then automatically spam just enough to stay under the threshold.
In 2010, Google tried using "local" data to improve search. That turned out to be extremely easy to spam. A classic example of this can be found by searching for "laptop repair bradford pa". This brings up "Illusory Laptop Repair", located in the middle of a railroad crossing. An SEO expert created that phony business listing to demonstrate how bad Google was at detecting such spam. Google still thinks it's real.
In 2012, Google tried using "social" data to improve search. That worked even worse. Fake Google accounts created to generate fake "+1"s may have exceeded the number of real ones. Google "+1"s are still for sale; the going rate is about $0.10 each.
Meanwhile, links aren't as useful as they used to be. Who creates a link to a retail outlet other than on social media any more? Google is trying all sorts of "signals", but in heavily spammed areas, they're not doing all that well.
Yandex has been trying search that doesn't weight links at all for some heavily spammed categories in the Moscow area. It seems to be working for fake real estate ads.
(We have a partial solution - find the real-world business behind the web site and check it out in hard data sources, such as Dun and Bradstreet or Experian, which have business credit data. See "sitetruth.com/doc".)
This doesn't seem to work.
http://queue.acm.org/detail.cfm?id=988407
EDIT:
Article is clearly from an earlier era, but it's really cool to see how far we've come and how much more computing power we have available now. There are entire categories of problems that simply don't exist anymore.
It's surprising to me that there aren't search engines from Comcast, AT&T, and Apple. If you have customers, why give up all that ad revenue to Google? Google is paying some big players a lot of money not to do that. They were paying Apple $1 billion a year to be the default on Apple products. Apple switched from Google to Bing anyway.
They raised ~$30 million in two rounds, but their valuation was at $200 million by round two. I agree with your point though; the cost to develop a good search engine is dirt cheap compared to the value it brings.
To give you an example, search for "webhcat primary key" (without quotes) and note how the top three search results do not actually contain the term webhcat. Google constantly does this. It randomly ignores search terms unless you explicitly quote them.
I believe that there is still a market for a technical/advanced search engine.
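The behaviour I'd want from such an engine is easy to state: a result only counts if it contains every query term. A minimal Python sketch of that filter, as a post-processing step over candidate documents (the function names and sample texts are made up):

```python
import re


def contains_all_terms(query: str, document: str) -> bool:
    """True only if every query term appears as a whole word in the document."""
    doc_words = set(re.findall(r"\w+", document.lower()))
    return all(term in doc_words for term in re.findall(r"\w+", query.lower()))


def strict_filter(query: str, documents: list[str]) -> list[str]:
    """Drop any document that is missing at least one query term."""
    return [doc for doc in documents if contains_all_terms(query, doc)]


if __name__ == "__main__":
    docs = [
        "HCatalog tables have no primary key",
        "WebHCat does not enforce a primary key",
    ]
    for doc in strict_filter("webhcat primary key", docs):
        print(doc)
```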
I volunteered a bit earlier this year for Common Crawl (not much, just some Java and Clojure examples for fetching and using the new archive format).
Common Crawl already has many volunteers (and a professional management and technical staff) so it would seem like a good idea to merge some of the author's goals with the existing Common Crawl organization. Perhaps more frequent Common Crawl web fetches and also making the data available on Azure, Google Cloud, etc. would satisfy the author's desire to have more immediacy and have the data available from multiple sources.
Most of the Common Crawl data is on Microsoft Azure, but not all of it.
The Common Crawl is a great resource that deserves attention from more companies and developers.
[1] https://www.backblaze.com/petabytes-on-a-budget-how-to-build...
[2] https://www.backblaze.com/blog/why-now-is-the-time-for-backb...
1) I like the idea of human curation, but in combination with some sort of automated crawler (or other tool) that helps in the browser.
2) Why can't we also distribute the act of crawling, the maintenance of the index and the map-reduce (or other algorithm) that produces the data?
I've been thinking about architectures that would allow (in essence) a P2P search system. Would anyone be interested in talking about architectures to make this work? There are millions of computers on the web at any given time ... if it's built into the browser (or plugs in), you could have human input at the same time.
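As a starting point for discussion, here is a Python sketch of one small piece: deciding which peer crawls which site. It uses plain hash partitioning on the hostname, so all pages of a site land on the same peer and any participant can compute the assignment without a coordinator. The peer names are hypothetical, and this ignores the hard parts (peers joining and leaving, trust, duplicate work); with this simple modulo scheme, changing the number of peers reshuffles most assignments.

```python
import hashlib
from urllib.parse import urlsplit


def assign_peer(url: str, peers: list[str]) -> str:
    """Pick the peer responsible for a URL by hashing its hostname."""
    if not peers:
        raise ValueError("at least one peer is required")
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket selector.
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]


if __name__ == "__main__":
    peers = ["peer-a", "peer-b", "peer-c"]  # hypothetical peer identifiers
    for url in [
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.org/",
    ]:
        print(url, "->", assign_peer(url, peers))
```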
Then one would be able to do some "stuff Google can do" - say, analysing trends - albeit with worse sampling, and not depend that much on them.