But I am sure it would be a logical fallacy to conclude that it must be out of date simply because they've been indexing for several years. By that logic, Google would be even further out of date, having indexed for over a decade.
Does anyone know how this compares to, let's say, Yahoo BOSS? Is it even comparable?
As far as comparisons to Yahoo BOSS are concerned, no, we are definitely not comparable to Yahoo BOSS or other such APIs that run on top of an already built (and properly ranked) inverted index of the web. At this stage we only produce bulk snapshots of what we crawl, and we are focusing our engineering resources on improving the frequency and coverage of crawl (the results of which will hopefully start to bear fruit in early 2012). Perhaps at some point in the near future, we can partner with the community to build a rudimentary full-text inverted index of the Web that we can make available in bulk via S3 as well.
From what I have heard, BOSS continues to do very well and is pointed to internally as an example of how to turn an API into a real business and product.
One more note: I am now at Factual, where we are very happy consumers of the CommonCrawl service.
Take crawl freshness. If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.
In the 2004 timeframe, Yahoo was crawling about the same number of pages as Google. (In some months, more.)
> If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.
Time from crawl to appearing in search results is a different issue.
I think all projects should have sample datasets. It simplifies a lot of things, and in this case stops hundreds of geeks from burning through bandwidth before they realize they don't have a clue what they are going to do with the data.
The crawl file presumably contains the contents of websites, so the owners of those websites could assert that the Common Crawl Foundation is distributing their work without permission or license.
There are all sorts of republishing/splog 'opportunities' with this crawl data that go beyond the originally expected use.
Surprisingly, I couldn't see anything about this covered in the FAQs.
http://www.commoncrawl.org/about/terms-of-use/
-Violate other people’s rights (IP, proprietary, etc.)
Many content sites have protections in place to recognize bots by their behavior, or use "honeypots" to tell bots apart from human visitors, and thus avoid large-scale content theft.
I personally would LOVE to have a simple list of the domain names themselves, without all of the connections and documents.
Also: Why not just use BitTorrent to distribute it?
With S3, you could boot up a bunch of Hadoop processes, pull it (without incurring any bandwidth costs I believe), process it and dump out whatever you want.
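To make the "process it and dump out whatever you want" step a bit more concrete, here is a minimal sketch of the domain-list idea asked for above. It assumes you have already pulled the URLs out of the crawl files into a plain iterable of strings; the function name `unique_hosts` and the sample URLs are made up for illustration, and this is not Common Crawl's file format or tooling.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted, de-duplicated hostnames found in an iterable of URLs.

    Hypothetical helper: assumes `urls` yields absolute URL strings that were
    already extracted from the crawl data by some earlier step.
    """
    hosts = set()
    for url in urls:
        try:
            # .hostname is lowercased and has any port stripped
            host = urlsplit(url.strip()).hostname
        except ValueError:
            # malformed URL (e.g. a broken IPv6 literal): skip it
            continue
        if host:
            hosts.add(host)
    return sorted(hosts)


# Made-up sample input, just to show the shape of the call
sample = [
    "http://Example.com/a",
    "https://example.com/b",
    "http://blog.example.org:8080/x",
    "not a url",
]
for host in unique_hosts(sample):
    print(host)
```

In a Hadoop-style job the loop body would be the map step (emit one hostname per URL) and the set would be replaced by the reduce step's de-duplication, but that wiring depends on how the snapshots are actually laid out.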
I'll just leave this here: http://training.fogcreek.com
Were there any other reasons to not use Nutch (performance, etc.)?
I'd love to hear more about the stack you're using to perform the crawls. If you don't mind sharing, it would be very interesting to read about the costs involved in gathering this data (how many machines, how long did it take, etc.)
Any plans to open source that as well? In addition to a general lack of open web crawl data freely available, there are precious few open source projects (if any) that produce high quality crawlers able to deal with the modern web.
I work at Common Crawl. Thanks for all the interest and the good questions! Lisa
I now regret it, since this one got much more attention. I was under the impression that linking to the original post was more welcome here on HN, but it seems this is not always the case.