I really think a subset like this will increase the value, as it would allow people writing search engines (for fun or profit) to pull a copy down locally and work away. It's something I would like to do for sure.
While it's nice to have generalist search engines, it would be even better to be able to unbundle the generalist search engines completely. Verticals such as the following would be nice:
1) Everything Linux, Unix, or both
2) Everything open-source
3) Only news & current events
4) Popular culture globally and by country
5) Politics globally and by country
6) Everything software engineering
7) Everything hardware engineering
8) Everything maker community
9) Everything financial markets
10) Everything medicine / health (sans obvious quackery)
11) etc.
Maybe make a tool that lets the community write subset-creation recipes, each of which parses out data of a certain type, and which the community can fork and improve over time.
The ship has sailed on creating a generalist search engine, but specialist search engines are total greenfield.
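To make the recipe idea a bit more concrete, here is a rough sketch of what a forkable recipe could look like: just a predicate over pages that can be combined with other predicates. Everything here (`Page`, `domain_recipe`, `keyword_recipe`, `any_of`, `build_subset`, and the sample domains and keywords) is made up for illustration and is not part of any existing Common Crawl tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator
from urllib.parse import urlparse


@dataclass
class Page:
    """Hypothetical minimal view of a crawled page."""
    url: str
    text: str


# A recipe is any function that decides whether a page belongs in the subset.
Recipe = Callable[[Page], bool]


def domain_recipe(domains: Iterable[str]) -> Recipe:
    """Keep pages whose host is one of `domains` or a subdomain of one."""
    wanted = [d.lower() for d in domains]

    def matches(page: Page) -> bool:
        host = (urlparse(page.url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in wanted)

    return matches


def keyword_recipe(keywords: Iterable[str], min_hits: int = 2) -> Recipe:
    """Keep pages whose text mentions at least `min_hits` distinct keywords."""
    lowered = [k.lower() for k in keywords]

    def matches(page: Page) -> bool:
        text = page.text.lower()
        return sum(1 for k in lowered if k in text) >= min_hits

    return matches


def any_of(*recipes: Recipe) -> Recipe:
    """Combine recipes: a page is kept if any one of them accepts it."""
    return lambda page: any(recipe(page) for recipe in recipes)


def build_subset(pages: Iterable[Page], recipe: Recipe) -> Iterator[Page]:
    """Lazily yield only the pages the recipe accepts."""
    return (page for page in pages if recipe(page))
```

A "Linux/Unix" vertical could then be something like `any_of(domain_recipe(["kernel.org"]), keyword_recipe(["linux", "kernel", "unix"]))`, and improving it is a matter of forking that one line.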
Where can I read more about this?
It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...
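For anyone else giving it another look, here is a minimal sketch of walking a gzipped WARC file with only the Python standard library. It assumes each record follows the usual WARC layout (a `WARC/...` version line, header lines, a blank line, then exactly `Content-Length` bytes of body), and it relies on Python's `gzip` module reading concatenated gzip members as one stream. The function name `read_warc_records` is my own, and a dedicated WARC library would be more robust than this.

```python
import gzip
import io


def read_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream.

    `headers` maps lower-cased field names to string values; `body` is bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Read exactly the declared number of body bytes.
        length = int(headers.get("content-length", 0))
        body = stream.read(length)
        yield headers, body
```

Usage would be along the lines of `with gzip.open("segment.warc.gz", "rb") as f: for headers, body in read_warc_records(f): ...`, where the file name is a placeholder. Comparing `len(body)` against what the server originally reported is one way to spot truncated documents yourself.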
At one point I tried out a 10 MB limit. The thing is, we try to limit crawls to webpages, and few are that big, but occasionally we'd hit sites on ISDN-speed connections that would slow down the whole thing.
For the next crawl, we'll mark which pages are truncated and which aren't (an oversight in the last crawl) so at least you can skip over them.
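Roughly, the idea is to cap both size and time per fetch and to record why a page was cut off. The sketch below is only an illustration of that idea, not the crawler's actual code: `read_capped`, its parameters, and the `"length"` / `"time"` reason strings are all assumptions of mine.

```python
import io
import time


def read_capped(stream, max_bytes, max_seconds, chunk_size=64 * 1024,
                clock=time.monotonic):
    """Read from `stream` until EOF, a byte cap, or a time budget is hit.

    Returns (data, reason) where reason is None for a complete read,
    "length" if the byte cap cut the page short, or "time" if the time
    budget ran out first.
    """
    deadline = clock() + max_seconds
    chunks = []
    total = 0
    while total < max_bytes:
        if clock() >= deadline:
            return b"".join(chunks), "time"
        chunk = stream.read(min(chunk_size, max_bytes - total))
        if not chunk:
            return b"".join(chunks), None  # reached EOF under both limits
        chunks.append(chunk)
        total += len(chunk)
    # We hit the byte cap; it only counts as truncation if more data remains.
    reason = "length" if stream.read(1) else None
    return b"".join(chunks), reason
```

Note that the deadline is only checked between reads, so with a real socket you would also want a per-read timeout on the connection. Anyone consuming the data can then skip a page whenever the recorded reason is not `None`.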
Also, hopefully you'll find the new metadata files to be a little clearer. We switched over to the same format the Internet Archive uses, and it contains quite a bit more data (XPath truncated paths for each link, for instance).
I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.
Perhaps I just don't know what I'm looking at?
http://commoncrawl.org/get-started/
I haven't tried that one, but I've poked at others in the Amazon Common Datasets collection:
http://aws.amazon.com/datasets
If you're already familiar with using Amazon's virtual servers, it's pretty straightforward.
I also note that the Common Crawl project publishes code here:
That would be a great starter for all sorts of fun little weekend experiments.