for(listofurls) { geturl; add urls to listofurls; }
Doing it on a large scale, over and over, is a harder problem (which Common Crawl solves for you), but it's not too difficult until you hit scale or want realtime crawling.
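To make the one-liner above concrete, here is a minimal sketch of the same loop in Python, using only the standard library. The names `LinkParser`, `fetch`, and `crawl` and the `max_pages` cap are made up for illustration; a real crawler would also need robots.txt handling, politeness delays, and deduplication beyond exact URL matches.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text (illustrative helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl; returns the URLs fetched, in fetch order."""
    frontier = deque(seeds)   # the "listofurls" from the pseudocode
    seen = set(seeds)         # never queue the same URL twice
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        fetched.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```

Passing the fetch function in as a parameter keeps the loop testable without touching the network; something like `crawl(["http://example.com/"], max_pages=10)` would run it against a real site.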
Building an index on 210 TB of data, however... Assuming you use Sphinx/Solr/Gigablast, you are going to need about 50 machines to deal with this amount of data with any sort of redundancy. That's just to hold a basic index, not including "pagerank" or anything (Gigablast is a web engine so it might have that in there, not sure). You aren't factoring in adding rankers to make it a web search engine, spam/porn detection and all of the other stuff that goes with it. Then you get into serving results. Unless your indexes are in RAM you are going to have a pretty slow search engine, so add a lot more machines to hold the index for common terms in memory.
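As a rough back-of-envelope check on those numbers, the sketch below spreads the 210 TB evenly over 50 machines. The function name and the replication factor of 2 are assumptions for illustration only; the comment just says "any sort of redundancy", and an index is not the same size as the raw data it covers.

```python
def raw_tb_per_machine(total_tb, machines, replicas=1):
    """Terabytes of raw data each machine covers, given `replicas` full copies."""
    return total_tb * replicas / machines


# Figures from the comment: 210 TB of data, about 50 machines.
print(raw_tb_per_machine(210, 50))              # one copy, no redundancy
print(raw_tb_per_machine(210, 50, replicas=2))  # two copies (assumed factor)
```

Either way each box is responsible for several terabytes, which is why keeping the hot part of the index in RAM means adding more machines.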
If someone is keen to do this, however, here is a list of articles/blogs which should get you started (I wrote this originally as an HN comment which got a lot of attention, so I made it into a blog post): http://www.boyter.org/2013/01/want-to-write-a-search-engine-...
What I heard about a smaller search engine was that web crawling is usually augmented with manually added rules for various sites, to keep junk from spoiling the database. Not a trivial task at all.
Doing queries is IMHO algorithmically much better understood, because it's a constrained problem. But extracting information from the real world, with all the PHP and HTML "hackers", is not so easy.
I wonder if there is a viable business in maintaining an in-memory & up-to-date index of the public web & selling access to it, with a pricing model that scales according to the amount of computation you are doing on it.
Table 2a purports to show the frequency of SLDs:
Rank  SLD                    Count       Fraction
   1  youtube.com            95,866,041  0.0250
   2  blogspot.com           45,738,134  0.0119
   3  tumblr.com             30,135,714  0.0079
   4  flickr.com              9,942,237  0.0026
   5  amazon.com              6,470,283  0.0017
   6  google.com              2,782,762  0.0007
   7  thefreedictionary.com   2,183,753  0.0006
   8  tripod.com              1,874,452  0.0005
   9  hotels.com              1,733,778  0.0005
  10  flightaware.com         1,280,875  0.0003
If I'm reading this correctly, the crawler managed to hit a huge number of YouTube video pages... but still only a fraction of them. I couldn't find a total YouTube video count, but YouTube's own stats page says 200 million videos have been tagged with Content-ID alone (identified as belonging to movie/TV studios).
In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.
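A quick sanity check on the table, as a sketch: dividing a count by its fraction gives the total crawl size the table implies, and counting the rows with a larger count shows where a domain with 4 million pages would land. The 4,000,000 figure is the hypothetical case where every English Wikipedia article had been crawled; the helper names are made up.

```python
# (SLD, count, fraction) rows copied from Table 2a above.
ROWS = [
    ("youtube.com", 95_866_041, 0.0250),
    ("blogspot.com", 45_738_134, 0.0119),
    ("tumblr.com", 30_135_714, 0.0079),
    ("flickr.com", 9_942_237, 0.0026),
    ("amazon.com", 6_470_283, 0.0017),
    ("google.com", 2_782_762, 0.0007),
    ("thefreedictionary.com", 2_183_753, 0.0006),
    ("tripod.com", 1_874_452, 0.0005),
    ("hotels.com", 1_733_778, 0.0005),
    ("flightaware.com", 1_280_875, 0.0003),
]


def implied_total(count, fraction):
    """Total number of pages implied by one row's count and fraction."""
    return count / fraction


def rank_if_present(count, rows=ROWS):
    """1-based rank a domain with `count` pages would take in the table."""
    return 1 + sum(1 for _sld, c, _frac in rows if c > count)


_sld, top_count, top_fraction = ROWS[0]
print(f"implied total: {implied_total(top_count, top_fraction) / 1e9:.2f} billion pages")
print("rank with 4,000,000 pages:", rank_if_present(4_000_000))
```

Under that assumption Wikipedia would sit between amazon.com and google.com, so its absence from the top ten does look odd.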
Some crawlers are most interested in the freshest versions of the most inlinked articles, or in the exact HTML presentation at Wikipedia.
The monthly full raw wikitext dumps don't provide that.
And, Wikipedia's serving plant is pretty efficient, with bandwidth only being a small portion of their costs. They can afford some crawling... and correspondingly, their /robots.txt is pretty open.
Good crawlers seeking just the bulk text shouldn't try to grab the whole thing as fast as possible via the standard web URLs... but other good crawlers may want or need to visit discovered Wikipedia links, and doing so at a measured pace should be OK.
$ cci_lookup org.wikipedia.en | wc -l
2516956
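The key passed to `cci_lookup` above looks like the hostname with its dot-separated labels reversed (`en.wikipedia.org` becomes `org.wikipedia.en`). Assuming that convention, a small helper to build such keys might look like this; both function names are made up for illustration and are not part of the tool.

```python
def host_to_index_key(host):
    """Reverse the labels of a hostname: 'en.wikipedia.org' -> 'org.wikipedia.en'."""
    return ".".join(reversed(host.lower().split(".")))


def index_key_to_host(key):
    """Inverse of host_to_index_key."""
    return ".".join(reversed(key.split(".")))


print(host_to_index_key("en.wikipedia.org"))
```

Reversing the labels puts all hosts of one domain next to each other in sorted order, which is what makes a prefix lookup like the one above possible.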
(See https://github.com/wiseman/common_crawl_index, but note that the index is incomplete.)
Actually, tomorrow a video on a startup that uses Common Crawl data is getting posted.
and you just need to comply with the Common Crawl TOU: http://commoncrawl.org/about/terms-of-use/
Some good stuff!
From there you can grab the S3 command line tools (http://s3tools.org/s3cmd) or load it up from Hadoop or through one of the various open source libraries (boto, for instance).
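For the library route, here is a hedged sketch of listing a few objects under a prefix. It uses the `list_objects_v2` call from boto3 (the successor to the boto library mentioned above) rather than boto itself, and takes the client as a parameter. The helper name is made up, and the bucket and prefix in the commented usage are placeholders, not the real location of the data.

```python
def list_keys(s3_client, bucket, prefix, limit=10):
    """Return up to `limit` object keys under `prefix` in `bucket`."""
    response = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=prefix, MaxKeys=limit
    )
    # "Contents" is absent from the response when nothing matches.
    return [obj["Key"] for obj in response.get("Contents", [])]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   client = boto3.client("s3")
#   print(list_keys(client, "example-bucket", "example/prefix/"))
```

Taking the client as an argument keeps the helper independent of how credentials are set up.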