To start with, I'm just going to index as much data as I can fit on an entry-level cloud machine, and because I am very poor, I shall be asking for donations to expand the scope of the index.
Say I start with Wikipedia, Project Gutenberg and a couple of news sites. The first two will be easy: they publish dumps of their data, and I don't think Wikipedia would mind at all if I put a tiny amount of pressure on their servers for the good cause of building a free, anonymous and open web search. But what about the rest of the internet? Will they mind?
People crawl and scrape the web all the time for different purposes. I'm looking for some advice so that I don't piss anyone off with my crawler. What tools/strategies do you suggest I use?
Cheers!