I've been working on this for almost a decade, and being this close to the finish line feels surreal. This is my seventh iteration, I believe. Every time I realized I had an architecture built on a bad model I walked away, often in fury, depressed over my weak coding abilities, but I bounced back pretty quickly. I've gotten to a place where I'm comfortable throwing away months and months of code. Each time I started over, I started from scratch.
I did this out of love for and fascination with search and NLP, and to get good at coding. I started late in life as a professional programmer and have always felt the need to catch up with those younger than me. Today I feel like I've achieved something.
Here is a demo of a search engine [0] built on ResinDB [1]:
[0] http://searchpanels.com/ [1] https://github.com/kreeben/resin/
To start with, I'm just going to index as much data as I can fit on an entry-level cloud machine, and because I am very poor I will be asking for donations to widen the scope of the index.
Say I start with Wikipedia, Project Gutenberg and a couple of news sites. The first two will be easy: they publish dumps of their data, and I also don't think Wikipedia would mind at all if I put a tiny amount of pressure on their servers for the good cause of building a free, anonymous and open web search. But what about the rest of the internet? Will they mind?
People crawl and scrape the web all the time for different purposes. I'm looking for some advice so that I don't piss anyone off with my crawler. What tools/strategies do you suggest I use?
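For context, the baseline I have in mind is to obey each site's robots.txt and to throttle requests per host. Below is a minimal sketch of that idea, not the crawler itself. It uses Python's standard urllib.robotparser; the bot name, the 5-second fallback delay, and the helper names (parse_robots, crawl_delay, HostThrottle) are all made up for illustration, and fetching robots.txt over the network is left out.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSearchBot"  # hypothetical bot name, not a real crawler
DEFAULT_DELAY = 5.0              # assumed fallback, in seconds, when a site sets no Crawl-delay


def parse_robots(robots_txt):
    """Build a parser from the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_delay(parser, user_agent=USER_AGENT, default=DEFAULT_DELAY):
    """Return the site's Crawl-delay for this bot, or the fallback if none is given."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class HostThrottle:
    """Remembers, per host, the earliest time the next request may be sent."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_allowed = {}

    def wait_time(self, url):
        """Seconds to sleep before fetching this URL (0.0 if the host is ready)."""
        next_allowed = self._next_allowed.get(urlparse(url).netloc)
        if next_allowed is None:
            return 0.0
        return max(0.0, next_allowed - self._clock())

    def record_fetch(self, url, delay):
        """Call right after a request so the host is left alone for `delay` seconds."""
        self._next_allowed[urlparse(url).netloc] = self._clock() + delay
```

The crawl loop would then look roughly like this: skip the URL unless parser.can_fetch(USER_AGENT, url) is true, sleep for throttle.wait_time(url), fetch the page, and call throttle.record_fetch(url, crawl_delay(parser)). Is that enough, or is there more I should be doing?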
Cheers!