> it could be good for a particular use case
Namely?
> or for learning from.
The author admitted in the GitHub README that the code quality is rather bad. I also don't see a link to the search index, which is the only valuable component of this project.
There is also a free API in beta test right now. It will probably be ready for official release next week.
But if they can, I think a big part of it would be separating the crawl index from the UI, prioritisation, and so on. Different people can work on those two ends of the problem and apply different philosophies.
Search only forums? Reject porn using some XYZ method? Great! But they can all use the same database, or pick from a common pool of community databases.
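The split suggested above can be sketched in a few lines. This is purely illustrative, not anything from the project: a shared crawl index as a dumb store, with different front-ends layering their own filtering philosophies on top. All names here (`CrawlIndex`, `forums_only`, etc.) are hypothetical.

```python
# Hypothetical sketch: one shared crawl index, multiple front-end philosophies.
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    is_forum: bool
    text: str


class CrawlIndex:
    """The shared component: a plain store of crawled pages."""

    def __init__(self, pages):
        self.pages = pages

    def search(self, term):
        # Naive full-text match; a real index would use an inverted index.
        return [p for p in self.pages if term in p.text]


def forums_only(index, term):
    """One front-end philosophy: only surface forum pages."""
    return [p.url for p in index.search(term) if p.is_forum]


def everything(index, term):
    """Another philosophy: no filtering at all, same underlying index."""
    return [p.url for p in index.search(term)]
```

The point is that both front-ends query the exact same `CrawlIndex`; only the ranking/filtering layer differs, so different people can maintain each piece independently.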
Have you also published the ranking mechanism? That way people might contribute to improving it.
2 servers are used for crawling, index building, and raw-data storage. Quad-core, 32 GB RAM, 4 TB HDD, and a 1 Gbit/s internet connection on each of these. They are rented and sit in a big data center. Crawling uses "only" about 200-250 Mbit/s of bandwidth.
2 servers for the webserver and queries. Quad-core, 32 GB RAM. One with 2x512 GB SSD, the other with only 1x512 GB SSD. These servers are here at home. I have cable internet with 200 Mbit/s down, 20 Mbit/s up. Static IPs, obviously.
A full crawl currently takes about 3 months.
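A quick back-of-envelope check on what those numbers imply: if the quoted 200-250 Mbit/s were sustained for the full 3-month crawl (an assumption; real crawls have idle time), the raw transfer volume comes out in the low hundreds of terabytes. Decimal units assumed throughout.

```python
# Rough estimate of total data transferred during a crawl,
# assuming a constant sustained bandwidth (decimal units: 1 TB = 1e12 B).
def crawl_volume_tb(mbit_per_s: float, days: float) -> float:
    bytes_total = mbit_per_s * 1e6 / 8 * days * 86400  # bits/s -> bytes over the period
    return bytes_total / 1e12


low = crawl_volume_tb(200, 90)   # ~194 TB at the low end
high = crawl_volume_tb(250, 90)  # ~243 TB at the high end
```

So the 4 TB HDDs per crawl server only make sense if most of what is fetched is parsed, indexed, and discarded rather than stored raw, which is plausible for a search crawler.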