
webtechgal · 9y ago

Great work, congrats. :-)

Here is some input based on my experience building a similar project at my former company. (We did not get anywhere near 2B pages; our index was around 300M.)

For a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Obviously, re-crawling a massive index regularly consumes huge amounts of bandwidth and CPU cycles. Here is how we optimized resource utilization:

Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.

Corresponding to each indexed URL, also store a sort of 'crawl history'. (If space is a constraint, don't store every version of the page content; keep only the latest one.) On each re-crawl, store two data fields: the time-stamp, and a boolean indicating whether the URL's content has changed since the last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then prioritize re-crawls based on that update-frequency score, i.e. re-crawl URLs with higher scores more often and the rest less often.
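
To make the idea concrete, here is a rough Python sketch. It is not what we actually ran; all the names (`UrlRecord`, `update_frequency`, `recrawl_order`, `next_crawl_time`) are made up for illustration, and the score is simply assumed to be the fraction of past crawls in which the content had changed. A real system would probably weight recent crawls more heavily.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class UrlRecord:
    """Per-URL crawl bookkeeping (hypothetical structure)."""
    url: str
    last_crawled: float = 0.0  # 'Last Crawled' time-stamp (epoch seconds)
    # Crawl history: one (time-stamp, content_changed) pair per re-crawl.
    history: List[Tuple[float, bool]] = field(default_factory=list)

    def record_crawl(self, timestamp: float, changed: bool) -> None:
        """Store the two fields for this crawl and bump 'Last Crawled'."""
        self.history.append((timestamp, changed))
        self.last_crawled = timestamp

    def update_frequency(self) -> float:
        """Assumed score: fraction of crawls where the content had changed."""
        if not self.history:
            return 1.0  # never crawled yet: treat as top priority
        changes = sum(1 for _, changed in self.history if changed)
        return changes / len(self.history)


def recrawl_order(records: List[UrlRecord]) -> List[UrlRecord]:
    """Highest update-frequency score first."""
    return sorted(records, key=lambda r: r.update_frequency(), reverse=True)


def next_crawl_time(record: UrlRecord, min_interval: float, max_interval: float) -> float:
    """Map the score linearly onto a re-crawl interval (an assumption):
    score 1.0 waits min_interval, score 0.0 waits max_interval."""
    score = record.update_frequency()
    interval = max_interval - score * (max_interval - min_interval)
    return record.last_crawled + interval


if __name__ == "__main__":
    news = UrlRecord("https://example.com/news")
    for ts, changed in [(1, True), (2, True), (3, False), (4, True)]:
        news.record_crawl(ts, changed)

    static = UrlRecord("https://example.com/about")
    for ts in (1, 2, 3, 4):
        static.record_crawl(ts, False)

    for rec in recrawl_order([static, news]):
        print(rec.url, rec.update_frequency())
```

The linear mapping from score to interval is only one option; the point is that frequently changing pages get short intervals and static pages drift toward the maximum.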

If you need any more help/input, let me know and I'll be happy to do what I can.

HTH and all the best moving forward.


webtechgal (OP) · 9y ago

We had also (obviously) built a (proprietary) ranking algo that took into account some 60+ individual factors. If it can be of any help, I'll create a list and send it to you.

ddorian43 · 9y ago

Why not write that list here?

webtechgal (OP) · 9y ago

Good idea. However, I'll need to really exercise the gray cells to put the list together, so it might take me a couple of days. Once done, I'll post it here.
