
0 points · marginalia_nu · 3y ago · 0 comments

1 million websites, a bit above 60 million documents in the index; the crawl is a couple of hundred million but a lot of it gets filtered out for various reasons.

The crawler itself is aware of 470 million URLs.

I've actually had it up to 50 million before, but that was a lot noisier data with fewer keywords per document. The current 60 million is significantly "bigger" than the old 50 million. Index size is not actually a great metric for how comprehensive a search engine is. A small index with good signal-to-noise ratio is much more useful than a large one where 95% is chaff.

100 million is my current goal. I think that's about what's doable on my current hardware. It also gets increasingly unwieldy to deal with the data. I've already got processes that require several days of non-stop computation.


3 comments · 2 top-level

potamic · 3y ago · 1 in thread

For sure, a large index by itself doesn't mean anything. I was more curious about the size on disk and how you manage it on a single machine.

Also curious now: why do you say "half a 470m URLs"? :)

marginalia_nu (OP) · 3y ago

Size on disk is something like 300-400 GB, I think. Fairly manageable. I think it would require significantly more hardware with a multi-node approach. Locality is hella efficient.

I accidentally a word while editing the sentence.

dr_dshiv · 3y ago

I really appreciate your work.
