marginalia_nu · 2y ago

Sure, download and run the JavaScript, but then you can snapshot the DOM, grab the text, and discard all the rest. The HTML and JS are of little practical value for the index after that point.
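As a rough illustration of the "grab the text, discard the rest" step, here is a minimal sketch using only Python's standard library. It assumes the DOM snapshot has already been serialized to an HTML string after any JavaScript ran; `TextExtractor` and `extract_text` are hypothetical names chosen for this example, not part of any crawler mentioned in the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from an HTML snapshot, skipping script/style content."""

    # Elements whose contents are never useful as indexable text.
    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace inside each text node.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html_snapshot: str) -> str:
    """Return the text of a serialized DOM snapshot; the markup is discarded."""
    parser = TextExtractor()
    parser.feed(html_snapshot)
    parser.close()
    return parser.text()
```

This only strips markup. It knows nothing about layout or visibility, so navigation and other boilerplate text would still come through.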

Google's index is likely very large because they don't have any real economic incentive to keep it small.


5 comments · 1 top-level

fauigerzigerk · 2y ago · 4 in thread

>... but then you can snapshot the DOM, grab the text, and discard all the rest

Yes, absolutely, I didn't mean to imply otherwise. But first you have to figure out what you can discard beyond the HTML tags themselves to avoid indexing all the garbage that is on each and every page.

When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.

So what I'm saying is that storing a couple of kilobytes is probably not the most costly part of indexing a page.

akiselev · 2y ago

> When I tried to do this I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, if it was even visible, etc. And then there's JavaScript of course.

Are there open source projects devoted to this functionality? It's increasingly a sticking point when working with LLMs: grabbing the text without navigation and other crap, while keeping formatting, links, etc.

fauigerzigerk · 2y ago

Good question (meaning I don't know :)

For my specific purposes it has always been good enough to apply some simple heuristics. But that wouldn't have been possible without access to post rendering information, which only a real browser (https://pptr.dev) can reliably produce.
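To make the idea concrete, here is a minimal sketch of what simple heuristics over post-rendering information might look like. It assumes a real browser has already produced per-block layout data; the `RenderedBlock` fields, the thresholds, and the function names are all hypothetical choices for illustration, not the heuristics actually used by the commenter.

```python
from dataclasses import dataclass


@dataclass
class RenderedBlock:
    """One text block with layout data exported from a browser (assumed shape)."""
    text: str
    font_size_px: float
    visible: bool
    width: float
    height: float
    top: float  # distance from the top of the page, in pixels


def keep_block(block: RenderedBlock, min_font_px: float = 10.0) -> bool:
    """Decide whether a block is worth indexing (illustrative thresholds)."""
    if not block.visible:
        return False  # hidden via CSS, never shown to the reader
    if block.width <= 0 or block.height <= 0:
        return False  # collapsed to zero area
    if block.font_size_px < min_font_px:
        return False  # tiny print, e.g. legal footers
    return bool(block.text.strip())


def main_text(blocks: list[RenderedBlock]) -> str:
    """Join the kept blocks in top-to-bottom reading order."""
    kept = sorted((b for b in blocks if keep_block(b)), key=lambda b: b.top)
    return "\n".join(b.text.strip() for b in kept)
```

None of these checks can be made from the raw HTML alone, which is why the rendering step is needed first.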

DeathArrow · 2y ago

There are many software libraries that can output just the text from HTML or run JS. For C# there's HTML Agility Pack and PuppeteerSharp, for example. I have used them for web scraping.

marginalia_nu (OP) · 2y ago

You don't need to store it indefinitely though, and there's not much point in crawling faster than you can process the data.

The couple of kilobytes per document is the actual storage footprint. Sure, you need to massage the data, but that is almost entirely CPU-bound. You also need a lot of RAM to hold the hot parts of the index.
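For a sense of scale, the storage arithmetic can be sketched in a few lines. The numbers are purely illustrative assumptions: "a couple of kilobytes" is taken as 2,000 bytes, and the document count of 100 million is hypothetical, not a figure from this thread.

```python
def index_size_bytes(num_documents: int, bytes_per_document: int) -> int:
    """Total storage if every document costs a fixed number of bytes."""
    return num_documents * bytes_per_document


def to_gigabytes(size_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return size_bytes / 1_000_000_000


docs = 100_000_000  # hypothetical corpus size
per_doc = 2_000     # "a couple of kilobytes", assumed to mean 2 kB

print(f"{to_gigabytes(index_size_bytes(docs, per_doc)):.0f} GB")  # prints "200 GB"
```

Under those assumptions the stored text fits on a single commodity disk, which is consistent with the point that CPU and RAM are the larger costs.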
