> ... but then you can snapshot the DOM, grab the text, and discard all the rest
Yes, absolutely, I didn't mean to imply otherwise. But first you have to figure out what you can discard beyond the HTML tags themselves, to avoid indexing all the garbage that is on each and every page.
When I tried to do this, I came to the conclusion that I needed to actually render the page to find out where on the page a particular piece of text was, what font size it had, whether it was even visible, etc. And then there's JavaScript, of course.
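
To make the "is it even visible" part concrete, here is a rough sketch of the cheap static approach, using only Python's standard `html.parser`. The names `visible_text` and `VisibleTextExtractor` are made up for illustration. It is deliberately not a renderer: it only assumes that script/style content, the `hidden` attribute, and inline `display:none` / `visibility:hidden` mean "not visible". It knows nothing about external stylesheets, position on the page, font size, or anything JavaScript does.

```python
from html.parser import HTMLParser

# Elements whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Void elements have no end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    """Heuristic: the `hidden` attribute or an inline style that hides the element."""
    attrs = dict(attrs)
    if "hidden" in attrs:
        return True
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._stack = []        # (tag, hides_subtree) for each open element
        self._hidden_depth = 0  # number of open elements that hide their subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        hides = tag in SKIP_TAGS or _is_hidden(attrs)
        self._stack.append((tag, hides))
        if hides:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                for _, hides in self._stack[i:]:
                    if hides:
                        self._hidden_depth -= 1
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._hidden_depth == 0:
            text = " ".join(data.split())  # collapse whitespace
            if text:
                self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Anything hidden through a CSS class, pushed off-screen, rendered in a tiny font, or injected by a script goes straight past a filter like this, which is why I ended up needing a real rendering step.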
So what I'm saying is that storing a couple of kilobytes is probably not the most costly part of indexing a page.