So, this issue isn't about sites that Google can't crawl in their entirety; it's about sites where Google discards pages it has already crawled. If a site has fewer than [large number] pages, there would be no need to worry: Google could simply index them all. And their indexing algorithm isn't operating naively either. For sites with a large number of pages, there is plenty of analysis they can do, such as checking whether a page contains coherent text, to determine whether its information is worth indexing.
In the case described here, though, these pages actually were indexed at one point; Google simply decided that once they reached a certain age, they were no longer worth remembering. They could just as easily have decided to keep them.