
cookiecaper · 9y ago

First, search engines are scrapers. No need to make a distinction.

Second, search engines don't always respect robots.txt. They sometimes do. Even Google itself says it may still contact a page that has disallowed it. [0]

Third, robots.txt is just a convention. There's no reason to assume it has any binding authority. Users should be able to access public HTTP resources with any non-disruptive HTTP client, regardless of the end server's opinion.

[0] "You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file." / http://archive.is/A5zh8


5 comments · 3 top-level

hobs · 9y ago

And the original point of my comment was that doing this is extremely rude and not appropriate, not that it couldn't be done or that others weren't doing it.

Feel free to send any request to any server you want; it is certainly up to them to decide whether or not to serve it. But that doesn't absolve you of guilt for scraping someone's site when they explicitly ask you not to.

cookiecaper (OP) · 9y ago

Please don't conflate "extremely rude", "not appropriate", and "guilt". Two of these are subjective opinions about what constitutes good citizenship. The last one is a legal determination that has the potential to deprive an individual of both his money and liberty. We're discussing whether these behaviors should be legal, not whether they are necessarily polite.

hobs · 9y ago

I never did.

You are posting in a comment thread underneath my reply about rudeness and impoliteness and, ironically, being somewhat rude yourself by telling me off for conflating things I never conflated.

nostrademons · 9y ago

In the Google quote you link to, Google is not contacting your page. Rather, Google will index pages that are only linked to, which it has never crawled, and will serve up those pages if the link text matches your query. That's how you get those search results where the snippet is "A description of this page has been blocked by robots.txt" or similar.

There's a somewhat related issue where to ensure your site never exists in Google, you actually need to allow it to be crawled, because the standard for that is a "<meta name=noindex ...>" tag, and in order to see the meta noindex, the search engine has to fetch the page.
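To make the ordering problem concrete, here is a minimal sketch in Python. It assumes the tag takes the common `<meta name="robots" content="noindex">` form; the function name `can_see_noindex`, the agent name `ExampleBot`, and the example.com URLs are hypothetical, and this is not a description of how any real search engine is implemented. It only shows that a crawler that honors robots.txt never gets far enough to read the tag on a disallowed page.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def can_see_noindex(robots_txt, url, html, agent="ExampleBot"):
    """Return True only if the crawler may fetch the page AND it carries noindex.

    `html` stands in for the response body; a real crawler would only
    obtain it after the robots.txt check passes.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        return False  # page is never fetched, so the tag is never read
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'

blocked = can_see_noindex(ROBOTS, "https://example.com/private/page.html", PAGE)
allowed = can_see_noindex(ROBOTS, "https://example.com/public/page.html", PAGE)
```

Under these assumptions `blocked` is False and `allowed` is True: the same noindex tag is only honored on the page the crawler is permitted to fetch.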

tedunangst · 9y ago

Google will put forbidden pages in its index. It doesn't scrape them. (The URL to the page exists even without visiting the page.)
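A toy illustration of that point, as a sketch only: an index can associate a URL with words purely from the link text on some other page, without ever requesting the target. The names `AnchorCollector` and `index_anchor_text` and the example.com URL are made up for illustration, and nothing here claims to mirror Google's actual pipeline.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Records (href, anchor text) pairs found on a linking page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(linking_html):
    """Map each lowercased anchor-text word to the URLs it points at.

    Only the linking page is parsed; the target URLs are never fetched.
    """
    index = defaultdict(set)
    collector = AnchorCollector()
    collector.feed(linking_html)
    for href, text in collector.links:
        for word in text.lower().split():
            index[word].add(href)
    return index


LINKING_PAGE = (
    '<p>See the <a href="https://example.com/private/report.html">'
    "Quarterly Report</a>.</p>"
)
index = index_anchor_text(LINKING_PAGE)
```

Here a query for "quarterly" would match the private URL even though the only page parsed was the one linking to it.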
