I don't use robots.txt because Google says it doesn't stop them from including the site in search results: https://support.google.com/webmasters/answer/6062608. I don't know whether returning an HTTP 403 will, but it seems worth a try.
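For illustration, here's a minimal sketch of what I mean, assuming a Flask app (the framework and the user-agent check are just placeholders for whatever your stack uses; user-agent matching is trivially spoofable, so treat this as illustrative only):

    # Sketch: return 403 to anything identifying itself as Googlebot.
    from flask import Flask, request, abort

    app = Flask(__name__)

    @app.before_request
    def block_googlebot():
        ua = request.headers.get("User-Agent", "")
        if "Googlebot" in ua:
            abort(403)  # crawler gets a 403 instead of the page

    @app.route("/")
    def index():
        return "hello"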
I also looked into banning their IP ranges (that would have been my preferred option), but if I remember correctly the ranges are subject to change, and it seemed like overkill to write a scraper for the published list that would then have to regenerate a config file and reload a service. Roughly the sketch below.
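Something like this, assuming the ranges are published as JSON at the URL below (I haven't verified the current location or schema, and the nginx paths are placeholders):

    # Sketch of the rejected option: pull published Googlebot IP ranges,
    # write them into an nginx deny list, and reload nginx.
    import ipaddress
    import json
    import subprocess
    import urllib.request

    # Assumed location of the published ranges; check the current URL.
    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    with urllib.request.urlopen(RANGES_URL) as resp:
        data = json.load(resp)

    cidrs = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            ipaddress.ip_network(cidr)  # sanity-check before writing it out
            cidrs.append(cidr)

    # Placeholder path; nginx "deny" returns 403 for matching clients.
    with open("/etc/nginx/conf.d/googlebot-deny.conf", "w") as f:
        for cidr in cidrs:
            f.write(f"deny {cidr};\n")

    subprocess.run(["nginx", "-s", "reload"], check=True)

You'd then have to run that on a schedule to pick up changes, which is exactly the part that felt like overkill.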