However, surely they have enough smart engineers there to realize that running a bot at full speed (and, based on other reports, completely ignoring robots.txt) will get them blocked by a lot of sites.
If they just ran a well-behaved spider, almost no one would mind. Getting crawled is a fact of life on the internet, and most website owners recognize it as an essential cost of doing business. Once you get a reputation as a bad spider, though, that is very hard to shake.
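For what "well behaved" means in practice: at minimum, parse robots.txt and honor it before fetching. A minimal sketch using Python's stdlib `urllib.robotparser` (the user-agent string and rules here are illustrative, not any particular bot's):

```python
# Sketch of a polite spider's first step: obey robots.txt.
# The robots.txt content would normally be fetched from the site;
# here it is passed in directly to keep the example self-contained.
from urllib import robotparser


def make_checker(robots_txt: str, user_agent: str):
    """Return a function that says whether this agent may fetch a URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(user_agent, url)


allowed = make_checker(
    "User-agent: *\nDisallow: /private/",
    "ExampleBot/1.0",  # illustrative name
)
```

A real spider would also throttle itself (honoring `Crawl-delay` where present) and back off on errors, but the robots.txt check is the part sites notice first.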
Also, does it really require a specific "gem"? This is plain HTTP request filtering; the router (as in the actual router, the metal box with the network cables) can probably do it by itself these days.
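To illustrate that no dedicated library is needed, here is a hedged sketch of user-agent filtering as a plain WSGI middleware (the blocked names are made up for the example; a real deployment would more likely do this at the proxy or firewall, as the comment suggests):

```python
# Sketch: block requests by User-Agent substring at the app layer,
# no third-party dependency. BLOCKED_AGENTS is illustrative only.
BLOCKED_AGENTS = ("BadBot", "GreedyCrawler")


def is_blocked(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in BLOCKED_AGENTS)


def block_bots(app):
    """WSGI middleware that answers 403 for blocked user agents."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper
```

The same substring match can be expressed as a one-line nginx `if ($http_user_agent ~* ...)` rule or an equivalent filter on the edge device, which is the point: this is commodity filtering, not something that needs a gem.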
Also, why would they not respect the 403? Crawlers just fetch anything they can find; it is not a targeted attack.
Edit: Nice try on the vote brigade guys. lol
https://www.nerdcrawler.com/robots.txt
The domain serving the images is allowing everything:
- https://www.reddit.com/r/flask/comments/161fqml/what_to_do_a...
- https://www.reddit.com/r/sysadmin/comments/1ahhzdg/has_anyon...