
dougb5 · 1y ago · 0 points

When you say "big hitters", I guess you mean the well-known corporate crawlers like GPTBot (one of OpenAI's). Yes, these do tend to identify themselves -- and they tend to respect robots.txt, too -- but they're a small part of the problem, from my perspective. That's because there's also a long tail of anonymous twerps training models using botnets of various kinds, and these folks do not identify themselves; in fact, they try to look like ordinary users. Collectively these bots use way more resources than the name-brand crawlers.

(My site is basically a search engine, which complicates matters because there's effectively an infinite space of URLs. Just one of these rogue bots can scrape millions of pages from tens of thousands of IPs; and I think there are hundreds of the bots at any given moment...)

radium3d · 1y ago

Actually, it's the Korean and Chinese bots that hit pretty hard without much (or any) throttling.
