There have been numerous posts on HN about people getting slammed by bots, especially LLM scrapers, to the tune of many, many dollars and terabytes of data, burning bandwidth and driving up server costs.
I suspect these are largely other bots pretending to be LLM scrapers. Does anyone even check whether the bots' IP ranges actually belong to the AI companies?
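For what it's worth, the big crawlers can be checked: Google documents a reverse-DNS-then-forward-confirm check for Googlebot, and several AI companies publish IP ranges for their crawlers. A minimal Python sketch of the Googlebot check (the domain suffixes are the ones Google documents; the sample IP and everything else is just illustrative):

    import socket

    def verify_googlebot(ip: str) -> bool:
        """Reverse-DNS the IP, check the hostname is under Google's crawler
        domains, then forward-resolve the hostname to confirm it maps back
        to the same IP, so a spoofed PTR record alone isn't enough."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            _, _, addresses = socket.gethostbyname_ex(hostname)
            return ip in addresses
        except OSError:
            # no PTR record or lookup failure: treat as not verified
            return False

    # e.g. run this over source IPs of requests claiming a Googlebot user agent
    print(verify_googlebot("66.249.66.1"))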
When I worked at Wikimedia (so ending ~4 years ago) we had several incidents of bots getting lost in a maze of links within our source repository browser (Phabricator), which could account for >50% of the load on some pretty powerful Phabricator servers (something like 96 cores, 512GB RAM). This happened despite those URLs being excluded via robots.txt and some rudimentary request throttling being in place. The scrapers were using lots of different IPs simultaneously and did not seem to respect any kind of sane rate limits. If Googlebot and one or two other scrapers hit at the same time, it was enough to cause an outage or at least seriously degrade performance.
Eventually we got better at rate limiting and put more URLs behind authentication, but it wasn't an ideal situation and would have been quite difficult to deal with had we been much more resource-constrained or less technically capable.
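For anyone wondering what "rudimentary request throttling" usually looks like, a per-client token bucket is the typical starting point. A rough Python sketch (the capacity and refill numbers are made up, and keying on IP alone is exactly what IP-rotating scrapers defeat, so real setups often key on user agent, ASN, or session instead):

    import time
    from collections import defaultdict

    CAPACITY = 30      # allowed burst per client (made-up number)
    REFILL_RATE = 1.0  # tokens added per second (made-up number)

    buckets = defaultdict(lambda: {"tokens": float(CAPACITY), "last": time.monotonic()})

    def allow_request(client_key: str) -> bool:
        """Token bucket keyed on whatever identifies the client (an IP here)."""
        b = buckets[client_key]
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the bucket size
        b["tokens"] = min(CAPACITY, b["tokens"] + (now - b["last"]) * REFILL_RATE)
        b["last"] = now
        if b["tokens"] >= 1:
            b["tokens"] -= 1
            return True
        return False  # caller would respond with HTTP 429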
Sounds like a fun project for an AbuseIPDB contributor. Could look for fake Googlebots, Bingbots, etc., too.
What better way to show the effectiveness of your solution than to help create the problem in the first place?