0 points · mattmaroon · 1y ago

Are sites really that averse to having a few more crawlers than they already do? It would seem that it’s only a monopoly insofar as it’s really expensive to do and almost nobody else thinks they can recoup the cost.

5 comments · 2 top-level

natebc · 1y ago · 3 in thread

A few?

We are routinely fighting off hundreds of bots at any moment. Thousands and thousands per day, easily. US, China, Brazil, from hundreds of different IPs and dozens of different (and falsified!) user agents, all ignoring robots.txt and pushing over services that are needed by human beings trying to get work done.

EDIT: Just checked our anubis stats for the last 24h

CHALLENGE: 829,586

DENY: 621,462

ALLOW: 96,810

This is with a pretty aggressive "DENY" rule for a lot of the AI related bots and on 2 pretty small sites at $JOB. We have hundreds, if not thousands of different sites that aren't protected by Anubis (yet).

Anubis and efforts like it are a godsend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.

zrm · 1y ago

This seems like two different issues.

One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Uploading each resource on the site to each of a thousand bots once is going to be irrelevant to a site serving millions of users (because it's a trivial percentage) and to a site with a small amount of content (because it's a small absolute number), which together constitute the vast majority of all sites.
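A minimal sketch of what such a "changed since" facility might compute, assuming the site keeps a last-modified time for each resource. `Resource`, `changed_since`, and the sample URLs are hypothetical names for illustration, not an existing standard or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Resource:
    """Hypothetical record: one URL plus the time it last changed."""
    url: str
    last_modified: datetime


def changed_since(resources, since):
    """Return URLs modified strictly after `since`, oldest change first."""
    changed = [r for r in resources if r.last_modified > since]
    changed.sort(key=lambda r: r.last_modified)
    return [r.url for r in changed]


# Made-up site contents, purely for illustration.
SITE = [
    Resource("/index.html", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    Resource("/about.html", datetime(2023, 6, 1, tzinfo=timezone.utc)),
    Resource("/blog/post-2.html", datetime(2024, 3, 9, tzinfo=timezone.utc)),
]

# A crawler that last visited on 2024-01-01 only needs two of the three pages.
last_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(changed_since(SITE, last_crawl))
```

With something like this, each bot's repeat visit costs the site one small list plus the changed resources, rather than a full re-crawl.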

The other is, there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed and ignore robots.txt. But then you set traps like disallowing something in robots.txt and then ban anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.
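A rough sketch of that robots.txt trap, using Python's standard `urllib.robotparser`. The `TrapFilter` class, the `/trap/` path, and the sample IPs and bot names are assumptions made up for illustration. The ban is keyed on IP address, which is the simplest choice and would not catch a scraper that keeps changing addresses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /trap/ is linked nowhere a human would click,
# so only clients that ignore the Disallow rule ever request it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


class TrapFilter:
    """Ban any IP that requests a path robots.txt disallows."""

    def __init__(self, robots_txt):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.banned = set()

    def allow(self, ip, user_agent, path):
        """Return True if the request should be served."""
        if ip in self.banned:
            return False
        if not self.rules.can_fetch(user_agent, path):
            # The client walked into the trap: ban it from now on.
            self.banned.add(ip)
            return False
        return True


f = TrapFilter(ROBOTS_TXT)
print(f.allow("203.0.113.7", "BadBot", "/index.html"))    # served
print(f.allow("203.0.113.7", "BadBot", "/trap/page"))     # trapped and banned
print(f.allow("203.0.113.7", "BadBot", "/index.html"))    # still banned
print(f.allow("198.51.100.2", "PoliteBot", "/index.html"))  # unaffected
```

A crawler that honors robots.txt never requests `/trap/`, so it never ends up in the banned set.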

fc417fc802 · 1y ago

> then you set traps like disallowing something in robots.txt and then ban anything that tries to access it

That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis. All you can be certain of is that a significant portion of your traffic is abusive.

That results in aggressive filtering schemes which in turn means permitted bots must be whitelisted on a case by case basis.

mattmaroon (OP) · 1y ago

I mean sure, but if there were three search engines instead of one, would you disallow two of them? The spam problem is one thing, but I don't think having ten search engines rather than two is going to destroy websites.

The claim that search is a natural monopoly because of the impact on websites of having a few more search competitors scanning them seems silly. I don’t think it’s a natural monopoly at all.

robinsonb5 · 1y ago

A "few" more would be fine - but the sheer scale of the malicious AI training bot crawling that's happening now is enough to cause real availability problems (and expense) for numerous sites.

One web forum I regularly read went through a patch a few months ago where it was unavailable for about 90% of the time due to being hammered by crawlers. It's only up again now because the owner managed to find a way to block them that hasn't yet been circumvented.

So it's easy to see why people would allow googlebot and little else.
