Question: How would you handle spidering of your website from AWS and Azure?

1 point | rrwo | 3y ago | 0 comments

At work, we recently noticed a massive increase in hits on the website, mostly coming from cloud services like Amazon and Azure.

The hits suggest that somebody is spidering our site, using fake user agent strings that pretend to be regular web browsers.

In general, the hit rate from each IP is low (1-15 hits/minute), but we are getting hit from many IPs at once, so the load on our web server has increased significantly, in some cases affecting the ability of actual customers to use our website.

Because this all happened at the same time, we suspect it is the same organization doing this. (We have complained to the abuse contacts for those IP ranges.)

We've not found a pattern in the traffic that would allow us to write blocking rules using something like ModSecurity, other than the IP addresses being associated with cloud services.

We don't know who is doing this or why, but we'd like to limit or block the traffic.

There are several ways to do this:

1. Block traffic at the iptables level.

This is what I've done temporarily, and it is my preferred option. We're not going to get many real customers using our website through AWS etc.

But the downside is that smaller, legitimate search engines and other services that run on cloud providers will be unable to index our site. (This is the concern raised by some of my colleagues.)
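For option 1, here is a rough sketch of how the block list could be generated in Python. It assumes the input follows the layout of the ip-ranges.json file that AWS publishes (a `prefixes` list whose entries carry `ip_prefix` and `service`). The sample data, the set name `cloud-block`, and both function names are made up for illustration. Azure publishes its ranges in a different format, which is not handled here.

```python
import ipaddress

# Hypothetical sample in the AWS ip-ranges.json layout (documentation-only addresses).
SAMPLE = {
    "prefixes": [
        {"ip_prefix": "198.51.100.0/25", "service": "EC2"},
        {"ip_prefix": "198.51.100.128/25", "service": "EC2"},
        {"ip_prefix": "203.0.113.0/24", "service": "S3"},
    ]
}


def cloud_networks(ip_ranges, services=("EC2",)):
    """Return the IPv4 networks for the chosen services, merged where adjacent."""
    nets = [
        ipaddress.ip_network(entry["ip_prefix"])
        for entry in ip_ranges.get("prefixes", [])
        if entry.get("service") in services
    ]
    # collapse_addresses merges adjacent or overlapping networks into fewer entries.
    return list(ipaddress.collapse_addresses(nets))


def ipset_lines(networks, set_name="cloud-block"):
    """Build lines in the format read by `ipset restore` (set name is an assumption)."""
    lines = [f"create {set_name} hash:net"]
    lines += [f"add {set_name} {net}" for net in networks]
    return lines


if __name__ == "__main__":
    print("\n".join(ipset_lines(cloud_networks(SAMPLE))))
```

The output could be piped into `ipset restore`, with a single rule such as `iptables -I INPUT -m set --match-set cloud-block src -j DROP` referencing the set, so the rule count stays small even with thousands of ranges.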

2. Block traffic from these IP ranges via the web server.

This will allow us to log the traffic, and show a message stating that we are blocking their requests due to overloading.

The downside is that it may still block legitimate traffic, and that badly written bots may ignore HTTP errors and continue hitting our site.

We could integrate that with fail2ban so that a host that repeatedly ignores these errors gets blocked for a temporary period of time.
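To make option 2 concrete, here is a minimal sketch written as Python WSGI middleware. Our actual web server setup may look quite different, so treat it as an illustration only: the blocked range, the 403 status, the message text, and the name `block_cloud_ranges` are all assumptions.

```python
import ipaddress

# Hypothetical block list; in practice this would hold the cloud providers' ranges.
BLOCKED = [ipaddress.ip_network("198.51.100.0/24")]


def block_cloud_ranges(app, blocked=BLOCKED):
    """Wrap a WSGI app so that requests from blocked networks get an explanatory 403."""

    def middleware(environ, start_response):
        addr = ipaddress.ip_address(environ.get("REMOTE_ADDR", "0.0.0.0"))
        if any(addr in net for net in blocked):
            body = b"Requests from this network are blocked because of excessive load.\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Everything else passes through to the real application untouched.
        return app(environ, start_response)

    return middleware
```

Because the refusal is an ordinary response, it shows up in the access log, which is what a fail2ban filter would need in order to spot hosts that keep retrying.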

3. Use a grey list with severe rate limiting.

Requests from these IP ranges will be rate limited (possibly by tracking the net block) to something like 5 hits/minute.

The downside is that it won't protect against dozens of hosts hitting the site from different net blocks. And I suspect the bots will ignore HTTP errors anyway.

However, we could integrate this with fail2ban, so that a host that repeatedly ignores HTTP errors gets blocked for a temporary period of time.

This is my second choice.
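The rate limiting in option 3 could be sketched roughly as below. This is a sliding-window counter keyed by net block, not anything we actually run. The /24 grouping, the IPv4-only handling, and the class name `NetblockRateLimiter` are assumptions. The default of 5 hits per 60 seconds mirrors the figure mentioned above.

```python
import ipaddress
from collections import defaultdict, deque


class NetblockRateLimiter:
    """Allow at most `limit` requests per `window` seconds for each IPv4 net block."""

    def __init__(self, limit=5, window=60.0, prefix_len=24):
        self.limit = limit
        self.window = window
        self.prefix_len = prefix_len
        self._hits = defaultdict(deque)  # net block -> timestamps of allowed requests

    def _netblock(self, ip):
        # strict=False masks off the host bits, e.g. 198.51.100.7 -> 198.51.100.0/24.
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (in seconds) is within the limit."""
        hits = self._hits[self._netblock(ip)]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Hosts in the same /24 share one budget, so rotating through neighbouring addresses does not help a bot, but hosts in unrelated net blocks each get their own allowance, which is the weakness noted above.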

4. Increase the load capacity on the web server.

Our web server has handled pages going viral and getting thousands of requests/minute from actual humans, with no significant effect on performance.

So I'm not a fan of increasing the capacity simply to handle bots that are unlikely to benefit the company in any way.