Edit: I’ve seen somewhere that Stack Overflow blocks all EC2 machines. I don’t think that’s the best solution, considering how many legit services run there. Also, the hits come from different IPs.
1) You can block EC2 wholesale. You've mentioned issues with this, and it can be bypassed via a VPN or another network. EC2 is attractive because it's so cheap (with spot instances, it starts at 0.3 cents per hour), but it's not the only option.
2) Timing. Normal traffic isn't rapid fire. Many scrapers, however, fire off their scripts as quickly as possible. Block traffic that doesn't have enough meaningful pauses.
3) Report addresses to Amazon. I really don't know if they'd take action.
4) Reverse lookup, or whitelist addresses. I know if it's a legitimate source (like Flipboard) they'd probably work with you at least a little bit. Reverse lookup might not be successful, but maybe that can help you whitelist any legit sources that map their AWS IP to a legit DNS name. Most scrapers use the AWS external domain name. Also, I imagine legit sources give you a distinctive user agent, so that can help you let traffic through.
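To make point 4 concrete, here is a minimal sketch of a reverse-lookup check. It assumes that default EC2 public DNS names end in `.amazonaws.com`; the function names and the `trusted_suffixes` parameter are made up for illustration, and you would fill that list with the domains of sources you have actually verified. A PTR record alone can be faked by whoever controls the address block, so treat the result as a hint rather than proof.

```python
import socket

# Assumption: instances that never set up their own reverse DNS resolve to
# the default AWS public hostname, which ends with this suffix.
GENERIC_CLOUD_SUFFIXES = (".amazonaws.com",)


def classify_hostname(hostname, trusted_suffixes=()):
    """Classify a reverse-DNS name as trusted, generic-cloud, other or unknown.

    trusted_suffixes is a hypothetical whitelist such as (".example.com",).
    """
    if hostname is None:
        return "unknown"  # reverse lookup failed or returned nothing
    host = hostname.lower().rstrip(".")
    if host.endswith(tuple(trusted_suffixes)):
        return "trusted"
    if host.endswith(GENERIC_CLOUD_SUFFIXES):
        return "generic-cloud"
    return "other"


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None
```

Something like `classify_hostname(reverse_lookup(ip), trusted_suffixes=(".example.com",))` could then feed a whitelist decision, ideally combined with the user agent check mentioned above.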
However, if you have a public resource, this is simply an issue you have to deal with. Analytics: I'd just filter scraped traffic out of your analytics. Content duping: blocking scrapers won't stop this. If someone stealing your content can't scrape at 0.5 cents per hour, they'll pay someone 5 cents an hour to copy/paste. You just have to use the same diligence others use, in terms of reporting to Google, etc. Performance: use Varnish/nginx/etc. to combat the performance hit from scrapers.
It seems hard to limit legitimate uses of a free resource without changing the requirements on how users access the site (require account signup, use CAPTCHAs, use CSS/JS to only display properly in a browser).
As one who does a lot of scraping, I have encountered few barriers that can't be (legally) overcome with a reasonable amount of effort.
On what basis are you claiming this? It sounds like it would be true under the fair-use provisions of US copyright law, but it's certainly not true in the UK (and by extension, I presume, for anything you do with content served from the UK, though I've yet to read a thorough treatment of how the [i.e. any] law works with server locations).
Commercial considerations are usually much broader than selling too: not only could you not resell it but you couldn't distribute it (whether by publishing or otherwise).
"Cucumbertown authorizes you to view, download and/or print the Materials only for personal, non-commercial use, provided that you keep intact all copyright and other proprietary notices contained in the original Materials."
So, for example, I could grab everything from the site (minus copyrighted images, etc.) and make my own personal DB of the content. Obviously that's a lot of effort for a little reward for one person, but if I created a repo with a set of tools for people to do this for themselves it could become a big legitimate source of "scraping" traffic.
But then I just made Twicsy fast enough to deal with the traffic so I don't need to worry about it anymore. I guess it depends on your business model whether or not that will work for you.
We actually found out who one of the worst ones was and contacted them. It turned out it was a major legit proxy, but they had a bug in their proxy code that caused refetching of one of our URLs over and over. They were very easy to work with and they fixed the bug.
behavioral modeling - rate limiting, bandwidth restrictions, etc
identity verifications - make sure they are running the browser they say they are, allow Google and other search engines by whitelisting their IPs, block others that are pretending to be Google, etc
code obfuscation - make it hard for them to scrape your code. Change up the CSS, etc.
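As a rough illustration of the behavioral modeling item, here is a small sketch that flags clients whose recent requests contain too few human-scale pauses. The class name, the window size and the thresholds are all assumptions chosen for the example, not values from any real product; you would tune them against your own traffic.

```python
from collections import defaultdict, deque


class PauseDetector:
    """Flags a client whose recent requests contain too few long pauses."""

    def __init__(self, window=20, min_pause=2.0, min_pauses=3):
        self.window = window          # number of recent requests to examine
        self.min_pause = min_pause    # seconds that count as a real pause
        self.min_pauses = min_pauses  # pauses expected within a full window
        # Per-client timestamps; deque drops the oldest entry automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def looks_automated(self, client, timestamp):
        """Record one request and report whether the client looks scripted."""
        times = self.history[client]
        times.append(timestamp)
        if len(times) < self.window:
            return False  # not enough data to judge yet
        ordered = list(times)
        gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
        pauses = sum(1 for gap in gaps if gap >= self.min_pause)
        return pauses < self.min_pauses
```

The `client` key could be an IP, a session cookie, or whatever identifier you trust most. A flagged client does not have to be blocked outright; throttling it or showing a CAPTCHA is a gentler response if you are worried about false positives.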
OR you can use an automated service to do all this for you. Check out www.distil.it. Full disclosure, I'm the CEO of Distil.
Hunting down bots is a waste of time and effort better spent elsewhere.
If the crawler ignores your robots.txt, check its name in your access logs. Often, people build things and set them loose without thinking about the consequences. Many crawlers have a homepage or programmer contact information somewhere on the web. Let them know they are hammering your website.
What is the rate at which requests are being made? Are they making 1000 requests per second? Downloading tons of images? You should probably just ignore it if it is less than 1 request per second.
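If you don't know the answer yet, a quick pass over the access log will tell you. This is a minimal sketch that assumes Common Log Format lines in chronological order (client IP first, timestamp in square brackets); the function name is made up, and the rate is averaged over the time between each client's first and last request, with a one-second floor.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: Common Log Format, e.g.
# 203.0.113.5 - - [10/Oct/2013:13:55:36 +0000] "GET / HTTP/1.1" 200 2326
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def requests_per_second(lines):
    """Return {ip: average requests per second} for the given log lines."""
    first, last, count = {}, {}, defaultdict(int)
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that don't look like log entries
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        first.setdefault(ip, when)
        last[ip] = when
        count[ip] += 1
    rates = {}
    for ip, n in count.items():
        span = (last[ip] - first[ip]).total_seconds()
        rates[ip] = n / max(span, 1.0)  # floor avoids dividing by zero
    return rates
```

Sorting the result by rate shows at a glance whether anyone is actually above the 1 request per second mark or whether the traffic is safe to ignore.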
s/Ask HN/Ask PG/