I can’t speak for other sites, but we’re pretty good at picking up on crawlers that don’t have a unique UA. The problem is that we’re going to have a hard time differentiating your well-behaved crawler from more malicious crawlers, and you’re going to get caught in the crossfire.
> if you can get away with raw HTTP requests grabbing and parsing the HTML without pulling down stylesheets, JS, etc. do it.
If you combine that with the lack of an identifying UA, there's unfortunately a good chance you'll get caught in the crossfire during an actual attack. That said, it's otherwise good advice, with one caveat: if you're trying to pass for a regular browser, fetching bare HTML without stylesheets or JS is really going to stand out.
> I would never expect a sysadmin to contact me because frankly they aren't paid to.
I am. And as long as you're being transparent about your activity (see: UA), I don't mind working with you directly instead of going through your provider. I understand that writing good crawlers is a learning experience; mistakes do happen. When I send abuse reports, the result is usually just a slap on the wrist, but not everyone is that lucky.
But, if your UA has contact info, I can:
1. Easily rate limit or block you until the issue is resolved
2. Contact you directly, explaining exactly what’s wrong
3. Easily unblock you once it’s fixed
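For illustration, here's a minimal sketch of what an identifying UA might look like. The crawler name, project URL, and contact address are all made up; substitute your own.

```python
import urllib.request

# Hypothetical crawler name, info URL, and contact address -- use your own.
UA = "ExampleBot/1.2 (+https://example.com/bot; contact: abuse@example.com)"

# Every request carries the identifying UA, so a sysadmin reading logs can
# tell exactly who you are and how to reach you.
req = urllib.request.Request(
    "https://example.com/page",
    headers={"User-Agent": UA},
)
```

The important part is the contact info in the string itself: it shows up in access logs with no extra digging required.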
Sure, I'm not going to be happy about it, but I'll be a lot happier than if you try to blend in, in which case you'll get no sympathy from me.
Unfortunately, most sites don’t respond that way and would rather just block anything remotely suspicious. But since you can always change your IP address, maybe try with an identifiable UA first—please? :)
Edit: Also, a few recommendations to add:
1. Be prepared to handle obscure HTTP status codes. 503 indicates you need to back off, and frequent 500, 502, or 504 responses mean the same thing. 429 and 420 mean you're being rate limited; slow down. 410 means you should stop requesting the given URL entirely. 400 or 405 usually means you have a bug. Any unrecognized 4XX or 5XX error should be flagged and examined so you can handle it better in the future.
2. You can send an X-Abuse-Info header along with a generic UA if you want capable sysadmins to be able to identify you while avoiding blocks from inexperienced webmasters.
3. Don’t ignore abuse reports.
4. Try to be consistent and ramp up slowly. It's harder to cope with unnaturally abrupt increases in traffic.
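A rough sketch of how a crawler might encode the status-code rules above. The function name, the doubling backoff, and the specific delay values are all my own assumptions, not a standard; tune them to your target.

```python
BASE_DELAY = 1.0    # seconds between requests when everything is healthy
MAX_DELAY = 300.0   # cap on backoff

dead_urls = set()   # URLs that returned 410: never request these again


def next_delay(status: int, url: str, delay: float) -> tuple[float, bool]:
    """Given one response, return (new inter-request delay, should_retry)."""
    if status in (429, 420) or status in (500, 502, 503, 504):
        # Rate limited or server under strain: back off exponentially.
        return min(delay * 2, MAX_DELAY), True
    if status == 410:
        # Gone for good: remember it and stop asking.
        dead_urls.add(url)
        return delay, False
    if status in (400, 405):
        print(f"probable crawler bug at {url}: HTTP {status}")
        return delay, False
    if status >= 400:
        # Unrecognized 4XX/5XX: flag for a human to examine.
        print(f"unhandled HTTP {status} at {url}; review and add handling")
        return delay, False
    # Success: ease back toward the baseline rather than snapping to it,
    # which also gives you the slow ramp-up from point 4.
    return max(delay * 0.9, BASE_DELAY), True
```

Treating the delay as state that only shrinks gradually keeps traffic consistent instead of oscillating between bursts and silence.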