The entire point of protocols is to precisely define the terms of communication. The status code is '200 OK', not '200 OK/Asterisk'. But of course if lawyers didn't force themselves into the situation, they'd be out of jobs.
As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.
This would be copyright infringement, since the content of the page is a substantive unique work that is automatically copyrighted by its author. A site that doesn't want you scraping its content is not going to want you posting dumps of its pages. Much like BitTorrent, they'd get into the protocol and send subpoenas to the ISPs behind the IPs that serve their pages, and use that info to sue the customer.
When my company was shut down by a legal threat related to scraping, I did suggest to my lawyer that we create something like a browser extension that would grab the data we needed out of normal client-side browsing sessions. This wouldn't be as nice as controlling the flow of information ourselves but it would've worked OK. My lawyer strongly suggested avoiding that as it could've been construed as conspiratorial conduct that would've made criminal prosecution more likely.
Perhaps this illustrates the malleability of the legal system: it's an inherently human construct that pits a plaintiff against a defendant, and given a big enough war chest and persuasive-enough arguments, catastrophe can be avoided -- by Google; perhaps not by you, me, or someone else.
I think imbuing technical protocols with legal implications would be even worse than the current situation. Changing anything in a protocol would then require changing the law, and getting a protocol implementation slightly wrong would carry real-world legal repercussions, on the order of accidentally releasing your work into the public domain rather than retaining copyright. Let the lawyers make the law, and check the human terms of service before using the data. Trying to out-lawyer the lawyers is like challenging a hedgehog to a butt-kicking brawl.
"The User-Agent request-header field contains information about the user agent originating the request. This is for [...] the tracing of protocol violations [...]. User agents SHOULD include this field with requests"
Many scrapers disregard this part of the protocol. Of course, whether a headless browser should send a different UA is an interesting question.
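For what it's worth, honoring that SHOULD costs a scraper about one line. A minimal sketch in Python using the standard library's urllib; the bot name and contact URL are made up for illustration, and build_request is just a hypothetical helper, not anything from a real scraper:

```python
import urllib.request

# Hypothetical bot name and contact URL, purely for illustration.
BOT_USER_AGENT = "ExampleScraper/0.1 (+https://example.com/bot-info)"


def build_request(url: str, user_agent: str = BOT_USER_AGENT) -> urllib.request.Request:
    """Build a GET request that identifies the client via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("https://example.com/")
    # urllib normalizes header names, so the lookup key is "User-agent".
    print(req.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would actually send it. The point is only that the site operator can then see who is asking and how to reach them.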
If you're entering a country, do its laws not apply to you until you've seen a copy of them? "Oh, sorry, no one told me theft is illegal here. Where does it say that? Oh, I see. Okay. I'll stop now. Thanks for letting me know."
If you cross the border without necessary documents, does that country have no right to detain you, simply because you haven't checked the laws?
Just because a website is visible and public doesn't mean its content is public domain. It just means that your first order of business as a user should be to check the terms of service. Sure, most people using a website probably don't need to--same as not needing to check a country's stance on murder--and so can just use the website as intended without violating the terms. But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.
Also, we don't even know how many laws there are in the United States, so I'd say knowing their content is impossible.
I don't need to check your terms of service if I'm doing something that I'm allowed to do by law anyway; the TOS cannot deny me those rights (they might, of course, grant me additional rights provided that I follow certain conditions).
So I'm not so sure that police will escort you out of a Walmart because they caught you taking a picture of the parking lot with your smartphone.