From the ABA: "Exceeds authorized access" is defined in the Computer Fraud and Abuse Act (CFAA) to mean "to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter."
To prove you have committed this terrible felony, the FBI will now demand that Apple assist in disabling the secure enclave of your device in order to access your browser history. But remember, they only need to do this because they aren't allowed to MITM all TLS and "acquire" -- not "collect" -- every HTTP request your machine ever makes. </s>
general.useragent.override.netflix.com;Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0
in about:config... what were they thinking?
Allow by IP range? You can probably find a somewhat accurate range for Google and whoever's crawlers.
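A rough sketch of what "allow by IP range" could look like on the server side, using Python's standard `ipaddress` module. The networks below are documentation-only placeholder ranges, not real crawler ranges, and `ip_is_allowed` is a made-up helper name; you would have to supply ranges you have verified yourself, and they may change over time.

```python
import ipaddress

# Hypothetical ranges taken from the documentation-only blocks; substitute
# ranges you have actually verified for the crawler in question.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def ip_is_allowed(ip, networks=ALLOWED_NETWORKS):
    """Return True if `ip` falls inside any allow-listed network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IPv4/IPv6 address at all.
        return False
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-version checks simply evaluate to False.
    return any(addr in net for net in networks)
```

For example, `ip_is_allowed("192.0.2.55")` passes with the placeholder list, while `ip_is_allowed("198.51.100.1")` does not. The check is only as good as the range list behind it.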
Google was fortunate that no one sued them for these things before they got big enough to defend themselves. Many tech entrepreneurs haven't been so lucky.
You may ask why a big company like Google isn't doing more to change the CFAA or copyright law. The reason is that, now that they're big enough, legal grey areas like those in the CFAA (particularly, "what is unauthorized access?", because it's not defined by the statute) can be fully exploited, and Google can sit secure in the knowledge that they'll never be realistically challenged on it; meanwhile, they can then threaten potential competitors for doing the same thing, since a lawsuit against a public corporation takes 10 years and $5MM-$20MM. Anyone who could mount that kind of offense against Google won't, because they benefit from the grey area too; they'll just make some backroom deal with Google and not lose their lucrative, competition-destroying ability to do things that companies with sub-$100M revenues aren't able to do.
From the point of view of the law, that might not matter, but when there's a standardized way to make clear to bots that they're not welcome and you didn't bother to implement it, you'll look pretty silly if you complain.
If someone hasn't exploited a security bug (I mean a real security bug, like a buffer overflow, I don't consider behaviour such as serving up content to certain User-Agents only a genuine "security bug"), and they haven't bruteforced/cracked/acquired a password or private key, and they aren't sending unreasonable amounts of traffic ((D)DOS), it should not be a crime, and the law should be changed to reflect that principle. The law should reflect the common sense of the technically literate, but it doesn't, because it was written by the technically illiterate.
This is the spot we've reached through legislative meddling, and your best bet is either to be a good little consumer and not make waves, or to do what you want and not get caught. Neither of those seems to make a lot of sense in the long run.
So yes, congrats. It's jail for you -- but not for Google. Because they're Google, and you, well, you're not. You'd just better be happy we don't find out about you rooting your cellphone last year. Good grief.
ADD: I know you meant well, and I appreciated the </s> tag, but there was something I didn't like about your comment. Now I know what it is. By making this a big deal, you're increasing the likelihood that this poor schmuck becomes the next "example" some federal prosecutor decides to make. It's not your fault, but it still sucks. Let's hope that doesn't happen.
Edit: I was mis-remembering, the current law is against possession or manufacture of eavesdropping or wiretapping devices, not hacking tools. The EU has been playing with laws against hacking tools, but apparently nothing in the US yet against it.
The law makes it illegal to distribute devices (incl. software) where "the design of such [software] renders it primarily useful for the purpose of the surreptitious interception of wire, oral, or electronic communications." Punishable by not more than 5 years in prison and/or a fine of not more than $250,000. 18 U.S.C. 2512.
I don't think this blog post qualifies as an "interception" device,... however unauthorized retrieval and recording of another's voice mail messages constitutes an "interception" so who the hell knows. I'm sure you could find a US DA who would argue the falsified User-Agent meant the software is designed to "intercept" communication meant only for Google.
And by "may", I do mean "may". I don't know. But it's at least possible.
Here's how HTTP(S) works: I issue a REQUEST to the web server; the web server RESPONDS with it, or denies it. It is up to the web server to respond or deny or do whatever it wants. If the web server is badly implemented or doesn't know what it's doing, it is the web server's fault.
Remember: it's just a request. I can request 100 dollars from you; the fact that you give them to me does not make me a mugger.
So you go up to a server and lie to it, and it gives you something; is that not acquiring things through deception?
The structure of your argument suggests that e.g. breaking into an ssh server by issuing a login request with a known password which it responds to, isn't illegal. And further, that if data is acquired from the server, there is still no crime - the ssh protocol too is just requests to the server, it's all bits down the line. It's clearly nonsense.
Whether the lock was implemented poorly or you just didn't lock it — doesn't matter.
You're talking about a sign with practically no legal meaning.
This discussion is about a law which does have legal meaning.
To fix your analogy, it would be "If there's a law that says only explicitly authorized cars are allowed in a car wash; else 5 years in prison, and I have a car wash and say 'only red cars' ...". Of course, that still doesn't properly capture this since the intent of the law also matters and that law would be senseless, and so interpretation would be less obvious.
The better analogy is the one below about breaking into homes being illegal, but what if you happen to have a key that fits the lock? (though that also is a bad analogy in its own way).
Basically, making physical analogies for technical matters is rarely correct. It's often the best way to convince non-technical people of a matter without them needing to actually understand it.
The law says "authorized access" and the WSJ authorized Google to access their content in order to index it. The WSJ did not authorize that content to be presented, for free, to you, an end user, necessarily. I don't know which way a court would rule on it, but it's definitely not black-and-white.
Sure, it's technically similar, but the court doesn't care. The court doesn't care if the law makes no technical sense, because it's a law, not a program.
(See the second precept here: https://en.wikipedia.org/wiki/Five_Precepts)
Morally, I'm not sure.
1. If they allowed all bots but disallowed all non-bots, that would raise questions of what defines a "bot". Couldn't I write a personal bot that fetches the story for me? As a browser addon, even?
2. It's even more complex since allowing bots means they allow tools that provide the information to third parties, as the bots are not intended for private use by the bot maker. So the door is already open.
3. But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot? Try to use the "web link" trick from HN on any other search engine, and it doesn't work in my experience. That seems anti-competitive and discriminatory in favor of the existing dominant entity in this space, Google.
Maybe, but I think it's a pretty easy distinction. They aren't even allowing all bots - they're allowing a whitelist of them. You're not just writing your own bot to get around it, you're pretending to be someone else's bot.
> But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot?
That's the really important question. I personally have no context for answering except to say that I can see both sides argued. If you view their website as a physical store / private establishment, then I assume that they have every right to establish who has access to what and under what conditions.
Of course, that hampers a lot of legitimate use cases along the way.
The only loss is the energy/bandwidth/cycles WSJ servers spent answering your request. Which, I believe, has been the basis of computer "fraud" cases.
This can't be true. Surely the argument for why, say, a WSJ-paywall-bypassing-tool causes damage (in the legal sense) to WSJ is that it allows people who would otherwise pay for content to get it for free, thus depriving WSJ of income.
Moreover, I don't think prosecutors need to prove that you caused harm in order to charge you with computer fraud, since, for example, CFAA falls under criminal law.
More importantly, is violating "Google's rules" suddenly a violation of law?
I'm not taking, I am simply absorbing information. That information will still be there when I am done reading it. Have I really stolen, or did I just refuse to give someone money on demand?
What people choose to do with the information is another story...
See also: http://www.apple.com/customer-letter/
:)
'Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard-coded them, so you must run a DNS lookup as described next.'
[1]: https://support.google.com/webmasters/answer/66355?hl=en
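For the curious, the DNS lookup described in that quote is usually done as a reverse lookup followed by a forward lookup that must point back at the same address. Here is a minimal sketch of that idea, not Google's reference implementation. The function name, the injectable resolver parameters, and the exact suffix list are my assumptions; check the linked help page for the hostnames Google actually uses.

```python
import ipaddress
import socket

# Assumed crawler hostname suffixes; re-check against the linked help page.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a forward-confirmed reverse DNS check."""
    # Resolvers are injectable so the logic can be exercised without a network.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: {
            info[4][0] for info in socket.getaddrinfo(host, None)
        }
    try:
        target = ipaddress.ip_address(ip)
        # Step 1: PTR lookup, then require a trusted domain suffix.
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # Step 2: the claimed hostname must resolve back to the same IP.
        return any(ipaddress.ip_address(a) == target for a in forward_lookup(host))
    except (OSError, ValueError):
        # Lookup failures and unparseable addresses count as "not verified".
        return False
```

A real deployment would want to cache results, since doing two DNS lookups per request gets expensive quickly. The User-Agent header plays no part in the check; it only tells you which requests are worth verifying.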
https://support.google.com/news/publisher/answer/74288?hl=en
> News-specific tag definitions
> Yes, if access is not open, else should be omitted
> Possible values include "Subscription" or "Registration", describing the accessibility of the article. If the article is accessible to Google News readers without a registration or subscription, this tag should be omitted.
EDIT: The "paste a headline into Google" trick still works for me, though. If this continues to be the case, they will keep indexing, of course.
So people can find it? I'd be pissed if Google de-indexed something like IEEE because it has a paywall.
Assuming the internet has to be freely available is a mistake. Especially with the continued growth of an adblocked internet. We could be facing an internet with significant paywalls in the future.
I'd support a "free" search term to weed out paywalled results.
Furthermore, Google shouldn't be making normative judgements about what people should see. It's an abuse of their monopoly.
WSJ is free to institute a full paywall and only serve snippets to Google. They might not like what it does to their rankings though.
What they cannot do is continue to sniff the UA before deciding to put up the paywall. (Though I'm still able to use the Google trick, so it seems the experiment might have ended.)
And yes, this violates Google's policies laid out explicitly at https://support.google.com/news/publisher/answer/40543?hl=en
I then pasted the headline into google and clicked on it from Google results and did not get hit by the paywall.
Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles). They want me to pay, and they want me to see ads, and they want to track my behavior? Should I send them my DNA also?
Organizations like WSJ are exactly the disease that causes ad blockers to proliferate and ruin the web for all the decent publishers. They're at war with my privacy (by breaking their site intentionally when I visit with a blocker on). They want it all, ads, tracking, your private data, and subscription revenue, not to mention...
# Agenda-Driven Content
I mean, we're basically talking about NBC or Fox here, just on the web. Imagine every morning when you woke up you turned on the television and tuned to some "news" show. After talking about the weather, they start talking about a lost pickle that is thought to be potentially alive and moving about with free will. Over the next two years, talk about the same pickle extends to every other TV show. Before you know it, everybody in the nation is talking about the same pickle. Years go by, and that pickle has become a part of our society, and that's not because people are born with an innate care for the well-being of pickles, but because "news" shows taught them to care.
That's not a good position to be in. I have to believe I'm not the only one in here that doesn't watch any TV. So, why do we all treat the same media giants differently on the web? We crave their content so much that we build browser add-ons to get to their content, etc.
You aren't entitled to WSJ.com, NBC, or Fox.
I don't actually see what you're referring to; maybe it's because I get redirected to http://www.wsj.com/europe. Maybe I have a different ad-blocker. Either way, it reminds me somewhat of NME's [1] homepage (New Musical Express, a popular music publication; not sure if it's really known outside of the UK). They deliver their images in such a way that they fall foul of my ad-blocker, although I haven't looked in enough depth to be certain whether this is a way of preventing ad-blockers, or purely unintentional.
Uh, what? Using uBlock origin, when I visit wsj.com I get what looks like a perfectly normal page. Nothing is scrambled at all.
That being said I do enjoy their content, save for maybe the op-eds.
They all seem to want to sell subscriptions, which are perpetual and probably difficult to cancel.
The pricing here is much too aggressive
I don't know if it's available in the US yet but they are at least planning to launch in the near future.
Otherwise, why would expertsexchange be obligated to provide the answers at the very bottom? Did something change?
Those signatures could obviously leak, but on a per-domain basis. Perhaps the domains could have a secure way of bumping the valid key generation if they had a leak.
First, they don't want to. In fact, if a search engine can figure out that a link is going to lead to a paywall, they'll probably want to reduce the ranking of the result, because the user is not going to want results they can't actually look at.
Second, it would be a massive antitrust violation because it would prevent access by competing crawlers. The only way around that is to allow access to anyone who claims they're a crawler, which was the original problem.
https://support.google.com/webmasters/answer/80553?hl=en
AFAIK no content provider actually does this check though.
Also, isn't it illegal to bypass computer security?
Their server can choose to do what it wants with your request and you can choose what to do with the response it sends.
Are User-Agent headers legally protected identities?
Then again, lots of sci-fi dystopias are dreams of an automated law that somehow destroys the fabric of society, so...
I actually made a bookmarklet with the following pasted into the URL, so you can do it in a single click:
javascript:void(open('https://archive.is/?run=1&url='+encodeURIComponent(document....)
I've used them to save Facebook posts before, and the pages were logged in to some "Nathan" IIRC. They probably have a bunch of hacks for specific sites that needed fixing.
That is completely idiotic if there is a string you can put in a Mozilla browser config that is literally illegal to browse the web with.
I thought that google deemed providing search results which were behind paywalls as a "bad experience" for their search users, and would penalize websites for doing so.
Is this no longer the case?
For the second point, Google does require that publishers specify "registration required" in their sitemap.
Also, Google has published IP addresses it uses, so this extension might not last long...
So does it mean that Google will no longer index full WSJ articles or does it mean a change in the Google's policy?
Since this is billed as an "experiment" I'm guessing that WSJ is just testing the waters. If they roll it out to everyone, they will have to serve only snippets to Google or risk getting delisted.
[1] https://support.google.com/news/publisher/answer/40543?hl=en
I think the relevant point, underscored by the author's last sentence, is it doesn't matter who you open a back door for - it opens the possibility for anyone to barge through.
2. The idea that this is somehow new is wrong. The way for a server to identify crawlers has "always" been to look at the user-agent and, when done right, the IP, verified either by net block owner or by doing a PTR lookup and then checking that the A or AAAA record for the claimed host points back at the same IPv4 or IPv6 address. I do agree that paywalling is a more recent phenomenon, at least with regard to the extent to which it is popular among sites today. But the concept of presenting different data to crawlers and visitors arose much earlier; Google has been aware of it and has made sure to delist such sites when found. In fact, Google has since moved a bit in the direction of allowing it, in that they do so for Google News if declared, as explained by others ITT.
So in my view, it seems that the author is jumping to incorrect conclusions based on an incomplete understanding of what's actually going on here. What then about the HN readership, how come this article became so highly voted and I don't see these issues raised by anyone else? Or maybe I'm just crazy?
Don't nitpick. It's just a shortened version of How To "Be" a Google’s Web Crawler to Bypass Paywalls. You get it. I get it. Everyone gets it.
Additionally, I would like to point out that I wrote a Varnish extension for the express purpose of validating User-Agent strings through DNS lookups, and it is available here: https://github.com/knq/libvmod-dns
It was built because we had specifically a problem with bad bots crawling a large site (multiply.com) and this was one of the easiest ways to filter out the bad bots from the good, and to enforce robots.txt policies on a per bot basis. It works very well, as you can do any kind of DNS caching internally and prevent this kind of behavior, if that's your goal.
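To make "robots.txt policies on a per bot basis" concrete, here is a small sketch using Python's standard `urllib.robotparser`. It is not how the Varnish module works; the robots.txt content and the `bot_may_fetch` helper are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler may fetch everything except
# /private/, every other bot is turned away entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def bot_may_fetch(user_agent, path):
    """Return True if the parsed policy lets `user_agent` fetch `path`."""
    return parser.can_fetch(user_agent, path)
```

With this policy, `bot_may_fetch("Googlebot", "/articles/1")` is allowed, while any other bot name is refused. This only keys on the name the client claims, which is why the DNS validation is needed before trusting that name.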
Are they running afoul of Google policies and going to get pinged by Google?
I can't find the text from Google now (when can you ever find any docs at google?), but I am very certain I remember reading from them that you may not return different content to GoogleBot based on User-Agent.
coughNSAcough
Not an SEO Expert here, but wonder how and whether Google will end up handling that. I mean making an exception could also be considered abuse of power in some countries of the world. Don't have any strong opinion yet on that, just saying that because of how the EU exercised certain laws in recent years.
User-agents are notoriously unreliable.
Content providers register a (yet-to-be-written) Google News API account, get an API key, with which Google indexes the site and the site recognizes as legit.
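One way such a scheme might work, purely as a sketch: the crawler signs each request with the shared API key, and the site recomputes the signature before serving the full article. No such Google News API exists as far as this thread knows, so the function names, the signed fields, and the five-minute skew window are all assumptions.

```python
import hashlib
import hmac


def sign_request(api_key: bytes, method: str, path: str, timestamp: int) -> str:
    """Crawler side: HMAC-SHA256 over the request line and a timestamp."""
    message = f"{method}\n{path}\n{timestamp}".encode()
    return hmac.new(api_key, message, hashlib.sha256).hexdigest()


def is_legit_crawler(api_key: bytes, method: str, path: str, timestamp: int,
                     signature: str, now: int, max_skew: int = 300) -> bool:
    """Site side: accept only fresh requests carrying a valid signature."""
    # Reject stale or future-dated requests to limit replay of old signatures.
    if abs(now - timestamp) > max_skew:
        return False
    expected = sign_request(api_key, method, path, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Unlike a User-Agent string, the key never travels with the request, so seeing one crawler request does not let you forge the next one. A leaked key would still need to be rotated per site.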
Great idea here guys