How Google’s Web Crawler Bypasses Paywalls

(elaineou.com)

640 points · elaineo · 10y ago · 232 comments

141 comments · 44 top-level

zaroth10y ago· 25 in thread

And congratulations, you have likely just "exceeded authorized access" and committed a felony violation of the CFAA punishable by a fine or imprisonment for not more than 5 years under 18 U.S.C. § 1030(c)(2)(B)(i).

From the ABA: "Exceeds authorized access" is defined in the Computer Fraud and Abuse Act (CFAA) to mean "to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter."

To prove you have committed this terrible felony, the FBI will now demand that Apple assist in disabling the secure enclave of your device in order to access your browser history. But remember, they only need to do this because they aren't allowed to MITM all TLS and "acquire" -- not "collect" -- every HTTP request your machine ever makes. </s>

jrockway10y ago

User agent strings have a long history of being intentionally misleading. IE 11 claims to be "Mozilla/5.0". Chrome claims to be "Safari/537.36". The User-Agent string is all lies, and has been ever since the first site started doing UA sniffing.
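
To make the "all lies" point concrete, here is a minimal Python sketch that lists the product names a User-Agent string advertises. The two strings are typical of IE 11 and Chrome of that era; the exact version numbers and the `product_tokens` helper are illustrative assumptions, not anything from the article.

```python
import re

# Illustrative User-Agent strings (version numbers are examples only).
IE11_UA = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"
)

def product_tokens(user_agent: str) -> list[str]:
    """Return the product names (the part before each '/') in a UA string."""
    # Drop parenthesised comments first so their contents are not counted.
    without_comments = re.sub(r"\([^)]*\)", " ", user_agent)
    return re.findall(r"([A-Za-z]+)/[\w.]+", without_comments)

print(product_tokens(IE11_UA))    # ['Mozilla']
print(product_tokens(CHROME_UA))  # ['Mozilla', 'AppleWebKit', 'Chrome', 'Safari']
```

Neither string names only the browser that actually sent it, which is why sniffing on a single token is fragile.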

zaroth10y ago

It's intent that matters. Setting user-agent in order to properly render a page is legal. Setting a user-agent string to gain access to otherwise unauthorized content is probably not.

antsar10y ago

A decent Friday afternoon read on that topic:

http://webaim.org/blog/user-agent-string-history/

jd310y ago

I couldn't believe my eyes when I saw that netflix started sniffing the UA. You can't even play a video in SeaMonkey 2.39 without the incredibly stupid

general.useragent.override.netflix.com;Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0

in about:config... what were they thinking?

dpweb10y ago

Pretty obvious weak move restricting by UA.

Allow by IP range? You can probably find a somewhat accurate range for Google and whoever's crawlers.
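
A rough sketch of what an allow-by-IP-range check could look like in Python. The networks below are documentation-only ranges used as placeholders, not real crawler addresses, and `is_allowed` is a hypothetical helper.

```python
import ipaddress

# Hypothetical allowlist. 192.0.2.0/24 and 2001:db8::/32 are reserved for
# documentation; they stand in for whatever ranges a site decides to trust.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_allowed(remote_addr: str) -> bool:
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(remote_addr)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(addr in net for net in ALLOWED_NETWORKS)

print(is_allowed("192.0.2.10"))    # True
print(is_allowed("198.51.100.7"))  # False
```

The weak point is keeping the list current, since a crawler operator can change its ranges at any time.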

cookiecaper10y ago

Google's _entire business_ is based not only on accessing servers without authorization, frequently in explicit violation of that site's terms of use because ToU boilerplate includes language excluding all "crawlers, bots, or spiders", but also on flagrant violation of copyright law. They save the entire text of the web page on their servers when they crawl it (unlicensed copying), rehost it in Google Cache (unlicensed redistribution), save all the images and rehost them on Google Images (both), and so forth. All of this is absolutely illegal under current copyright law.

Google was fortunate that no one sued them for these things before they got big enough to defend themselves. Many tech entrepreneurs haven't been so lucky.

You may ask why a big company like Google isn't doing more to change the CFAA or copyright law. The reason is now that they're big enough, legal grey areas like those in the CFAA (particularly, "what is unauthorized access?", because it's not defined by the statute) can be fully exploited, and Google can sit secure in the knowledge that they'll never be realistically challenged on it; meanwhile, they can then threaten potential competitors for doing the same thing, since a lawsuit against a public corporation takes 10 years and $5MM-$20MM. Anyone who could mount that kind of offense against Google won't, because they benefit from the grey area too; they'll just make some backroom deal with Google and not lose their lucrative, competition-destroying ability to do things that companies with sub-$100M revenues aren't able to do.

roywiggins10y ago

If those sites don't want spiders, they can just specify that in robots.txt, which Google honors, right?

From the point of view of the law, that might not matter, but when there's a standardized way to make clear to bots that they're not welcome and you didn't bother to implement it, you'll look pretty silly if you complain.
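
For reference, this is roughly how a well-behaved crawler would consult robots.txt, sketched with Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: Googlebot may crawl everything except /private/,
# every other crawler is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/articles/1"))     # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))      # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # False
```

Note that robots.txt is purely advisory: it tells a bot what the site wants, and it is up to the bot to comply.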

dosaygo10y ago

Okay, there are other aspects to Google's business besides information retrieval and search, tho the point that the access is mostly unauthorized is valid. Although the damage done by this is to some extent offset by Google's status essentially as a public utility: it universally provides a social good, "search", and the tax we pay is advertising. Your frustration for the "little guys" is mostly on point, since the damage done is rendered then mostly to competitors, rather than to customers, and the argument, true or no, that Google's search is "lightyears" ahead of other possible offerings, to some extent offsets the damage to consumers due to loss of competition through the possibly anti-competitive practices you highlight. So it seems the balance of good is in the favor of consumers of "search". I think this is the main force, rather than any "structural obstacles" to competition, which is the cause of the persistence of the status quo in this market. The thing about this which no one seems to see is that, since everyone is taking information dishonestly, there's a huge opportunity to actually "bring it into the light", do it honestly, and strike some kind of deals with content creators.

skissane10y ago

Doesn't this just prove that the CFAA (and similar laws in other jurisdictions) is massively over-broad and needs to be narrowed? I'm surprised there is not more campaigning for CFAA repeal and replacement with far narrower legislation.

If someone hasn't exploited a security bug (I mean a real security bug, like a buffer overflow, I don't consider behaviour such as serving up content to certain User-Agents only a genuine "security bug"), and they haven't bruteforced/cracked/acquired a password or private key, and they aren't sending unreasonable amounts of traffic ((D)DOS), it should not be a crime, and the law should be changed to reflect that principle. The law should reflect the common sense of the technically literate, but it doesn't, because it was written by the technically illiterate.

swanson10y ago

I wonder if a user-agent like "Not a Googlebot" would a) be allowed access (the check is probably regex based) and b) be truthful enough to offer plausible deniability.
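
If the server-side check really is a simple pattern match, the idea would work. A small Python sketch, assuming a hypothetical case-insensitive search for "googlebot" (nothing in the thread confirms what the site actually does):

```python
import re

# Hypothetical server-side check; the real one is unknown.
CRAWLER_PATTERN = re.compile(r"googlebot", re.IGNORECASE)

def looks_like_googlebot(user_agent: str) -> bool:
    """Naive check: does the UA string mention Googlebot anywhere?"""
    return CRAWLER_PATTERN.search(user_agent) is not None

print(looks_like_googlebot("Not a Googlebot"))  # True
print(looks_like_googlebot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # True
print(looks_like_googlebot("Mozilla/5.0 (X11; Linux x86_64) Firefox/44.0"))  # False
```

A substring match cannot tell a denial from a claim, which is the whole trick.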

oneeyedpigeon10y ago

What is your purpose in setting your user-agent string to that value, other than cleverly bypassing the paywall?

DanielBMarkham10y ago

As this entire thread has pointed out, the law makes no sense at all here. So you might as well say "Congratulations! You're going to jail! And I don't have to tell you why!" because none of the details matter anyway.

This is the spot we've reached through legislative meddling, and your best bet is either to be a good little consumer and don't make waves or do what you want and don't get caught. Neither of those seem to make a lot of sense in the long run.

So yes, congrats. It's jail for you -- but not for Google. Because they're Google, and you, well, you're not. You'd just better be happy we don't find out about you rooting your cellphone last year. Good grief.

ADD: I know you meant well, and I appreciated the </s> tag, but there was something I didn't like about your comment. Now I know what it is. By making this a big deal, you're increasing the likelihood that this poor schmuck becomes the next "example" some federal prosecutor decides to make. It's not your fault, but it still sucks. Let's hope that doesn't happen.

pbosko10y ago

How does that law apply to foreigners?

ec10968510y ago

Who has? Surely not the person who wrote the tutorial.

zaroth10y ago

Even worse, the poor author has created a hacking tool capable of enabling said felony, which I believe could get them 10 or 20 years... I'm looking for the statute now.

Edit: I was mis-remembering, the current law is against possession or manufacture of eavesdropping or wiretapping devices, not hacking tools. The EU has been playing with laws against hacking tools, but apparently nothing in the US yet against it.

The law makes it illegal to distribute devices (incl. software) whose design renders them primarily useful for the purpose of the surreptitious interception of wire, oral, or electronic communications. Punishable by not more than 5 years and/or not more than $250,000. 18 U.S.C. 2512.

I don't think this blog post qualifies as an "interception" device,... however unauthorized retrieval and recording of another's voice mail messages constitutes an "interception" so who the hell knows. I'm sure you could find a US DA who would argue the falsified User-Agent meant the software is designed to "intercept" communication meant only for Google.

jerf10y ago

Under CFAA, I don't know. The DMCA may have some problems with that blog post, though.

And by "may", I do mean "may". I don't know. But it's at least possible.

1024core10y ago

I know you were kidding, but I still don't understand this "exceeds access" bullshit.

Here's how HTTP(S) works: I issue a REQUEST to the web server; the web server RESPONDS to it, or denies it. It is up to the web server to respond or deny or do whatever it wants. If the web server is badly implemented or doesn't know what it's doing, it is the webserver's fault.

Remember: it's just a request. I can request 100 dollars from you; the fact that you give them to me does not make me a mugger.
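
To show how little is in "a request", here is a Python sketch that builds the text of a minimal HTTP/1.1 GET. The `build_request` helper, host, and path are placeholders for illustration; the point is only that the User-Agent line is whatever the client decides to write.

```python
def build_request(host: str, path: str, user_agent: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        # Chosen freely by the client; the server cannot verify it from the text alone.
        f"User-Agent: {user_agent}",
        "Connection: close",
        "",  # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)

print(build_request("example.com", "/articles/1", "AnyStringAtAll/1.0"))
```

Whether sending a particular string is lawful is the legal question argued in the replies; the protocol itself has no opinion.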

barrkel10y ago

What if you tell me my car's broken when it's not and request 100 dollars from me to fix it? That would be fraud.

So you go up to a server and lie to it, and it gives you something; is that not acquiring things through deception?

The structure of your argument suggests that e.g. breaking into an ssh server by issuing a login request with a known password which it responds to, isn't illegal. And further, that if data is acquired from the server, there is still no crime - the ssh protocol too is just requests to the server, it's all bits down the line. It's clearly nonsense.

maxerickson10y ago

If you request the 100 dollars with a specially crafted piece of paper that I glance at and believe is legitimate, you are committing a crime (check fraud). Which whatever, let's not focus on the analogy. Whether you like it or not, the law doesn't necessarily view the valid server response as authorization to access the requested url, it examines what you were thinking when you created the url.

prebrov10y ago

And one more car analogy. If someone comes up to your car, pulls the door (issues a request) and based on the fact that it opens (responds 200), drives away in it, that would be seen as grand theft auto.

Whether the lock was implemented poorly or you just didn't lock it — doesn't matter.

fiatmoney10y ago

If I run a car wash with a sign, "red cars washed free", and you paint your car red, have you defrauded me? You're not really a red car, you're just pretending!

wmeredith10y ago

Oh pish posh. If you forge a key that fits the lock on my front door, yeah, you're breaking the law. The lock itself is not an advertisement to come in if you can unlock it.

TheDong10y ago

Terrible analogy.

You're talking about a sign with practically no legal meaning.

This discussion is about a law which does have legal meaning.

To fix your analogy, it would be "If there's a law that says only explicitly authorized cars are allowed in a car wash; else 5 years in prison, and I have a car wash and say 'only red cars' ...". Of course, that still doesn't properly capture this since the intent of the law also matters and that law would be senseless, and so interpretation would be less obvious.

The better analogy is the one below about breaking into homes being illegal, but what if you happen to have a key that fits the lock? (though that also is a bad analogy in its own way).

Basically, making physical analogies for technical matters is rarely correct. It's often the best way to convince non-technical people of a matter without them needing to actually understand it.

hughes10y ago

Is that significantly different from viewing google's cached result of their legitimate access? The result is the same.

TheDong10y ago

The difference is intent!

The law says "authorized access" and the WSJ authorized Google to access their content in order to index it. The WSJ did not authorize that content to be presented, for free, to you, an end user, necessarily. I don't know which way a court would rule on it, but it's definitely not black-and-white.

Sure, it's technically similar, but the court doesn't care. The court doesn't care if the law makes no technical sense, because it's a law, not a program.

mbroshi10y ago· 9 in thread

Am I alone in feeling like this is akin to a tutorial on how you can shoplift without getting caught? WSJ, for better or worse, does not want to give you content without your paying for it. If you take that content without paying, you are stealing. Just because you have figured out how to get past their security does not mean it's not stealing.

(See the second precept here: https://en.wikipedia.org/wiki/Five_Precepts)

azakai10y ago

Legally it might, I don't know.

Morally, I'm not sure.

1. If they allowed all bots but disallowed all non-bots, that would raise questions of what defines a "bot". Couldn't I write a personal bot that fetches the story for me? As a browser addon, even?

2. It's even more complex since allowing bots means they allow tools that provide the information to third parties, as the bots are not intended for private use by the bot maker. So the door is already open.

3. But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot? Try to use the "web link" trick from HN on any other search engine, and it doesn't work in my experience. That seems anti-competitive and discriminatory in favor of the existing dominant entity in this space, Google.

mankyd10y ago

> If they allowed all bots but disallowed all non-bots, that would raise questions of what defines a "bot".

Maybe, but I think it's a pretty easy distinction. They aren't even allowing all bots - they're allowing a white list of them. You're not just writing your own bot to get around it, you're pretending to be someone else's bot.

> But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot?

That's the really important question. I personally have no context for answering except to say that I can see both sides argued. If you view their website as a physical store / private establishment, then I assume that they have every right to establish who has access to what and under what conditions.

Of course, that hampers a lot of legitimate use cases along the way.

njharman10y ago

No content is being given or taken. This is restriction on distribution / copying. Massively different ethics than stealing despite what the Copyright Cabal wants you to think.

The only loss is the energy/bandwidth/cycles WSJ servers spent answering your request. Which, I believe, has been basis of computer "fraud" cases.

zodiac10y ago

> Which, I believe, has been basis of computer "fraud" cases.

This can't be true. Surely the argument for why, say, a WSJ-paywall-bypassing-tool causes damage (in the legal sense) to WSJ is that it allows people who would otherwise pay for content to get it for free, thus depriving WSJ of income.

Moreover, I don't think prosecutors need to prove that you caused harm in order to charge you with computer fraud, since, for example, CFAA falls under criminal law.

ikeboy10y ago

WSJ is violating Google's rules referenced here: https://support.google.com/news/publisher/answer/40543?hl=en

mod10y ago

So? Is this a "two wrongs make a right?"

More importantly, is violating "Google's rules" suddenly a violation of law?

RIMR10y ago

> abstain from taking what is not given

I'm not taking, I am simply absorbing information. That information will still be there when I am done reading it. Have I really stolen, or did I just refuse to give someone money on demand?

elaineoOP10y ago

I don't disagree with you. But, given the nature of this forum, I think that the information content has merit.

What people choose to do with the information is another story...

jhildings10y ago

Stealing would imply that they are no longer in possession or in control of the content, which of course is false. Copying illegally, however, yes.

lloyddobbler10y ago· 8 in thread

"Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody."

m52go10y ago

I came back here to post this line! It's so perfect.

matheweis10y ago

I was going to post here that there were ways around that; proper security, cryptographic access control... And then I saw the light. ;)

elaineoOP10y ago

Haha! Thanks for catching that :)

NKCSS10y ago

Google also specifies the IP ranges of their bots; just UA checking is sloppy.

queeerkopf10y ago

Hmmm, the following seems to contradict that; instead google recommends verification by DNS lookup: https://support.google.com/webmasters/answer/80553

'Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard-coded them, so you must run a DNS lookup as described next.'
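
The DNS check described in that quote can be sketched in Python as a reverse lookup followed by a forward lookup. The function names are hypothetical, and the accepted suffixes (`.googlebot.com`, `.google.com`) reflect my reading of Google's guidance rather than anything quoted here, so treat them as an assumption. The resolvers are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, List

def default_reverse(ip: str) -> str:
    """Reverse (PTR) lookup: IP address -> host name."""
    return socket.gethostbyaddr(ip)[0]

def default_forward(host: str) -> List[str]:
    """Forward lookup: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]

def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = default_reverse,
    forward: Callable[[str], List[str]] = default_forward,
) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm the name maps back."""
    try:
        host = reverse(ip)
    except OSError:  # socket.herror / socket.gaierror are OSError subclasses
        return False
    # Assumed suffixes; adjust to whatever the crawler operator documents.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls reverse DNS for their own address block can make a PTR record claim any name they like.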

melted10y ago

Except of course what FBI is proposing wouldn't give access to "anybody". Just unfettered access for FBI, CIA and NSA by way of gag orders and national security letters, forcing Apple to break security of their hardware and not speak about it publicly. Of note is the fact that it wouldn't even give direct access to FBI, since they don't have the firmware keys (yet).

kbenson10y ago

In this example, the FBI is the "trusted third party", but by giving them access, we inevitably open access for everyone, as the system is no longer strongly secure. The trusted third party in the quote isn't asking for access for everybody either, but in the end that's what happens.

interpol_p10y ago

Yeah, but if you don't trust those companies to be secure and to maintain that security throughout all their employees and contractors (see: Snowden), then you must assume giving someone access is the same as giving everyone access.

slig10y ago· 6 in thread

If they're now blocking clicks from Google, doesn't that mean that they're cloaking and violating Google's Webmaster Guidelines [1]?

[1]: https://support.google.com/webmasters/answer/66355?hl=en

elaineoOP10y ago

Google is not okay with cloaking, but they will whitelist publishers if the publisher specifically includes a parameter that declares if the site requires registration or subscription. This is done in the sitemap.

https://support.google.com/news/publisher/answer/74288?hl=en

ikeboy10y ago

https://support.google.com/news/publisher/answer/40543?hl=en seems to specifically ban this. WSJ is in violation, not fitting any of the categories there.

slig10y ago

I didn't know that, thanks. But reading, it seems that option is about Google News, not the main Google Search.

> News-specific tag definitions

> Yes, if access is not open, else should be omitted

> Possible values include "Subscription" or "Registration", describing the accessibility of the article. If the article is accessible to Google News readers without a registration or subscription, this tag should be omitted.
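
For context, a sketch of what such a sitemap entry might look like, generated with Python's `xml.etree.ElementTree`. The tag name `news:access` and the news namespace URI are recalled from the Google News sitemap format of the time and are not quoted in this thread, so treat both as assumptions; other elements a real news entry needs are omitted.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed namespace for the Google News sitemap extension.
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("news", NEWS_NS)

def news_url_entry(loc, access=None):
    """Build a <url> element; add the access tag only when access is restricted."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
    if access is not None:  # omitted entirely for freely readable articles
        ET.SubElement(news, f"{{{NEWS_NS}}}access").text = access
    return url

entry = news_url_entry("https://example.com/articles/1", access="Subscription")
print(ET.tostring(entry, encoding="unicode"))
```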

pacquiao88210y ago

WSJ is big enough to negotiate their own terms with Google Search.

CM3010y ago

Yes, and I suspect that's what's going to get them to change it back again. After a ban from Google and the traffic drop it'll likely bring, that paywall is likely coming right back down. An awful lot of media companies made similar mistakes before, and it's always ended with them quickly removing their 'work arounds'.

Implicated10y ago

Yea, except they're not going to get that ban.

metafunctor10y ago· 5 in thread

I'm pretty sure Google will soon stop indexing WSJ. Why index something if the vast majority of users cannot access the pages behind the links?

EDIT: The "paste a headline into Google" trick still works for me, though. If this continues to be the case, they will keep indexing, of course.

rhino36910y ago

>Why index something if the vast majority of users cannot access the pages behind the links?

So people can find it? I'd be pissed if Google de-indexed something like IEEE because it has a paywall.

Assuming the internet has to be freely available is a mistake. Especially with the continued growth of an adblocked internet. We could be facing an internet with significant paywalls in the future.

I'd support a "free" search term to weed out paywalled results.

Furthermore, Google shouldn't be making normative judgements about what people should see. It's an abuse of their monopoly.

morgante10y ago

Google doesn't forbid or de-index paywalls. What they forbid is cloaking (showing Google different content than what users will see). This is, of course, quite critical to maintaining search quality.

WSJ is free to institute a full paywall and only serve snippets to Google. They might not like what it does to their rankings though.

What they cannot do is continue to sniff the UA before deciding to put up the paywall. (Though I'm still able to use the Google trick, so it seems the experiment might have ended.)
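
A crude way to test for cloaking as defined above is to fetch the same URL with a crawler UA and a browser UA and compare the results. A Python sketch, where `fetch` is a caller-supplied function (hypothetical, so no particular HTTP library is assumed) and the fake server exists only to demonstrate the idea:

```python
from typing import Callable

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0"

def appears_cloaked(url: str, fetch: Callable[[str, str], str]) -> bool:
    """True if the body served to a crawler UA differs from the browser one."""
    return fetch(url, GOOGLEBOT_UA) != fetch(url, BROWSER_UA)

def fake_paywalled_fetch(url: str, user_agent: str) -> str:
    """Stand-in for a real HTTP client: full text for Googlebot, teaser otherwise."""
    return "full article text" if "Googlebot" in user_agent else "teaser... subscribe"

print(appears_cloaked("https://example.com/articles/1", fake_paywalled_fetch))  # True
```

An exact comparison is too strict for real pages, which differ between requests for benign reasons (ads, timestamps), so a practical check would compare extracted article text instead.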

ikeboy10y ago

It doesn't work for me. They did say they're "testing" it, so maybe A/B testing conversion rates.

And yes, this violates Google's policies laid out explicitly at https://support.google.com/news/publisher/answer/40543?hl=en

Joky10y ago

Indexing is fine, a great feature would be if Google was able to show it only to the user that can access it.

jsprogrammer10y ago

Why should Google manage WSJ's paywall?

jgh10y ago· 5 in thread

I just tried clicking on "Harper Lee, Author of ‘To Kill a Mockingbird,’ Dies at Age 89" from wsj.com's homepage and got the paywall.

I then pasted the headline into google and clicked on it from Google results and did not get hit by the paywall.

zem10y ago

they're probably doing some sort of a/b testing by selectively letting some clicks through

adamrights10y ago

This is basically true ^^

e4010y ago

Mine did, which surprised me, and I did exactly the same thing.

creativityhurts10y ago

I did the same with the same article and got paywalled from Google results.

gsibble10y ago

I also checked the paywall and the old trick still works for me. Odd.

mangeletti10y ago· 4 in thread

This is not meant to be purely controversial, but I thought long and hard about WSJ back a few months ago when HN mod (always forget his name) said to stop complaining about HN links being posted because paywalls were ok. I agree paywalls are ok. But some things are not ok.

Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles). They want me to pay, and they want me to see ads, and they want to track my behavior? Should I send them my DNA also?

Organizations like WSJ are exactly the disease that causes ad blockers to proliferate and ruin the web for all the decent publishers. They're at war with my privacy (by breaking their site intentionally when I visit with a blocker on). They want it all, ads, tracking, your private data, and subscription revenue, not to mention...

# Agenda-Driven Content

I mean, we're basically talking about NBC or Fox here, just on the web. Imagine every morning when you woke up you turned on the television and tuned to some "news" show. After talking about the weather, they start talking about a lost pickle that is thought to be potentially alive and moving about with free will. Over the next two years, talk about the same pickle extends to every other TV show. Before you know it, everybody in the nation is talking about the same pickle. Years go by, and that pickle has become a part of our society, and that's not because people are born with an innate care for the well-being of pickles, but because "news" shows taught them to be.

That's not a good position to be in. I have to believe I'm not the only one in here that doesn't watch any TV. So, why do we all treat the same media giants differently on the web? We crave their content so much that we build browser add-ons to get to their content, etc.

Laaw10y ago

If they make sending your DNA a requirement of consuming their content, then yes, you send it to them if you want their content. That's their right, as owners of something, to dictate its use.

You aren't entitled to WSJ.com, NBC, or Fox.

mangeletti10y ago

I'm not a collectivist, nor do I believe I'm "entitled" to anything, including human rights. I'm simply stating that "I don't support organizations like WSJ"... as you can clearly see, if you read my comment. I don't propose anyone ban them. I simply "don't know why we [as a society] support them", also clearly in my comment. My point, which your political diatribe disallowed you to notice, was that we're fighting awful hard to consume their garbage. Meanwhile, there is plenty of content out there that is free, not because of socialism or entitlement, but for the same reason as blockers became popular... because information is so readily available now that nobody is ready to send their DNA in.

oneeyedpigeon10y ago

> Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles).

I don't actually see what you're referring to; maybe it's because I get redirected to http://www.wsj.com/europe. Maybe I have a different ad-blocker. Either way, it reminds me somewhat of NME's [1] homepage (New Musical Express, a popular music publication; not sure if it's really known outside of the UK). They deliver their images in such a way that they fall foul of my ad-blocker, although I haven't looked in enough depth to be certain whether this is a way of preventing ad-blockers, or purely unintentional.

[1] http://www.nme.com/

0xffff210y ago

>Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles)

Uh, what? Using uBlock origin, when I visit wsj.com I get what looks like a perfectly normal page. Nothing is scrambled at all.

anewhnaccount210y ago· 3 in thread

If this is true, what WSJ is doing is called "cloaking" and should cause it to get de-indexed: https://support.google.com/webmasters/answer/66355?hl=en

vonklaus10y ago

conversely, everyone should actively cloak and use random generated numbers to dynamically serve variants of their content similar to how mapmakers use trap streets. That way, like, another company wouldn't be profiting directly off of their work and threatening to sort of, destroy their entire business if you disagreed.

jonknee10y ago

What? Any site can very easily not be in Google if they choose to. It's a very dumb decision for a news site, but you're free to do it.

jacalata10y ago

Ha yea, why let Google have the power to destroy your business when you can burn it to the ground yourself!

matt_wulfeck10y ago· 3 in thread

I like wsj but I only read maybe 1 article every other day. They need a more reasonable price point, especially since the market will bear almost no price at all.

That being said I do enjoy their content, save for maybe the op-eds.

acomjean10y ago

I'm surprised that most online papers won't sell you one day's worth online for a buck or so. Like buying a real newspaper.

They all seem to want to sell subscriptions, which are perpetual and probably difficult to cancel..

matt_wulfeck10y ago

Even a dollar a day. I can pay Netflix $10 a month and stream unlimited HD video, but wsj wants $30 to read the first few paragraphs of a few articles a day?

The pricing here is much too aggressive

germanier10y ago

You can buy single WSJ print or online articles for 29 cents on Blendle or today's whole paper for 3.20.

I don't know if it's available in the US yet but they are at least planning to launch in the near future.

crazysim10y ago· 3 in thread

Doesn't this kind of also hurt SEO? I would guess Google has some automated system to detect and apply a negative signal to sites that provide different content to a Googlebot user agent than a non-Googlebot user agent. I guess these sites are counting that the other signals outweigh that negative hit.

Otherwise, why would expertsexchange be obligated to provide the answers at the very bottom? Did something change?

eli10y ago

I'm 99% sure I've encountered a Googlebot crawling pages with the UA of a regular browser, presumably for exactly this purpose.

chinathrow10y ago

They do this via an iPhone like UA, though googlebot is still in there too.

skocznymroczny10y ago

expert sex change

jdunck10y ago· 3 in thread

If Google (or any other crawler) wanted to play nice with paywalls, they could issue a public key for their bot, and put a signature in their User Agent string that the domain could then verify.

Those signatures could obviously leak, but on a per-domain basis. Perhaps the domains could have a secure way of bumping the valid key generation if they had a leak.
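
A rough Python sketch of the per-domain signed User-Agent idea. Python's standard library has no public-key signing, so this swaps in an HMAC over a per-domain shared secret instead of the public-key signature proposed above; the token format and function names are invented for illustration.

```python
import hashlib
import hmac

def _tag(base_ua: str, domain: str, key: bytes) -> str:
    """HMAC-SHA256 over the UA and the domain it is meant for."""
    return hmac.new(key, f"{base_ua}|{domain}".encode(), hashlib.sha256).hexdigest()

def sign_user_agent(base_ua: str, domain: str, key: bytes) -> str:
    """Crawler side: append a per-domain tag to the User-Agent string."""
    return f"{base_ua} sig={_tag(base_ua, domain, key)}"

def verify_user_agent(user_agent: str, domain: str, key: bytes) -> bool:
    """Site side: recompute the tag and compare in constant time."""
    base_ua, sep, tag = user_agent.rpartition(" sig=")
    if not sep:
        return False  # no signature present
    expected = _tag(base_ua, domain, key)
    return hmac.compare_digest(tag.encode(), expected.encode())

key = b"secret-shared-with-example.com"  # hypothetical per-domain key
ua = sign_user_agent("ExampleBot/1.0", "example.com", key)
print(verify_user_agent(ua, "example.com", key))    # True
print(verify_user_agent(ua, "other.example", key))  # False
```

Because the domain is part of the signed message, a leaked header only works against the one site it was issued for. A static tag can still be replayed by anyone who sees it, which is the leak scenario the comment raises; rotating the key is the "bump the generation" step.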

AnthonyMouse10y ago

There are two problems with this.

First, they don't want to. In fact, if a search engine can figure out that a link is going to lead to a paywall, they'll probably want to reduce the ranking of the result, because the user is not going to want results they can't actually look at.

Second, it would be a massive antitrust violation because it would prevent access by competing crawlers. The only way around that is to allow access to anyone who claims they're a crawler, which was the original problem.

LunaSea10y ago

The current situation with the WSJ could already be considered an antitrust violation. It's whitelisting one crawler and leaving the other ones out.

desdiv10y ago

Google (and every other major search engine) already provide a way, i.e. reverse DNS lookup, to authenticate bot ownership:

https://support.google.com/webmasters/answer/80553?hl=en

AFAIK no content provider actually does this check though.

mchahn10y ago· 3 in thread

Bypassing the paywall is more unethical than blocking ads. It is one thing to have control over your own browser but another to steal something from another site.

Also, isn't it illegal to bypass computer security?

nkrisc10y ago

How is modifying your own request headers any different than choosing to not display content returned in the response body?

Their server can choose to do what it wants with your request and you can choose what to do with the response it sends.

Are User-Agent headers legally protected identities?

drostie10y ago

Unfortunately, yes. The law legally protects anything which you might use to gain unauthorized access; that includes e.g. a password field. (That is, it is indeed breaking into a system if you type in the correct password, say by reading it on a post-it on someone's monitor, but you are not supposed to know that password.) This sort of thing makes the entire business of law complicated and impossible to automate.

Then again, lots of sci-fi dystopias are dreams of an automated law that somehow destroys the fabric of society, so...

effie10y ago

The difference is substantial; in principle, one can modify his request to intentionally bypass the authorization mechanism occurring on the server. One cannot mislead anyone/anything by displaying his data in a customized way in private on his computer.

ikeboy10y ago· 3 in thread

New workaround: paste the article title into archive.is. I don't know what they're doing but they have a workaround of some sort.

mikedmiked10y ago

This is what I do.

I actually made a bookmarklet with the following pasted into the URL, so you can do it in a single click:

javascript:void(open('https://archive.is/?run=1&url='+encodeURIComponent(document....)
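The bookmarklet above is cut off by the site, so I won't guess at the missing part. As a rough sketch of the visible part only, this is how the same submit URL could be built in Python for any page URL you pass in. The function name is hypothetical, and the prefix is copied from the bookmarklet as shown.

```python
from urllib.parse import quote

# Prefix copied verbatim from the visible part of the bookmarklet.
ARCHIVE_PREFIX = "https://archive.is/?run=1&url="


def archive_submit_url(page_url):
    """Append a percent-encoded page URL to the archive prefix.

    The `safe` set mirrors JavaScript's encodeURIComponent, which leaves
    letters, digits and -_.!~*'() unescaped.
    """
    return ARCHIVE_PREFIX + quote(page_url, safe="-_.!~*'()")
```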

LunaSea10y ago

That's actually really interesting! Could anyone chime in and explain how they might work around this issue?

ikeboy10y ago

My guess is they have a login, but they could just be using some workaround like described in OP.

I've used them to save Facebook posts before, and the pages were logged in to some "Nathan" IIRC. They probably have a bunch of hacks for specific sites that needed fixing.

sylvinus10y ago· 2 in thread

Well, that trick won't last long either. It's trivial to verify that an IP indeed belongs to Google:

https://support.google.com/webmasters/answer/80553?hl=en

dogma113810y ago

Seems to work if you deploy a proxy on Google's app engine and use it to access WSJ ;)

slig10y ago

An App Engine instance wouldn't have an IP with a reverse DNS of *.googlebot.com, would it?

hueving10y ago· 2 in thread

Based on the comments here, am I to understand that by constantly browsing the web with my user agent string set to a Googlebot string, I am committing a felony? How would I even know which sites I'm gaining unauthorized access to?

That is completely idiotic if there is a string you can put in a Mozilla browser config that is literally illegal to browse the web with.

effie10y ago

I do not think using any User-Agent alone constitutes a crime. There are valid non-criminal reasons why one would like to use Googlebot's or another User-Agent. I think it is the intent to bypass the paywall plus success in doing so that may be regarded as an offense or even a crime, but I'm not sure.

majewsky10y ago

Good luck trying to argue to a judge that the law is forbidden from being idiotic. ;)

GigabyteCoin10y ago· 2 in thread

I was under the impression that the "hack" whereby you searched for the article on Google and clicked through to that article (effectively skipping over the paywall) was a demand of Google's and not an oversight by the paywalled website.

I thought that google deemed providing search results which were behind paywalls as a "bad experience" for their search users, and would penalize websites for doing so.

Is this no longer the case?

elaineoOP10y ago

Google doesn't demand anything. If your paywalled website is not accessible by Google's crawler, then Google will not index it. Publishers want Google to index their pages and drive potential paying visitors, which is why they open the loophole themselves.

For the second point, Google does require that publishers specify "registration required" in their sitemap.

GigabyteCoin10y ago

If you're showing Googlebot one thing, and visitors who visit your website through google another, that's essentially "cloaking" (a blackhat SEO technique). At least it used to be.

coverband10y ago· 2 in thread

My Windows anti-virus deletes the linked sample code automatically upon download, marking it as "Trojan:Win32/Spursint.A". Did anyone have the same experience? (I was actually more interested in using it as a template for writing a simple Chrome extension.)

mattmaroon10y ago

Yep. I then pasted it but it didn't work on wsj.com. Oh well.

elaineoOP10y ago

try deleting cookies, then hit refresh.

dude_abides10y ago· 2 in thread

Or simply use incognito mode and click on Google search result.

mrmcd10y ago

Did you read the article? It talks about how that trick no longer works on a lot of sites because they are now checking User-Agent strings too.

lstamour10y ago

Actually I noticed sites have simply changed policies -- if you're a regular visitor your cookies will identify you and block content. The Incognito mode trick works for WSJ and others that would still check the referrer header. Allowing Googlebot access and checking the referrer header are two different things.

Also, Google has published IP addresses it uses, so this extension might not last long...
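To spell out why those are two different things, here is a hypothetical server-side sketch in Python. I have no idea what WSJ actually runs; the cookie name `views`, the one-free-view limit and the function names are all invented for illustration. The crawler rule looks only at User-Agent, while the visitor rule looks at Referer plus a per-browser counter, which is why a fresh incognito window resets it.

```python
from urllib.parse import urlparse


def claims_googlebot(headers):
    """Crawler rule: trusts the self-reported User-Agent string."""
    return "googlebot" in headers.get("User-Agent", "").lower()


def came_from_google(headers):
    """Visitor rule: checks whether the Referer host is google.com or a subdomain."""
    host = urlparse(headers.get("Referer", "")).hostname or ""
    return host == "google.com" or host.endswith(".google.com")


def serve_full_article(headers, cookies, max_free_views=1):
    """Decide whether to serve the full article (True) or the paywall (False)."""
    if claims_googlebot(headers):
        return True
    if came_from_google(headers):
        # Assumes the hypothetical "views" cookie holds a decimal count.
        return int(cookies.get("views", "0")) < max_free_views
    return False
```

A request can pass one rule and fail the other, so closing the Referer path does nothing about the User-Agent path, and vice versa.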

eps10y ago· 1 in thread

Correct me if I'm wrong, but wasn't there a long-standing Google policy that the version of the page served to their crawler must also be publicly accessible? That would then be the reason why WSJ articles were accessible through the paste-into-google trick, rather than because WSJ was incompetent and failed to "fix" the bypass.

So does it mean that Google will no longer index full WSJ articles or does it mean a change in Google's policy?

morgante10y ago

You are correct, Google requires that you let users see the first click for free if you want to index content behind a paywall. [1]

Since this is billed as an "experiment" I'm guessing that WSJ is just testing the waters. If they roll it out to everyone, they will have to serve only snippets to Google or risk getting delisted.

[1] https://support.google.com/news/publisher/answer/40543?hl=en

mikemikemike10y ago· 1 in thread

This is an odd debate. Let's say a restaurant declares "veterans eat free." This blog post is like a friend telling you "Hey if you tell this restaurant you're a vet they'll give you a free meal." No one said it's legal or ethical. It's lying to trick someone into giving you something at their expense.

I think the relevant point, underscored by the author's last sentence, is it doesn't matter who you open a back door for - it opens the possibility for anyone to barge through.

elaineoOP10y ago

that's a good analogy.

zem10y ago· 1 in thread

i thought of doing that when the "search google" trick stopped working, but i decided it crossed the point where i would feel like i was unfairly circumventing their clear desire not to serve me the content. i've just added wsj to my mental ignore list and count it as a few more minutes gained to do something else.

ivansavz10y ago

Yeah, same here. Every time I get to a tab where I see a paywall I just close that tab, probably saving 5-10 mins of my life!

mikestew10y ago· 1 in thread

So does HN now choose to not post articles from the WSJ? I was comfortable with the "google it" trick, and frankly was a little annoyed with constant "paywall, wah!" comments when what should be by now a well-known workaround was available. But that workaround no longer works.

mark-r10y ago

They've been testing the new wall for a while now. I know I made one of those "paywall wah" comments when the Google workaround didn't work for me. Then the next time I tried it worked fine, so it must have been random selection.

warrenmar10y ago· 1 in thread

You can also access WSJ for free at the library.

creativityhurts10y ago

It reminds me of this: "Trying to save a quarter..." https://www.youtube.com/watch?v=j4nRHHPpnVc

systemz10y ago· 1 in thread

So their next move is to check if the IP is from Google

philip120910y ago

https://cloud.google.com/

obelisk_10y ago· 1 in thread

1. Google's Web Crawlers are not "bypassing" the paywall. It's the paywall that lets crawlers through. I.e. exactly the reverse of what the author implies with their headline.

2. The idea that this is somehow new is wrong. The way for a server to identify crawlers has "always" been to look at the user-agent and, when done right, the IP, verified either by net block owner or by doing a PTR lookup and then checking that the A or AAAA record for the claimed host points back at the same IPv4 or IPv6 address. I do agree that paywalling is a more recent phenomenon, at least with regard to the extent to which it is popular among sites today. But the concept of presenting different data to crawlers and visitors arose much earlier; Google has long been aware of it and has made sure to delist such sites when found. Since then, Google has in fact moved a bit in the direction of allowing it, in that they permit it for Google News if declared, as explained by others ITT.

So in my view, it seems that the author is jumping to incorrect conclusions based on an incomplete understanding of what's actually going on here. What then about the HN readership, how come this article became so highly voted and I don't see these issues raised by anyone else? Or maybe I'm just crazy?

tomkwok10y ago

> Google's Web Crawlers are not "bypassing" the paywall. It's the paywall that lets crawlers through. I.e. exactly the reverse of what the author implies with their headline.

Don't nitpick. It's just a shortened version of How To "Be" Google’s Web Crawler to Bypass Paywalls. You get it. I get it. Everyone gets it.

kenshaw10y ago

Basically, the article says to change the User-Agent to GoogleBot or Bing or whatever other crawler UA you'd prefer. While that's doable, it is easily detected and prevented, as all of the big crawlers can be validated against DNS.

Additionally, I would like to point out that I wrote a Varnish extension for the express purpose of validating User-Agent strings through DNS lookups; it is available here: https://github.com/knq/libvmod-dns

It was built because we specifically had a problem with bad bots crawling a large site (multiply.com), and this was one of the easiest ways to filter out the bad bots from the good and to enforce robots.txt policies on a per-bot basis. It works very well, as you can do any kind of DNS caching internally and prevent this kind of behavior, if that's your goal.
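To illustrate the caching point (this is not how libvmod-dns is implemented, and it is Python rather than Varnish VCL), here is a small sketch of a time-limited cache wrapped around any IP verification function. The class name, the one-hour default and the injectable clock are my own assumptions.

```python
import time


class CachedVerifier:
    """Memoize a slow `verify(ip) -> bool` check for `ttl_seconds`."""

    def __init__(self, verify, ttl_seconds=3600.0, clock=time.monotonic):
        self._verify = verify
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # ip -> (time stored, result)

    def __call__(self, ip):
        now = self._clock()
        hit = self._cache.get(ip)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh, skip the DNS round trip
        result = self._verify(ip)
        self._cache[ip] = (now, result)
        return result
```

Negative results are cached as well, so an impostor hammering the site costs one lookup per TTL window instead of one per request. A production version would also need to bound the cache size.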

jrochkind110y ago

I thought Google specifically disallowed returning different pages based on User-Agent targeting googlebot, and this included paywalls.

Are they running afoul of Google policies and going to get pinged by Google?

I can't find the text from Google now (when can you ever find any docs at google?), but I am very certain I remember reading from them that you may not return different content to GoogleBot based on User-Agent.

Gratsby10y ago

If you hit a paywall or a "sign up to access this content" message from a google search result, report it. Google will remove them from the search results, they will lose their largest traffic source, and they will address the issue. Or they won't because they have enough paying customers.

chrishn10y ago

> Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody.

coughNSAcough

tete10y ago

Doesn't Google usually try to punish websites that show users something different, and even mention that somewhere?

Not an SEO Expert here, but wonder how and whether Google will end up handling that. I mean making an exception could also be considered abuse of power in some countries of the world. Don't have any strong opinion yet on that, just saying that because of how the EU exercised certain laws in recent years.

Illniyar10y ago

Aren't you supposed to verify if a visitor is a googlebot by reverse lookup of the IP address? I.E.: https://support.google.com/webmasters/answer/80553?hl=en

User-agents are notoriously unreliable.

philip120910y ago

I wonder how many Google Cloud customers use the servers to run spoofed Googlebot crawlers from the Google IP range in order to bypass paywalls and scrape large sites (like LinkedIn) without hindrance.

0xCMP10y ago

It's broken already. Tried to access an article about new china rules for online news and it pay-walled me. They're probably looking for clients coming from googlebot.com now.

mildweed10y ago

Solution:

Content providers register a (yet-to-be-written) Google News API account, get an API key, with which Google indexes the site and the site recognizes as legit.

jasonwilk10y ago

I've noticed that this has stopped working on WSJ if you've already hit the paywall and try to google the article to bypass.

f13710y ago

I wonder if anybody tried to do as suggested? I copied the files to Chrome as per instructions, and the paywall was still in place.

jupp0r10y ago

It's not bypassing at all. Google's crawlers are deliberately let in because a paywall that nobody runs into is useless.

chinathrow10y ago

So soon they have to block anyone with a fake Google UA and whitelist the well known 66.249 IP range. Trivial.
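A rough Python sketch of that check, using the standard `ipaddress` module. Note the big assumption: I'm reading "66.249" loosely as the whole 66.249.0.0/16 block, which is not an authoritative list of crawler addresses, and the function names are hypothetical.

```python
import ipaddress

# Assumption: "66.249" from the comment, read as a /16. Not an official list.
ASSUMED_CRAWLER_NETS = [ipaddress.ip_network("66.249.0.0/16")]


def in_allowed_range(ip, networks=ASSUMED_CRAWLER_NETS):
    """True if `ip` parses as an address and falls inside one of `networks`."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(addr in net for net in networks)


def allow_as_crawler(user_agent, ip):
    """Require both the claimed User-Agent and a matching source address."""
    return "googlebot" in user_agent.lower() and in_allowed_range(ip)
```

As others in the thread point out, a hard-coded range goes stale, and a range wide enough to include other Google-hosted machines would let those through too.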

yyin10y ago

Does WSJ check visits from a Googlebot UA against a list of known Google IP addresses?

amelius10y ago

Fix: replace the user agent string by a cryptographic challenge/response scheme.
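One possible shape for this, as a Python sketch: the site hands the crawler a random nonce and the crawler answers with an HMAC over it. The shared-secret design is my own assumption (the comment doesn't say which scheme), the function names are hypothetical, and no search engine offers such a protocol as far as this thread establishes.

```python
import hashlib
import hmac
import secrets


def new_challenge():
    """Site side: a fresh random nonce as 32 hex characters."""
    return secrets.token_hex(16)


def sign_challenge(shared_key, challenge):
    """Crawler side: HMAC-SHA256 of the nonce under the shared key."""
    return hmac.new(shared_key, challenge.encode("ascii"), hashlib.sha256).hexdigest()


def verify_response(shared_key, challenge, response):
    """Site side: recompute the HMAC and compare in constant time."""
    expected = sign_challenge(shared_key, challenge)
    return hmac.compare_digest(expected, response)
```

Unlike a User-Agent string, a captured response is useless for the next request as long as the site never accepts the same nonce twice. A public-key signature would avoid sharing a secret with every site, at the cost of more machinery.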

pmontra10y ago

They'll start allowing only the IP addresses that search engines have agreed on with them.

daveheq10y ago

Possible in Firefox? Some people won't use Chrome.

spitfire10y ago

Is there a version of this available for Safari?

throwaway2181610y ago

>Archaic news source does something to hurt their market penetration to internet

Great idea here guys

LunaSea10y ago

The current situation with the WSJ could already be considered an antitrust violation. It's whitelisting one crawler and leaving the other ones out.

desdiv10y ago

Google (and every other major search engine) already provide a way, i.e. reverse DNS lookup, to authentic bot ownership:

https://support.google.com/webmasters/answer/80553?hl=en

AFAIK no content provider actually does this check though.

mchahn10y ago· 3 in thread

Bypassing the paywall is more unethical that blocking ads. It is one thing to have control over your own browser but another to steal something from another site.

Also, isn't it illegal to bypass computer security?

nkrisc10y ago

How is modifying your own request headers any different than choosing to not display content returned in the response body?

Their server can choose to do what it wants with your request and you can choose what to do with the response it sends.

Are User-Agent headers legally protected identities?

drostie10y ago

Then again, lots of sci-fi dystopias are dreams of an automated law that somehow destroys the fabric of society, so...

effie10y ago

1 more reply

ikeboy10y ago· 3 in thread

New workaround: paste the article title into archive.is. I don't know what they're doing but they have a workaround of some sort.

mikedmiked10y ago

This is what I do.

I actually made a bookmarklet with the following pasted into the URL, so you can do it in a single click:

javascript:void(open('https://archive.is/?run=1&url='+encodeURIComponent(document....)

LunaSea10y ago

That's actually really interesting! Could anyone chime in and explain how they might work around this issue?

ikeboy10y ago

My guess is they have a login, but they could just be using some workaround like described in OP.

I've used them to save Facebook posts before, and the pages were logged in to some "Nathan" IIRC. They probably have a bunch of hacks for specific sites that needed fixing.

sylvinus10y ago· 2 in thread

Well, that trick won't last long either. It's trivial to verify that an IP indeed belongs to Google:

https://support.google.com/webmasters/answer/80553?hl=en

dogma113810y ago

Seems to work if you deploy a proxy on Google's app engine and use it to access WSJ ;)

slig10y ago

A App Engine wouldn't have a IP with a reverse DNS *.googlebot.com, would it?

1 more reply

hueving10y ago· 2 in thread

That is completely idiotic if there is a string you can put in a Mozilla browser config that is literally illegal to browse the web with.

effie10y ago

majewsky10y ago

Good luck trying to argue to a judge that the law is forbidden from being idiotic. ;)

GigabyteCoin10y ago· 2 in thread

I thought that google deemed providing search results which were behind paywalls as a "bad experience" for their search users, and would penalize websites for doing so.

Is this no longer the case?

elaineoOP10y ago

For the second point, Google does require that publishers specify "registration required" in their sitemap.

GigabyteCoin10y ago

If you're showing Googlebot one thing, and visitors who visit your website through google another, that's essentially "cloaking" (a blackhat SEO technique). At least it used to be.

coverband10y ago· 2 in thread

mattmaroon10y ago

Yep. I then pasted it but it didn't work on wsj.com. Oh well.

elaineoOP10y ago

try deleting cookies, then hit refresh.

1 more reply

dude_abides10y ago· 2 in thread

Or simply use incognito mode and click on Google search result.

mrmcd10y ago

Did you read the article. It talks about how that trick no longer works on a lot of sites because they are now checking User-Agent strings too.

lstamour10y ago

Also, Google has published IP addresses it uses, so this extension might not last long...

1 more reply

eps10y ago· 1 in thread

So does it mean that Google will no longer index full WSJ articles or does it mean a change in the Google's policy?

morgante10y ago

You are correct, Google requires that you let users see the first click for free if you want to index content behind a paywall. [1]

Since this is billed as an "experiment" I'm guessing that WSJ is just testing the waters. If they roll it out to everyone, they will have to serve only snippets to Google or risk getting delisted.

[1] https://support.google.com/news/publisher/answer/40543?hl=en

mikemikemike10y ago· 1 in thread

I think the relevant point, underscored by the author's last sentence, is it doesn't matter who you open a back door for - it opens the possibility for anyone to barge through.

elaineoOP10y ago

that's a good analogy.

zem10y ago· 1 in thread

ivansavz10y ago

Yeah, same here. Every time I get to a tab where I see a paywall I just close that tab, probably saving 5-10 mins of my life!

mikestew10y ago· 1 in thread

mark-r10y ago

warrenmar10y ago· 1 in thread

You can also access WSJ for free at the library.

creativityhurts10y ago

It reminds me of this: "Trying to save a quarter..." https://www.youtube.com/watch?v=j4nRHHPpnVc

systemz10y ago· 1 in thread

So their next move is check if IP is from Google

philip120910y ago

https://cloud.google.com/

obelisk_10y ago· 1 in thread

1. Google's Web Crawlers are not "bypassing" paywall. It's the paywall that let's crawlers through. I.e. exactly the reverse of what the author implies with their headline.

tomkwok10y ago

> Google's Web Crawlers are not "bypassing" paywall. It's the paywall that let's crawlers through. I.e. exactly the reverse of what the author implies with their headline.

Don't nitpick. It's just a shortened version of How To "Be" a Google’s Web Crawler to Bypass Paywalls. You get it. I get it. Everyone gets it.

kenshaw10y ago

jrochkind110y ago

I thought Google specifically disallowed returning different pages based on User-Agent targeting Googlebot, and this included paywalls.

Are they running afoul of Google policies and going to get pinged by Google?

Gratsby10y ago

chrishn10y ago

> Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody.

coughNSAcough

tete10y ago

Doesn't Google usually try to punish websites that show users something different and even mentions that somewhere?

Illniyar10y ago

Aren't you supposed to verify if a visitor is a googlebot by reverse lookup of the IP address? I.E.: https://support.google.com/webmasters/answer/80553?hl=en

User-agents are notoriously unreliable.
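The reverse-lookup check mentioned above could look roughly like the sketch below: resolve the visiting IP to a hostname, confirm the hostname sits under a Google crawler domain, then resolve that hostname forward and confirm it maps back to the same IP. The function name `is_verified_googlebot`, the injectable resolver parameters, and the suffix list are assumptions made for illustration; this version only handles IPv4 and is not a complete implementation.

```python
import socket

# Assumed domain suffixes for Google crawler hostnames (illustrative).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse + forward DNS check.

    The resolver arguments are hypothetical hooks so the logic can be
    exercised without network access; by default they use the socket module.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)  # step 1: IP -> hostname
    except OSError:
        return False

    # step 2: hostname must end in a Google crawler domain
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # step 3: hostname -> IPs must include the original IP
        return ip in forward_lookup(host)
    except OSError:
        return False
```

The forward lookup matters because whoever controls the reverse DNS zone for an IP can make it claim any hostname; only the owner of the real domain can make that hostname resolve back to the IP.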

philip120910y ago

0xCMP10y ago

It's broken already. I tried to access an article about the new China rules for online news and it paywalled me. They're probably looking for clients coming from googlebot.com now.

mildweed10y ago

Solution:

Content providers register a (yet-to-be-written) Google News API account, get an API key, with which Google indexes the site and the site recognizes as legit.

jasonwilk10y ago

I've noticed that this has stopped working on WSJ if you've already hit the paywall and try to google the article to bypass.

f13710y ago

I wonder if anybody tried to do as suggested? I copied the files to Chrome as per instructions, and the paywall was still in place.

jupp0r10y ago

It's not bypassing at all. Google's crawlers are deliberately let in because a paywall that nobody runs into is useless.

chinathrow10y ago

So soon they have to block anyone with a fake Google UA and whitelist the well known 66.249 IP range. Trivial.
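A rough sketch of that server-side check, for illustration only: a request that claims to be Googlebot is treated as a crawler only when its source address also falls inside an allowlisted network. The `66.249.0.0/16` network is just a literal reading of "the 66.249 range" from the comment and is an assumption, as are the function name `classify_request` and its return labels; a real deployment would use whatever ranges Google actually publishes.

```python
import ipaddress

# Assumption: the "66.249" range from the comment, read as a /16.
ASSUMED_CRAWLER_NETWORKS = [ipaddress.ip_network("66.249.0.0/16")]


def classify_request(user_agent, remote_ip):
    """Classify a request as 'regular', 'crawler', or 'spoofed'."""
    # Anything not claiming to be Googlebot goes through the normal paywall.
    if "googlebot" not in user_agent.lower():
        return "regular"

    addr = ipaddress.ip_address(remote_ip)
    # The UA claim is only honored if the source IP is allowlisted.
    if any(addr in net for net in ASSUMED_CRAWLER_NETWORKS):
        return "crawler"
    return "spoofed"
```

With a check like this, changing the User-Agent in a browser extension no longer helps, because the extension cannot change the address the request comes from.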

yyin10y ago

Does WSJ check visits from a Googlebot UA against a list of known Google IP addresses?

amelius10y ago

Fix: replace the user agent string by a cryptographic challenge/response scheme.
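One way such a scheme might work, sketched under stated assumptions: the site and the crawler share a secret key in advance, the site sends a random challenge, and the crawler returns an HMAC of that challenge. The comment does not specify a scheme, so the choice of HMAC-SHA256 with a pre-shared key, and all function names here, are illustrative assumptions.

```python
import hashlib
import hmac
import secrets


def new_challenge():
    """Site side: generate a fresh random challenge for each request."""
    return secrets.token_bytes(32)


def sign_challenge(shared_key, challenge):
    """Crawler side: prove knowledge of the key without sending it."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def verify_response(shared_key, challenge, response):
    """Site side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

Each challenge would have to be used only once; otherwise a captured response could simply be replayed.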

pmontra10y ago

They'll start allowing only the IP addresses that search engines have agreed on with them.

daveheq10y ago

Possible in Firefox? Some people won't use Chrome.

spitfire10y ago

Is there a version of this available for Safari?

throwaway2181610y ago

>Archaic news source does something to hurt their market penetration to internet

Great idea here guys
