Ask HN: How does archive.is bypass paywalls?

Hamuko3y ago

Would it be possible to check if archive.is is logged into a newspaper site by archiving one of the user management pages?

hoofhearted3y ago

Negative. I used to assume this as well, but they somehow also bypass local paywalls which have gotten me temporarily banned from r/Baltimore lol.

They can somehow even bypass the Baltimore Sun's paywall, and I doubt they have subscriptions to every regional paper. Could they?

jrochkind13y ago

Wait, you got banned from /r/Baltimore for posting archive.is links there? That's against the rules there? I would not have known that myself! (Also a Baltimorean).

https://www.wsj.com/articles/freeze-or-cut-spending-fight-is... https://amp.wsj.com/articles/freeze-or-cut-spending-fight-is...

fleroviumOP3y ago

But is it true? What evidence is there?

This is a plausible explanation but is it true?

stevefan19993y ago

So scihub but for newspapers

janejeon3y ago· 6 in thread

> If it identifies itself as archive.is, then other people could identify themselves the same way.

Theoretically, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would allow websites to tell whether a request identifying itself as archive.is is actually from them (it fits one of the IP ranges) or from a fraudster.
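A minimal sketch of what that check could look like on the website's side, in Python. The ranges below are documentation-only placeholder addresses, not anything archive.is has published, and `is_from_published_range` is a made-up helper name.

```python
import ipaddress

# Hypothetical "published" ranges. These are reserved documentation
# addresses, NOT real archive.is ranges.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def is_from_published_range(remote_ip, ranges=PUBLISHED_RANGES):
    """Return True if remote_ip falls inside any of the published CIDR ranges."""
    addr = ipaddress.ip_address(remote_ip)
    for cidr in ranges:
        network = ipaddress.ip_network(cidr)
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so a mixed list of ranges is fine here.
        if addr in network:
            return True
    return False


if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.7", "2001:db8::1"):
        print(ip, is_from_published_range(ip))
```

A site would only trust the self-identifying User-Agent when the connecting IP also passes this check.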

lazzlazzlazz3y ago

It would be far better and more secure for archive.is to publish a public key on its site and then sign requests with its private key, which sites could optionally verify.

sublinear3y ago

You just described client certificate auth

facile3y ago

+1 on this!

fleroviumOP3y ago

In theory, this might work. But is it true? Do lots of sites have an archive.is whitelist?

arbitrage3y ago

I really don't see why they would, if they're using a paywall in the first place.

w1nst0nsm1th3y ago

Follow the magnolia trail...

Miner49er3y ago· 6 in thread

According to their blog they use AMP: https://blog.archive.today/post/675805841411178496/how-does-...

fleroviumOP3y ago

This explanation is incomplete. Counterexample:

Amp pages are paywalled:

archive.is isn't: https://archive.md/LaiOX

Deathmax3y ago

For WSJ at least, it appears that archive.is is fetching the AMP page, which returns the full content of the article but hides it with CSS, and then modifying the page to unhide the paywalled content and hide ads.

It might be using other techniques as well for bypassing paywalls, such as Referer or User-Agent spoofing: some old archives of sites that echo back HTTP request headers show archive.is sending a Referer of google.co.uk.

Try this: https://www.wsj.com/amp/articles/freeze-or-cut-spending-figh...
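A rough Python sketch of the "unhide" step described above. It assumes the gated section is marked with AMP's `amp-access-hide` attribute; whether WSJ's AMP pages work exactly this way is not confirmed in this thread, and the sample markup and function name are invented for illustration.

```python
import re

# Invented sample markup: the full text is in the page, but one section
# carries an attribute that keeps it hidden until access is granted.
SAMPLE_AMP_HTML = (
    "<article>"
    "<p>Free teaser paragraph.</p>"
    '<section amp-access="access" amp-access-hide>'
    "<p>Rest of the article.</p>"
    "</section>"
    "</article>"
)


def unhide_amp_sections(html):
    """Remove the attribute that keeps gated sections hidden."""
    # Matches ` amp-access-hide` with or without an ="..." value, but not
    # longer attribute names that merely start with the same prefix.
    return re.sub(r'\s+amp-access-hide(?:="[^"]*")?(?=[\s/>])', "", html)


if __name__ == "__main__":
    print(unhide_amp_sections(SAMPLE_AMP_HTML))
```

A regex is enough for a sketch; anything serious would use a real HTML parser.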

Reventlov3y ago

I can access the wsj article without any account using https://gitlab.com/magnolia1234/bypass-paywalls-firefox-clea... (bypass paywall clean)

JohnFen3y ago

Wow, an actually good use for Amp? Amazing.

Aachen3y ago

I'm sure it was an accident or honest mistake!

lcnPylGDnU4H9OF3y ago· 5 in thread

I think it's a browser extension that people who have access to the article use to send the article data to the archive server.

phoenixreader3y ago

You mean the pages are crowdsourced? I don’t think so because many pages are archived only upon request. If I ask to archive a new page, archive.is provides it very quickly. This is not possible if the archive is built from crowdsourced data.

AlbertCory3y ago

That is how RECAP works ("Pacer" spelled backwards).

In that case, the government is fine with it.

wolverine8763y ago

I think that's how Sci-hub works, at least at some time in the past.

1 more reply

fleroviumOP3y ago

Can you explain? Who has purchased the subscription? I'm sure there's a no-redistribution clause in the subscription agreement.

lcnPylGDnU4H9OF3y ago

The person who installed the browser extension would be paying the subscription and ignoring said clause.

World1773y ago· 4 in thread

I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
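For anyone who wants to try this themselves, here is a small Python sketch that lists the user agents named in a robots.txt and checks what each one is allowed to fetch. The robots.txt content and the `ExampleArchiver` agent are made up; the parsing uses the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one named crawler gets broad access, everyone else
# is disallowed from the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleArchiver
Disallow: /account/

User-agent: *
Disallow: /
"""


def listed_user_agents(robots_txt):
    """Return the explicitly named user agents, in order, without '*'."""
    agents = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            name = value.strip()
            if name and name != "*" and name not in agents:
                agents.append(name)
    return agents


def allowed(robots_txt, user_agent, url):
    """Ask the stdlib parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story"
    for agent in listed_user_agents(SAMPLE_ROBOTS_TXT) + ["SomeOtherBot"]:
        print(agent, allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Note that robots.txt only states what crawlers are asked to do. Whether a site actually serves unpaywalled content to a given user agent is a separate server-side decision.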

jrochkind13y ago

That user-agent seems to be in the robots.txt as _disallowed_, but somehow it gets through the paywall? That seems counter-intuitive.

World1773y ago

It's just blocking the root. Look up the specifications for the robots.txt for more information. One purpose is to reduce loads on parts of the website that they do not want indexed.

fleroviumOP3y ago

That's an interesting idea, but is it true?

World1773y ago

Websites usually want their pages indexed by search engines, as it increases the traffic they receive. They also often try to allow archival usage. The robots.txt usually defines the user agents used by search engines, as one purpose is to reduce load on the website by not indexing pages that do not need to be indexed.

It might not be what is happening as there are other ways around, but this is a real possibility for how it could be done. (at least until the websites allowing other user agents decide they want to try to stop archive.is usage, etc)

edit: I think the probability is high that they have multiple methods for archiving a website. Many people in this thread say archive.is has previously stated that it just converts the link to an AMP link and archives it. I'm doubtful that's all they do, but it could be that too.

Using the robots.txt file in this way might not be how the authors of the website intended for it to be used. I could see that being used against them in a legal system if someone ever tried to stop them. In the past, I've seen websites tell people creating bots to purposefully change their user agent to one the site defined, but using it for a non-allowed purpose is what I was mentioning. Though, there are multiple ways they could be archiving a website, so this is not necessarily how it is being done.

chrisco2553y ago· 4 in thread

Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS and image files (it shows the individual requests completing before the archival process is complete). It must then snapshot the rendered output, then it renders those assets served from their domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.

strunz3y ago

You're missing the point of "how does it bypass firewalls"

hoofhearted3y ago

Surprisingly, nobody has mentioned this here yet. I’m thinking the key to this is SEO, SERPs, and newspapers wanting Google to find and index their content.

This is my best guess for this. I’ve really put some thought into this, and this is the best logical assumption I’ve arrived at. I used to be a master of crawlers using Selenium years ago, but that burned me out a little bit so I moved on.

To test my hypothesis, you can go and find any article on Google that you know is probably paywalled. You click the content google shows you, and you navigate into the site, and “bam! Paywall!”..

If it has a paywall for me, well then how did Google crawl and index all the metadata for the SERP if it has a paywall?

I have a long running theory that Archive.is knows how to work around an SEO trick that Google uses to get the content. Websites like the Baltimore Sun don’t want humans to view their content for free, but they do want Googlebot to see it for free.

chrisco2553y ago

Sorry, thought it was obvious. Since it's using backend infrastructure to fetch the assets, it can crawl them as a bot in the same way that search engines do, without allowing cookies to be saved. Since scripts are often involved in the full rendering of a page, it clearly does allow for the scripts to load before snapshotting the DOM. But only the DOM and the assets and styles are preserved. Scripts are not. Most paywalls are simple scripts. If you disable JS and cookies, you'll often see the full text of an article.
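A minimal Python sketch of the "keep the DOM, drop the scripts" part, using only the standard library. This is an illustration of the idea, not archive.is's actual pipeline; the sample HTML and class name are invented. It does not run JavaScript first, and it drops comments and the doctype for brevity.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML as-is, minus <script> elements and their contents."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, unchanged.
        if tag != "script" and not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_scripts(html):
    stripper = ScriptStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


if __name__ == "__main__":
    sample = '<div id="story"><p>Full text &amp; more.</p></div><script>showPaywall();</script>'
    print(strip_scripts(sample))
```

If the paywall is only a script that hides or removes content after load, the article text is still in the output.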

wackget3y ago

paywalls*

retrocryptid3y ago· 4 in thread

Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles so when people search for terms in the article they'll find a hit and then get referred to the pay wall.

Google doesn't want everyone to know what a Google indexing request looks like for fear the CEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...

Atlas223y ago

Just FYI, Google and Bing publish the user agent strings[1][2] for their crawlers. At least in my experience, most of the typical ad-infested and paywalled news sites won't display the paywall if you change the user agent to a crawler they prefer.

[1] https://developers.google.com/search/docs/crawling-indexing/...

[2] https://www.bing.com/webmasters/help/which-crawlers-does-bin...

wolverine8763y ago

Doesn't almost every site on the web know exactly what the Google bot looks like?

peter4223y ago

Google gives precise details about how to verify their bot is crawling your site and how to denote what content is paywalled and what isn’t.
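For reference, the paywall markup is schema.org structured data embedded in the page as JSON-LD. Below is a small Python sketch that generates it; the property names (`isAccessibleForFree`, `hasPart`, `cssSelector`) are as I remember them from Google's documentation, and the headline, selector and function name are placeholders, so check the official docs before relying on it.

```python
import json


def paywalled_article_markup(headline, paywalled_css_selector):
    """Build JSON-LD telling crawlers which part of the page is paywalled."""
    markup = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        # The article as a whole is not free to read...
        "isAccessibleForFree": False,
        # ...and this CSS selector marks the gated portion of the page.
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            "cssSelector": paywalled_css_selector,
        },
    }
    return json.dumps(markup, indent=2)


if __name__ == "__main__":
    # Placeholder values; the output would go inside a
    # <script type="application/ld+json"> element on the article page.
    print(paywalled_article_markup("Example headline", ".paywall"))
```

The point of the markup is that serving the crawler the full text while showing visitors a paywall is declared openly, rather than looking like cloaking.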

Aachen3y ago

Bingo. This is what I use to incentivize using a nonmonopolistic search engine to find the few sites I run.

xiekomb3y ago· 4 in thread

I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean

fleroviumOP3y ago

That extension does work, but do we know they use it?

marcod3y ago

They don't always use it, because I can archive a new page from my mobile phone browser, which doesn't even support extensions.

My guess is that most content providers with paywalls serve the entire content, so search engines can pick it up, and then use scripts to raise the paywall - archive.is takes their snapshot before that happens / doesn't trigger those scripts.

DrDentz3y ago

It's actually the opposite: for some news sites this extension links to archive.is because that's the only known way to bypass the paywall.

There are known ways to bypass paywalls that are impossible to implement within a browser extension but trivial for 12ft or archive.is. For example, using a Ukrainian residential proxy, since some news websites granted free access to visitors from Ukraine.

Yujf3y ago· 4 in thread

I don't know about archive.is, but 12ft.io does identify as google to bypass paywalls afaik

strunz3y ago

12ft.io also doesn't work or is disabled for many sites that archive.is still works on

hda1113y ago

Maybe because the creator of 12ft.io isn't anonymous

janejeon3y ago

Wouldn't sites be able to see that requests from 12ft.io aren't coming from Google's IPs?

dpifke3y ago

Yes.

Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...

You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json
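A hedged Python sketch of the reverse-then-forward DNS check. The accepted domain suffixes are my recollection of Google's documentation, so verify them against the linked page. The lookup functions are injectable so the logic can be tried without network access, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Suffixes as I recall them from Google's docs; treat as an assumption.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the hostname's domain, then forward-confirm it."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip)
        # Step 1: the PTR record must end in one of Google's crawler domains.
        if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # Step 2: the hostname must resolve back to the same IP; otherwise
        # anyone controlling their own reverse DNS could fake step 1.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False


if __name__ == "__main__":
    # Offline demo with fake resolvers and a documentation-only IP.
    print(is_verified_googlebot(
        "192.0.2.10",
        reverse_lookup=lambda addr: "crawl-192-0-2-10.googlebot.com",
        forward_lookup=lambda host: ["192.0.2.10"],
    ))
```

A User-Agent header alone proves nothing. This check is what lets a site tell a real Googlebot from a scraper claiming to be one.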

throwaway815233y ago· 3 in thread

Off topic but for years I've been using a one-off proxy to strip JavaScript and crap from my local newspaper site (sfgate.com). It just reads the site with Python's urllib.request and then does some DOM cleanup with Beautiful Soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.

Just in the past day or so, sfgate.com put in some kind of anti-scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with Selenium or maybe I'll just give up on newspapers and get my news from HN ;).

I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.

1ark3y ago

They are probably just checking headers such as user agent and cookies. I would copy whatever your normal browser sends and put it in the urllib.request call. If that doesn’t work, then it is likely more sophisticated.
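Something like this, as a sketch. The header values are illustrative examples of what a desktop Firefox might send, not values known to get past sfgate.com's check; copy the real ones from your own browser's network tab.

```python
import urllib.request

# Illustrative browser-like headers (assumed values, not a known-good set).
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_browser_like_request(url, headers=BROWSER_HEADERS):
    """Build a Request carrying browser-style headers instead of Python's defaults."""
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10):
    """Fetch url with the browser-like headers and return the raw body bytes."""
    request = build_browser_like_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Only builds the request; call fetch() to actually hit the network.
    request = build_browser_like_request("https://example.com/")
    print(request.header_items())
```

If the site still answers 403 with these headers, it is probably checking something beyond headers, such as whether JavaScript runs.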

throwaway815233y ago

I will try that, but a quick look at the error page makes me think it tries to run a javascript blob.

(https://archive.is/1h4UV)

withinboredom3y ago

Sounds like an ADA lawsuit waiting to happen. I'd send the editor an email explaining how they've reduced usability of the site; especially if you're a paying customer.

RicoElectrico3y ago· 1 in thread

Nice try, media company employee ;)

/jk

PTOB3y ago

My sentiments exactly.

riffic3y ago· 1 in thread

your browser usually downloads an entire article and certain elements are overlaid.

it's trivial to bypass most paywalls isn't it?

aidenn03y ago

Not for some (I think the Wall Street Journal). Apparently the AMP version of the page does work this way for WSJ though, which is how IA gets around the paywall.

armchairhacker3y ago

A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.

There are your readers who will see a paywall and then pay, and there are your readers who will try to bypass it or simply not read at all. And articles spread through social media attention, and a paywalled article gets much less attention, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it.

Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with or special-case the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.

alex_young3y ago

Not specifically related to archive.is, but news sites have a tightrope to walk.

They need to allow crawlers to access the full content of their articles so they can show up in search results, but they also want to restrict access via paywalls. They use 2 main methods to achieve this: JavaScript DOM manipulation and IP address rate limiting.

Conceivably one could build a system which directly accesses a given document one time from a unique IP address and then cache the HTML version of the page for further serving.
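A toy Python sketch of the "fetch once, then serve from cache" half of that idea. The `fetch` callable is a stand-in supplied by the caller, and picking a unique outgoing IP address per document is not modelled here at all.

```python
class SnapshotCache:
    """Fetch each URL at most once and serve every later request from memory."""

    def __init__(self, fetch):
        # fetch is any callable taking a URL and returning the page HTML.
        self._fetch = fetch
        self._snapshots = {}

    def get(self, url):
        if url not in self._snapshots:
            # First request for this URL: hit the origin site exactly once.
            self._snapshots[url] = self._fetch(url)
        return self._snapshots[url]


if __name__ == "__main__":
    origin_hits = []

    def fake_fetch(url):
        # Stand-in for a real HTTP fetch, so the demo runs offline.
        origin_hits.append(url)
        return f"<html>snapshot of {url}</html>"

    cache = SnapshotCache(fake_fetch)
    cache.get("https://example.com/story")
    cache.get("https://example.com/story")
    print(len(origin_hits))
```

From the publisher's side, this looks like a single visit per article no matter how many people read the cached copy.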

w1nst0nsm1th3y ago

If the people who know that tell you, they could lose access to said resources.

But it's kind of an open secret, you just don't look in the right place.

thallosaurus3y ago

I just tried it with a local newspaper. It did remove the floating pane, but it didn't unblur the page, and the text is also scrambled. (The site used to be far less protected; Firefox reader mode could easily bypass it.)

jrochkind13y ago

Every once in a while I _do_ get a retrieval from archive.is that has the paywall intact.

But I don't know the answer either.

firexcy3y ago

My hypothesis is that they use a set of generic methods (e.g., robot UA, transient cache, and JS filtering) and rely on user reports (they have a tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the bypasspaywallclean extension will give you a good idea of most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree where most of their audience are directed to pay and they have to leave backdoors here or there for purposes such as SEO.

w1nst0nsm1th3y ago

Follow the magnolia trail...

shipscode3y ago

What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot prior to JS paywall execution combined with the Google referrer trick or something along those lines.

not_your_vase3y ago

They use you, as a proxy. If you (who archives it) have access to the site (either because you paid or have free articles), they can archive it too. If you don't have access, they only archive a paywall.

mr-pink3y ago

every time you visit they force some kid in a third world country to answer captchas until they can pay for one article's worth of content

jwildeboer3y ago

It’s internet magic. <rainbowmagicsparkles.gif> ;)

jakedata3y ago

Alas it doesn't allow access to the comment section of the WSJ which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse and I won't send them my money so I will just have to imagine it.

fxtentacle3y ago· 10 in thread

panopticon3y ago

I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?

phoenixreader3y ago

Exactly. It also would not be difficult for website operators to embed hidden user info in their served pages, thereby finding out the archive.is account. This approach seems risky for archive.is.

hda1113y ago

They could just copy the div with the content over to evade detection by the website’s owner

tivert3y ago

I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.

Stagnant3y ago

1: https://archive.is/Pum1p
