I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.
I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?
I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.
They can somehow even bypass the Baltimore suns paywalls, and I doubt they have subscriptions to every regional paper, could they?
This is a plausible explanation but is it true?
Theoretically, they could just publish the list of IP ranges that canonically "belongs" to archive.is. That would allow websites to distinguish if a request identifying itself as archive.is is actually from them (it fits one of the IP ranges), or is a fraudster.
Amp pages are paywalled:
https://www.wsj.com/articles/freeze-or-cut-spending-fight-is... https://amp.wsj.com/articles/freeze-or-cut-spending-fight-is...
archive.is isn't: https://archive.md/LaiOX
It might be using other techniques as well for bypassing paywalls, be it referer/user-agent spoofing (some old archives of sites that echo back HTTP request headers have archive.is sending a Referer of google.co.uk).
In that case, the government is fine with it.
It might not be what is happening as there are other ways around, but this is a real possibility for how it could be done. (at least until the websites allowing other user agents decide they want to try to stop archive.is usage, etc)
edit: I think the probability is probably high that they have multiple methods for archiving a website. I think in this post, there are many people stating that they've previously stated they just convert the link to an AMP link and archive it. I'm more so doubtful that's all they do, but it could be it too.
Using the robots.txt file in this way might not be how the author's of the website intended for it to be used. I could see that maybe being used against them in a legal system if someone ever tried to stop them. In the past, I've seen websites state to people creating bots to purposefully change their user agent to one they defined, but, using it for a non-allowed purpose is what I was mentioning. Though, there are multiple ways they could be archiving a website, so this is not necessarily how it is being done.
This is my best guess for this. I’ve really put some thought into this, and this is the best logical assumption I’ve arrived to. I used to be a master of crawlers using Selenium years ago, but that burned me out a little bit so I moved on.
To test my hypothesis, you can go and find any article on Google that you know is probably paywalled. You click the content google shows you, and you navigate into the site, and “bam! Paywall!”..
If it has a paywall for me, well then how did Google crawl and index all the metadata for the SERP if it has a paywall?
I have a long running theory that Archive.is knows how to work around an SEO trick that Google uses to get the content. Websites like the Baltimore Sun don’t want humans to view their content for free, but they do want Googlebot to see it for free.
Google doesn't want everyone to know what a Google indexing request looks like for fear the CEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.
Or maybe they're okay with letting the archive index their content...
[1] https://developers.google.com/search/docs/crawling-indexing/... [2] https://www.bing.com/webmasters/help/which-crawlers-does-bin...
My guess is that most content providers with paywalls serve the entire content, so search engines can pick it up, and then use scripts to raise the paywall - archive.is takes their snapshot before that happens / doesn't trigger those scripts.
Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...
You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json
Just in the past day or so, sfgate.com put in some kind of anti scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with selenium or maybe I'll just give up on newspapers and get my news from HN ;).
I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.
/jk
it's trivial to bypass most paywalls isn't it?
There are your readers who will see a paywall and then pay, and there are your readers who will try to bypass it or simply not read at all. And articles spread through social media attention, and a paywalled article gets much less attention, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it.
Which is to say: the methods archiv.is uses may not be that special. Clear cookies, block JavaScript, and make deals with or special-case the few sites which actually enforce their paywalls. Or identify yourself as archiv.is, and if others do that to bypass the paywall, good for them.
They need to both allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use 2 main methods to achieve this: javascript DOM manipulation and IP address rate limiting.
Conceivably one could build a system which directly accesses a given document one time from a unique IP address and then cache the HTML version of the page for further serving.
But it's kind of an open secret, you just don't look in the right place.
But I don't know the answer either.