1. If I as a human request a website, then I should be shown the content. Everyone agrees.
2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice, and the website should not be notified about it. Most users agree; some websites try to nag you into modifying the software you run locally.
3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
Their reasons vary. Some don’t want their business’s perception of quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve and build relationships with customers directly, instead of disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.
I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.
The line is drawn for me on my own computer. Even if I am in your building, my phone remains mine.
First time hearing this. Almost every single grocery store either supports Instacart or has a partnership with a similar service.
The next step in your progression here might be:
If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
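Part of that tipping point is just a rate question: a personal bot can stay on the polite side with a per-host delay. Here's a minimal sketch of that idea (the class name and the 10-second delay are illustrative, not from any real crawler; time is passed in explicitly so the logic is easy to follow and test):

```python
# A hypothetical politeness throttle for a personal research bot:
# at most one request per host every `delay` seconds.
class PerHostThrottle:
    def __init__(self, delay: float = 10.0):
        self.delay = delay
        self.last: dict[str, float] = {}  # host -> time of last request

    def wait_time(self, host: str, now: float) -> float:
        """Seconds to wait before the next request to `host` is polite."""
        prev = self.last.get(host)
        if prev is None:
            return 0.0
        return max(0.0, self.delay - (now - prev))

    def record(self, host: str, now: float) -> None:
        """Note that a request to `host` was just sent."""
        self.last[host] = now

t = PerHostThrottle(delay=10.0)
t.record("example.com", now=0.0)
print(t.wait_time("example.com", now=3.0))  # 7.0 - too soon, back off
print(t.wait_time("other.org", now=3.0))    # 0.0 - fresh host, go ahead
```

A real bot would `time.sleep(wait_time(...))` before each fetch; the point is just that "faster than humans" is a knob, not an inevitability.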
But of course, most website publishers would hate that. Because they don't want people to access their content, they want people to look at the ads that pay them. That's why, to them, the AI crawling their website is akin to stealing. Because it's taking away some of their ad impressions.
Seems like a reasonable stance would be something like "Following the no crawl directive is especially necessary when navigating websites faster than humans can."
> What if it gets a bit smarter and tries to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?
To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click them.
But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.
You're not the only stakeholder in any of those interactions. There's you, a mediator (search or LLM), and the website owner.
The website owner (or its users) basically do all the work and provide all the value. They produce the content and carry the costs and risks.
The pre-LLM "deal" was that at least some traffic was sent their way, which helps with reach and attempts at monetization. This too is largely a broken and asymmetrical deal where the search engine holds all the cards, but it's better than nothing.
A full LLM experience that no longer sends traffic to websites means there's zero incentive to have a website in the first place - or every incentive to put it behind a login.
I get that users prefer an uncluttered direct answer over manually scanning a puzzling web. But the entire reason that the web is so frustrating is that visitors don't want to pay for anything.
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.
I think the business model for “content creating” is going to have to change, for better or worse (a lot of YouTube stars are annoying as hell, but sure, stuff like well-written news and educational articles falls under this umbrella as well, so it is unfortunate that they will probably be impacted too).
This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.
If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.
Do people (generally) put things online to get money or because they want it online? And is "free" data worse quality than data you have to pay somebody for (or is the challenge more one of curation: when anyone can put anything up for free, sorting high- and low-quality based on whatever criteria becomes a new kind of challenge?).
Jury's out on these questions, I think.
E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.
Ultimately the root issue is that copyright is inherently flawed because it tries to increase available useful information by restricting availability. We'd be better off not pretending that information is scarce and looking for alternatives to fund its creation.
The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.
Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.
On the flip side, the crawlers also have a reputational risk here: anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup of the IPs, but that's a pain for website owners too, versus more aggressively blocking all unusual setups.
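For context, the reverse DNS check works by forward-confirming: PTR lookup on the IP, suffix check on the hostname, then a forward lookup to confirm it maps back to the same IP. Here's a rough sketch with the lookups injected so it runs offline; a real implementation would use the default `socket.gethostbyaddr` / `socket.gethostbyname` paths, and the crawler domain below is made up:

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward=socket.gethostbyname):
    """Forward-confirmed reverse DNS: the PTR record must be in the
    crawler's domain, and resolving that hostname must map back to the
    same IP (otherwise anyone could publish a flattering PTR record)."""
    try:
        host = reverse(ip)
        if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
            return False  # hostname is outside the crawler's domain
        return forward(host) == ip
    except OSError:
        return False

# Offline demo with stubbed DNS (hypothetical crawler domain):
def fake_reverse(ip):
    return {"203.0.113.7": "crawl-7.example-bot.com",
            "203.0.113.8": "spoofed.attacker.net"}[ip]

def fake_forward(host):
    return {"crawl-7.example-bot.com": "203.0.113.7"}.get(host, "0.0.0.0")

print(verify_crawler_ip("203.0.113.7", ["example-bot.com"],
                        reverse=fake_reverse, forward=fake_forward))  # True
print(verify_crawler_ip("203.0.113.8", ["example-bot.com"],
                        reverse=fake_reverse, forward=fake_forward))  # False
```

The pain the parent mentions is real: you have to do this per-request (or cache it), and plenty of "crawlers" have no published verification domain at all.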
That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.
B/ my brother used to use "fetcher" as a non-swear for "fucker"
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/fine-tune LLMs (as a side channel when people block crawling for training)?
Perplexity's "web crawler" is mostly operating like this on behalf of users, so they don't need a massively expensive computer to run an LLM.
It is confusing.
The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.
So far, humans themselves did this, but only for relatively simple information (the ratio of rice to water in a specific $recipe). You're not gonna send a link to your friend just to see the ratio; you probably remember it off the top of your head.
Unfortunately, the top of an LLM's head is pretty big, and for most websites they are fitting almost the entire site's content in there.
The threshold beyond which information becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonableness" test that refers to this same concept), has now shifted many, many times higher.
Now, IMO:
So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That solves the ethical problem pretty much. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in their own time and with their own effort. I promise most people won't be waiting. Maybe google can charge per LLM citation. Cost Per Citation, you even maintain the acronym :)
The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.
LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originating data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content - LLM AI outfits.
IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for llm digestion purposes.
EDIT: Apparently llms.txt has been suggested, but from what I can tell it isn't about restricting access: https://llmstxt.org/
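As a concrete reference point, robots.txt can already express a per-crawler training opt-out today, and Python's stdlib can evaluate it. A quick sketch (GPTBot and CCBot are real published crawler user agents; example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of two training crawlers entirely,
# while only walling off /private/ for everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))         # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))    # True
print(rp.can_fetch("Mozilla/5.0", "https://example.com/private/x"))  # False
```

Of course, this is exactly the mechanism the thread says is being ignored; it only restrains crawlers that choose to be restrained, which is why people are calling for a legal backstop on top of it.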
Mojeek LLM (https://www.mojeek.com) uses citations.
I think one thing to ask outside of this question is how long it will be before your LLM summaries also include ads and other manipulative patterns.
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is (and I now suspect it is Perplexity based on this Cloudflare post), they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; it's the army-of-minions aspect that's the problem.)
Yes, Orbit, a now-deprecated Firefox extension by Mozilla, was doing that. This way you could also use it to summarise content that would not be available to a third party (e.g. something in Google Docs).
You can still sort of do the same with the AI chatbot panel in Firefox: Ctrl+A > right click > AI chatbot > summarise.
Perplexity is not visiting a website every time a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.
This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to the crawled websites.
In fact, the "old web" people sometimes pine for was mostly a place where people were putting things online so they were online, not because it would translate directly to money.
Perhaps AI crawlers are a harbinger for the death of the web 2.0 pay-for-info model... And perhaps that's okay.
> Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol, and won't check your robots.txt file.
In 3, although you do not specify, I assume you mean that a bot requests the page, as opposed to you visiting the page as in scenario 2 and an LLM then processing the downloaded data (similarly to an adblocker). The former case is the problem; the latter is much harder to stop, and there is much less reason to stop it.
This is the distinction: is a human present at the time of request.
Perplexity is choosing to come back "on a VPN" with new IP addresses to evade the block.
#2 and #3 are about modifying data where access has been granted, I think Cloudflare is really complaining about #1.
Evading an IP address ban doesn't violate my principles in some cases, and does in others.
It’s talking about Perplexity crawling sites on demand in response to user queries and then complaining that no it’s not fine, hence this thread.
I have a dinky little txt site on my email domain. There is nothing of value on it, and the content changes less than once a year. So why are AI scrapers hitting it to the tune of dozens of GB per month?
At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).
Proprietary web browsers are in a really good position to do something like this, especially if they offer a free VPN. The browser would connect to the "VPN servers", but it would be just to signal that this browser instance has an internet connection, while the requests are just proxied through another browser user.
That way the company that owns this browser gets a free network of residential IP address ready to make requests (in background) using a real web browser instance. If one of those background requests requires a CAPTCHA, they can just show it to the real user, e.g. the real user visits a Google page and they see a Cloudflare CAPTCHA, but that CAPTCHA is actually from one of the background requests (while lying in its UI and still showing the user a Google URL in the address bar).
Definitely don't agree. I don't think you should be shown the content, if for example:
1. You're in a country the site owner doesn't want to do business in.
2. You've installed an ad blocker or other tool that the site owner doesn't want you to use.
3. The site owner has otherwise identified you as someone they don't want visiting their site.
You are welcome to try to fool them into giving you the content but it's not your right to get it.
I disagree. The website should have the right to say that the user can be shown the content under specific conditions (usage terms, presented how they designed, shown with ads, etc). If the software can't comply with those terms, then the human shouldn't be shown the content. Both parties did not agree in good faith.
Because the website has every right to block you or refuse you access if you do that - just like an establishment has the right to refuse you entry if you try to enter without a shirt - if you're denying them the revenue they predicated your access on.
Similarly, if you're using a user agent the website doesn't like, they have the right to block you, or to take action against that user agent to prevent it from existing if they can't reliably detect it to block it.
Because the LLM is usually on a 3rd party cloud system and ultimately not under your full control. You have no idea if the LLM is retaining any of that information for that business's own purposes beyond what a EULA says - which basically amounts to a pinky swear here. Especially if that LLM is located across international borders.
Now, for something like Ollama or LMStudio where the LLM and the whole toolchain is physically on your own system? Yeah that should be like Firefox legally since it's under your control.
I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.
If I put time and effort into a website and its content, I should expect no compensation despite bearing all the costs.
Is that something everyone would agree with?
The internet should be entirely behind paywalls, besides content that is already provided ad free.
Is that something everyone would agree with?
I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"
the answer is apparently "no", and I don't really see how recipe books have suffered as a result of less gatekeeping.
"How will the internet work?" Probably better in some ways. There is plenty of valuable content on the internet given away for free; it's just being buried in low-value AI slop.
Because quantity has a quality of its own.
I say this as someone who is on the side of the local user commanding how local compute works, but I understand why companies are reacting to how cheap LLMs are making information discovery against their own datasets.
And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.
We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.
One thing that comes to my mind is: if a human tries to answer a question via the web, they will browse one site after the other.
If that human asks an LLM, it will ping 25 sites in parallel.
Scale this up to all of humanity, and it should be expected that internet traffic will rise 25x - just from humans manually asking questions every now and then - we are not even talking about AI companies actively crawling the web.
That means webmasters will have to figure out aggressive caching and let CDNs deal with the problem, or put everything behind a login screen (which might also just be a temporary fix).
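The aggressive caching mentioned above is mostly just conditional requests: hand out a strong validator and answer revalidations with a 304 so repeat fetches (human or bot) cost almost nothing. A framework-agnostic sketch - the `respond` function and plain header dicts are illustrative, not any specific server's API:

```python
import hashlib

def respond(content: bytes, request_headers: dict):
    """Serve content with an ETag and Cache-Control, answering a matching
    If-None-Match with 304 Not Modified and an empty body."""
    etag = '"%s"' % hashlib.sha256(content).hexdigest()[:16]
    headers = {
        "ETag": etag,
        # Let CDNs and browsers reuse the response for a day.
        "Cache-Control": "public, max-age=86400",
    }
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""  # nothing to resend
    return 200, headers, content

status, hdrs, body = respond(b"<html>hello</html>", {})
status2, _, body2 = respond(b"<html>hello</html>",
                            {"If-None-Match": hdrs["ETag"]})
print(status, status2)  # 200 304
```

This only helps against crawlers that actually send the conditional headers, which is exactly the complaint elsewhere in the thread - hence the CDN-or-login fallback.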
I think the line is drawn at "on my behalf". The silent agreement of the web is that humans are served content via a browser, and robots are obeying rules. All we need to support this status quo is to perform data processing by ML models on a client's side, in the browser, the same way we rip out ads.
This is well defined in specs and ToS - not quite a gray area.
The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?
If you as a human spam the shit out of my website and waste my resources, I will block you.
If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.
If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.
Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web-crawler-backed index that refreshes at a respectful rate, identifies itself accurately as a robot, and follows robots.txt). Even if it needed to make an instantaneous request for you at the time of a pertinent query, as long as it still identified itself as a bot and was still respectful, there would be no problem.
The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, what the site owner wants to keep you away from. They download massive static files hundreds or thousands of times a day, not even doing a HEAD to see that the file hasn't changed in 12 years. They straight up ignore robots.txt and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I best go for his kneecaps because he's weak there"
There are plenty of open crawler datasets, they should be using those... but they don't, they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.
The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and they're now deliberately evading checks, to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin; there's so much money flowing in the market for AI, that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their human behaviour destroys the entire internet.
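The etiquette the comment above says is missing (revalidate instead of re-download) is cheap to implement on the crawler side. A sketch with the transport injected so it runs offline; a real crawler would plug in urllib or httpx here, and pair this with robots.txt checks and rate limits:

```python
def polite_fetch(url, cache, transport):
    """Conditional GET: send the validators saved from the last response
    and reuse the cached body on 304 instead of re-downloading.
    `transport(url, headers)` -> (status, resp_headers, body)."""
    req_headers = {}
    if url in cache:
        etag, last_modified, _ = cache[url]
        if etag:
            req_headers["If-None-Match"] = etag
        if last_modified:
            req_headers["If-Modified-Since"] = last_modified
    status, resp_headers, body = transport(url, req_headers)
    if status == 304:
        return cache[url][2]  # unchanged - reuse cached copy
    cache[url] = (resp_headers.get("ETag"),
                  resp_headers.get("Last-Modified"), body)
    return body

# Offline demo: a fake origin that honors If-None-Match.
def fake_origin(url, headers):
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, b""
    return 200, {"ETag": '"v1"'}, b"big static file"

cache = {}
first = polite_fetch("https://example.com/big.bin", cache, fake_origin)
second = polite_fetch("https://example.com/big.bin", cache, fake_origin)
print(first == second)  # True - second fetch cost a 304, not a re-download
```

Nothing here is exotic; it's twenty lines that most of the crawlers being complained about apparently don't bother to write.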
is it just on your behalf? or is it on Perplexity's behalf? are they not archiving the pages to train on?
it's the difference between using Google Chrome vs. Chrome beaming full page snapshots to train Gemini on.
It is illegal to copy stuff from the internet and then make it available from your own servers, especially when those sources have expressly asked you not to do it.
The web admin should be able to block usages 1, 2 or 3 at their discretion. It's their website.
Similarly the user is free to try to engage via 1, 2, 3, or refuse to interact with the website entirely.
2. This also intersects with copyright law. Ingesting content to your servers en masse through automation and transforming it there is not the same as giving people a tool (like Safari Reader) they can run on their client for specific sites they visit. Examples of companies that lost court cases about this:
Aereo, Inc. v. American Broadcasting Companies (2014)
TVEyes, Inc. v. Fox News Network, LLC (2018)
UMG Recordings, Inc. v. MP3.com, Inc. (2000)
Capitol Records, LLC v. ReDigi Inc. (2018)
Cartoon Network v. CSC Holdings (Cablevision) (2008)
Image Search Engines: Perfect 10 v. Google (2007)
That last one is very instructive. Caching thumbnails and previews may be OK. The rest is not. AMP is in a copyright grey area, because publishers choose to make their content available for AMP companies to redisplay. (@tptacek may have more on this)

3. Putting copyright law aside, that's the point. Decentralization vs Centralization. If a bunch of people want to come eat at an all-you-can-eat buffet, they can, because we know they have limited appetites. If you bring a giant truck and load up all the food from all the all-you-can-eat buffets in the city, that's not OK, even if you later give the food away to homeless people for free. You're going to bankrupt the restaurants! https://xkcd.com/1499/
So no. The difference is that people have come to expect "free" for everything, and this is how we got into ad-supported platforms that dominate our lives.
I expect a lot more paywalls for valuable content. General information is commoditized and offered in aggregated form through models. But when an AI is fetching information for you from a website, the publisher is still paying the cost of producing that content and hosting that content. The AI models are increasing the cost of hosting the content and then they are also removing the value of producing the content since you are just essentially offering value to the AI model. The user never sees your site.
I know ads are unpopular here, but the truth is that this is how publishers were compensated for your attention. When an AI model views the information that a publisher produces, modifies it from its published form, and removes all ad content, you now have increased costs for producers, reduced compensation for producing content (since they are not getting ad traffic), and content that isn't even delivered in its original form.
The end result is that publishers now have to paywall their content.
Maybe an interesting middle-ground is if the AI Model companies compensated for content that they access similar to how Spotify compensates for plays of music. So if an AI model uses information from your site, they pay that publisher a fraction of a cent. People pay the AI models, and the AI models distribute that to the producers of content that feed and add value to the models.
Would that solve the whole problem? Folks who ran webservers declared what they consent to, and that happens?
I think it's useful to just see if there's a consensus on that: actually making that happen is a whole can of worms itself, but it's strictly simpler than devising a good outcome without the consensus.
(And such things are not impossible, merely difficult; we have other systems, ranging from BGP to the TLD mechanism, that get honored in real life.)
They are running a service and this is not a user taking steps to modify their own content for their own use.
Perplexity is not acting as a user proxy and they need to learn to stick to the rules, even when it interferes with their business model.
The problem is that those in the position to do that are not interested.
The goal is not to make the content usable. The goal is to get the traffic.
When advertising alone is the "business model", e.g., not the value of the "content", then even Cloudflare is going to try to protect it (the advertising, not the content). Anything to get www users to turn on JavaScript so the surveillance capitalism can proceed. Hence all the "challenges" to frustrate and filter out software that is not advertising-friendly, e.g., graphical.
Cloudflare's ruminations on user-agent strings are perplexing. It has been an expectation that the user-agent HTTP header will be spoofed since the earliest web browsers. The user-agent header is a joke.
This is from circa 1993, the year the www was opened to public access:
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
Cloudflare's "bot protections" are not to ensure human use of a website but to ensure use of specific software to access a website. Software that facilitates data collection and advertising services. For example, advertising-sponsored browsers. Any other software is labeled "bot". It does not matter if a human is operating it.
2. the end
I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.
There should exist an internet for humans only, with a single account per domain.
People are usually fine with the latter but not the former, even though they come down to the same thing.
I think this is because people don't want LLMs to train on their content, and they don't differentiate between accessing a website to show it to the user, versus accessing it to train.
Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.
Also because ultimately, by intermediating the request, LLM companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that readers may eventually subscribe. When an LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.
Do you -really- want that much abstraction?
There's a bunch of nerds and capitalists about to rediscover GIGO.
Therefore artifices like "you don't have the right to view this website without ads" or "you can't use your phone, computer, or LLM to download or process this outside of my terms because copyright" become possible, institutionalizable, enforceable, and eventually unbypassable by technology.
If we reverted to the Constitutional purpose of copyright (to promote the Progress of Science and useful Arts), then things might be more free. That's probably not happening in my lifetime or yours.
Do you think you do?
Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?
If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?
That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).
And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.
It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.
The reason people are up in arms is because rights they previously enjoyed are being stripped away by the current platforms. The content owner's rights aren't as valid as my own in the current world; they trump mine 10 to 1. If I "buy" a song and the content owner decides that my country is politically unfriendly, they just delete it and don't refund me. If I request to view their content and they start by wasting my bandwidth sending me an ad I haven't consented to, how can I even "not engage"? The damage is done, and there's no recourse.
If you believe in this principle, fair enough, but are you going to apply this consistently? If it's fair game for a blog to restrict access to AI agents, what does that mean for other user agents that companies disagree with, like browsers with adblock? Does it just boil down to "it's okay if a person does it but not okay if a big evil corporation does it?"
That's... less conclusive than I'd like to see, especially for a content-marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.
Like most AI companies, Perplexity has established user agent strings for both these cases, and the behavior that Cloudflare is calling out does not use either. It pretends to be a person using Chrome on MacOS.
They do end up looking bad in Cloudflare's report, with Cloudflare as the "good guys" of this story (btw, Cloudflare has been very pushy lately with their we'll-save-the-web, "content independence day" marketspeak). But deep in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're the culprit, since there's no real evidence in the OP, only hearsay). Both companies look like titans fighting, which ends up being positive for Perplexity, at least by inflating the perception of their firepower... if that makes any sense.
Crawling is legal. Training is presumably legal. Long may the little guys do both.
And very likely Perplexity is in fact using a Chrome-compatible engine to render the page.
That's basically how many crowdsourced crawling/archive projects work. For instance, sci-hub and RECAP[1]. Do you think they should be shut down as well? In both cases there's even a stronger justification to shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.
No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.
We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.
It's also a gift to your competitors.
You're certainly free to do it. It's just a really faint example of you being "in control" much less winning over LLM agents: Ok, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company who will never notice.
But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.
But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?
>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”
Either way, the CDNs profit big time from the AI scraping hype and the current copyright anarchy in the US.
There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
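A hypothetical sketch of the human-in-the-loop workflow described above: pages that hit a captcha get parked on a queue for human operators to drain during a staffed window, instead of failing the scrape outright. The business-hours window and function names here are invented for illustration.

```python
import queue
import datetime

BUSINESS_HOURS = range(9, 17)  # 09:00-17:00, an assumed staffing window

challenges = queue.Queue()

def handle_page(html: str):
    """Park captcha pages for a human; pass everything else through."""
    if "captcha" in html.lower():
        challenges.put(html)  # defer to a human operator
        return None           # the scraper moves on for now
    return html

def staff_available(now=None) -> bool:
    """Weekday and inside the staffed window?"""
    now = now or datetime.datetime.now()
    return now.weekday() < 5 and now.hour in BUSINESS_HOURS
```

The point of the sketch is only that a captcha stops being a hard block once the automation is allowed to wait for a person.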
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
no, because we'll end up with remote attestation needed to access any site of value
This case (“go research this subject for me”) is the grey area here. It’s not the same as simple scraping or search indexing, it’s a new activity that is similar in some ways.
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
Edit: trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers. CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.
Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" your content) stuff is unethical bullshit.
The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.
god help us if they ever manage to build anything more than shitty chatbots
LLM programs do not have human rights.
Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the "Perplexity-User" user agent doesn't actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
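Why a user-agent block fails once the client stops announcing itself can be shown in a few lines. The declared agent names below come from Perplexity's public docs; the stealth string is the kind of generic Chrome-on-macOS UA the article describes, and the matching logic is just an illustrative sketch of what most UA-based blocks do.

```python
# Names Perplexity documents for its crawler and its user-requested fetcher:
BLOCKED_AGENTS = ("PerplexityBot", "Perplexity-User")

def is_blocked(user_agent: str) -> bool:
    """A typical UA-based block: substring match on declared names."""
    return any(name in user_agent for name in BLOCKED_AGENTS)

declared = "Perplexity-User/1.0 (+https://perplexity.ai/perplexity-user)"
stealth = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
```

The declared agent gets caught; the generic Chrome string sails through, which is exactly the behavior Cloudflare is complaining about.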
if I am willing to pay a penny a page, i and the people like me won't have to put up with clickwrap nonsense
free access doesn't have to be shut off (ok, it will be, but it doesn't have to be, and doesn't that tell you something?)
reddit could charge stiffer fees, but refund quality content to encourage better content. i've fantasized about ideas like "you pay upfront a deposit; you get banned, you lose your deposit; withdraw, have your deposit back", the goal being to simplify the moderation task while encouraging quality.
because where the internet is headed is just more and more trash.
here's another idea, pay a penny per search at google/search engine of choice. if you don't like the results, you can take the penny back. google's ai can figure out how to please you. if the pennies don't keep coming in, they serve you ad-infested results; serve up ad-infested results, you can send your penny to a different search engine.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it was your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
By the way, I do not use my websites as a revenue stream. This isn't about money.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
For me, the dividing line is whether someone else's profit is at my expense. If I sell a book, and someone starts hawking cheaper photocopies of it, that takes away my future sales. It's at my expense, and I'm harmed.
But if someone takes my book's story and writes song lyrics derived from it, I might feel a little envy (perhaps I've always wanted to be a songwriter), but I don't think I'd harbor ill will. I might even hope for the song to be successful, as it would surely drive further sales of my book.
It's human nature to covet someone else's success, but the fact is there was nothing stopping me (except talent) from writing the song.
Imagine someone at another company reads your site, and it informs a strategic decision they make at the company to make money around the niche activity you're talking about. And they make lots of money they wouldn't have otherwise. That's totally legal and totally ethical as well.
The reality is, if you do hard work and make the results public, well you've made them public. People and corporations are free to profit off the facts you've made public, and they should be. There are certain limited copyright protections (they can't sell large swathes of your words verbatim), but that's all.
So the idea that you don't want companies to profit from your hard work is unreasonable, if you make it public. If you don't want that to happen, don't make anything public.
"It was actually a caching issue on our end. ;) I just fixed it a few min ago..."
Let's not go on a witch hunt and blame everything on AI scrapers.
No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: common crawl and other data dumps are sufficiently large, not to mention easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.
Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.
The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.
If you want to gatekeep your content, use authentication.
Robots.txt is not a technical solution, it's a social nicety.
Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.
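The "differential content loading with offline caching" idea maps roughly onto plain HTTP conditional GETs: the server derives an ETag from a content hash and answers 304 when the client's cached copy is still current, so the client keeps serving its offline copy. The `respond()` function below is a hypothetical sketch, not any real framework's API.

```python
import hashlib

def respond(content, if_none_match=None):
    """Conditional GET sketch: hash-derived ETag, 304 when unchanged."""
    etag = '"%s"' % hashlib.sha256(content).hexdigest()[:16]
    if if_none_match == etag:
        return 304, etag, b""   # client re-uses its cached/offline copy
    return 200, etag, content   # full transfer only when content changed
```

This does shift control toward the client's cache, which is part of why the comment argues such mechanisms are unpalatable to parties that profit from per-request tracking.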
I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.
> If you want to gatekeep your content, use authentication.
Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?
> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?
They allow the big platforms to pay for special access. If you wanted to run a scraper, however, you're not allowed, despite nothing in internet standards and protocols, nor in the laws governing network access and the free-communication responsibilities of ISPs and service providers, granting any party involved with Cloudflare the authority to block access.
It's equivalent to a private company deciding who, when, and how you can call from your phone, based on the interests and payments of people who profit from listening to your calls. What we have is not normal or good, unless you're exploiting the users of websites for profit and influence.
No one will care to share anything for free anymore, because it's AI companies profiting off their hard work. And there's no way to prevent that from happening, because these crawlers don't identify themselves.
(IANAL) tortious interference
Then when they asked Perplexity, it came up with details about the "exact" content (according to Cloudflare), but their attached screenshot shows the opposite: some generic guesses about the domain ownership and some dynamic ads based on the domain name.
If Perplexity was stealthily visiting the dummy site they would have seen it, as the site was not indexed and no one else was visiting the site. Instead it appears they made assertions about general traffic, not their dummy site.
It's not very convincing.
> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
Under this situation Perplexity should still be permitted to access information on the page they link to.
robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
— https://www.robotstxt.org/faq/what.html
If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.
These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.
There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.
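The recursive-fetch semantics described above can be checked with Python's stdlib robots.txt parser. A `robots.txt` like the one in Cloudflare's experiment (disallow everything for every agent) blocks any honoring crawler, but, per the protocol, a one-shot user-directed fetch (a browser, or `wget` without `-r`) never consults the file at all.

```python
from urllib.robotparser import RobotFileParser

# The kind of blanket-disallow robots.txt Cloudflare describes deploying:
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# A crawler honoring the protocol checks this before each *recursive* fetch:
allowed = rp.can_fetch("AnyBot", "https://example.com/secret-page")
```

So whether Perplexity misbehaved hinges exactly on the question in the comment: did it fetch only the linked page, or follow links beyond it?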
If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
Right, and the domain was configured to disallow crawlers, but Perplexity crawled it anyway. I am really struggling to see how this is hard to understand. If you mean to say "I don't think there is anything wrong with ignoring robots.txt" then just say that. Don't pretend they didn't make it clear what they're objecting to, because they spell it out repeatedly.
If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.
But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.
If it is not recursive access, and is only one file, then it hopefully should be OK (except for issues with HTML, where common browsers will usually also download CSS, JavaScript, WebAssembly, pictures, favicons (even if the web page does not declare any), etc.; many "small web" formats deliberately avoid this), especially if it is used only because you requested it.
However, if they do then use it to train their model, without documenting that, that can be a problem, especially if the file being accessed is not intended to be public; but this is a different issue than the above.
If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?
There is a difference between doing a poor summarization of data, and failing to even be able to get the data to summarize in the first place.
That's not what Perplexity's own documentation[1] says, though:
"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity
Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."
> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.
The web will be a much worse place if such services are all forced behind captchas or logins.
you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:
"sir, have you been botting today?"
immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command
"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile
"yeah... im gonna need you to exit the vehicle and take a field humanity test"
``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.
\>be me, 2045 \>32yo NEET still living in mom's smart-basement \>she threatens to turn off my Neuralink if I don't go outside today \>grudgingly take the robotaxi to pick up tendies \>$150 stablecoin toll at cloudflare checkpoint \>have to solve a moral dilemma to prove I'm not running on GPT-7 \>sweating.dll \>palantir unit pulls me over \>optimus v4 leans into window \>its facial mesh attempts "concern_expression_v2.blend" \>"sir, when did you last feel genuine human connection?" \>flashback to 2024 when the girl at McDonalds gave me extra honey mustard \>that was before the McBots took over \>"t-twenty one years ago officer" \>optimus's empathy subroutines activate \>"sir I need you to perform a field humanity test" \>get out, knees weak from vitamin D deficiency \>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'" \>brain.exe has stopped responding \>try to remember pre-blackpill emotions \>"someone who... likes anime?" \>optimus scans my biometrics \>"stress patterns indicate authentic social anxiety, carry on citizen" \>get back in robotaxi \>it starts therapy session \>"I notice you ordered tendies again. Let's explore your relationship with your mother" \>tfw the car has better emotional intelligence than me \>finally get tendies from Wendy's AutoServ \>receipt prints with mandatory "rate your humanity score today" \>3.2/10 \>at least I'm improving
\>mfw bots are better at being human than humans \>it's over for carboncels ```
We learned to dislike "bubbles" in the past decades but bubbles make sense and are natural, obviously if you're not alone in it.
When it becomes awfully busy with machines and machine content humans will learn to reconnect.
Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.
I don't really mind because history shows this is a temporary thing, but I hope web site maintainers have a plan B to hoping Cloudflare will protect them from AI forever. Whoever has an onramp for people who run websites today to make money from AI will make a lot of money.
I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume blood-soaked) cookies.
I don't really know anything about DRM except that it is used to take down sites that violate it. Perhaps it is possible for Cloudflare (or anyone else) to file a takedown notice with Perplexity. That might at least confuse them.
Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.
Which makes it particularly interesting now that Apple is being linked with Perplexity. Because in large part p2p music services were effectively consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for their shiny new (at the time) mass-market Apple iPod devices: it then turning out that most users are happy to pay for content if it is not too expensive and is very convenient.
Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.
They offer many products for the sole purpose of enabling their customers to use AI as a part of their product offers, as even the most cursory inquiry would have uncovered.
We're out here critiquing shit based on vibes vs. reality now.
[1]https://developers.cloudflare.com/llms.txt [2]https://developers.cloudflare.com/workers/prompt.txt
It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.
I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.
I've given up and resorted to IP-based rate limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
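The IP-based rate limiting mentioned above is commonly a per-IP token bucket. Here's a minimal sketch (the rate and burst numbers are made up, and a real deployment would usually do this at the nginx/iptables/CDN layer rather than in application code):

```python
import time
from collections import defaultdict

class IpRateLimiter:
    """One token bucket per client IP: steady refill, bounded burst."""

    def __init__(self, rate_per_sec=5.0, burst=20.0):
        self.rate = rate_per_sec
        self.burst = burst
        # each bucket: (tokens remaining, timestamp of last update)
        self._buckets = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, ip: str) -> bool:
        tokens, last = self._buckets[ip]
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        ok = tokens >= 1.0
        self._buckets[ip] = (tokens - 1.0 if ok else tokens, now)
        return ok
```

This matches the commenter's trade-off: a misbehaving scraper still gets some pages, but can no longer saturate egress from a single address.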
Much like a trolley drop-off at your local shopping center car park. Some users will adhere to it and drop their trolleys in after they're done. Others will not and will leave them wherever.
Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.
Don't like it? Then don't post your website on the internet...
No thanks, you don't counter shit with more but slightly different shit.
the service is actually very convenient, whether faang likes it or not.
Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.
It is also only a matter of time before scrapers once again get through the walls put up by Twitter, Reddit, and the like. This is, after all, information everyone produced without being aware that it would one day be considered not theirs anymore.
He went on, upfront (I'd give him that), to explain how he expects a certain percentage of that income to come from enforcing this on those AI companies, when the AI companies pay up to crawl.
Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.
Where have we all come on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?
Now, it's a gazillion AI crawlers and Python scrapers, plus MCP servers that offer the same feature to anyone "building (personal workflow) automation", including bypassing various standard protection mechanisms.
Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, whether it's done by AI or by humans.
It seems to me that ~50% of the discourse occurring around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDOS attack. The other half seems to be regarding IP and capitalism concerns - which seem like far more viable arguments.
If someone requesting your site map once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200 µs and your competition takes 200 ms to do it, you have roughly 1000x the capacity to mitigate an aggressive scraper (or actual DDoS attack) in terms of CPU time.
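The capacity claim is just a ratio of per-request CPU time; the numbers below are the comment's illustrative figures, not a benchmark:

```python
# Per-request CPU time for the hypothetical fast and slow servers:
fast_page = 200e-6   # 200 µs per request
slow_page = 200e-3   # 200 ms per request

capacity_ratio = slow_page / fast_page   # headroom factor vs the slow site
fast_rps_per_core = 1 / fast_page        # requests/sec one core can serve
slow_rps_per_core = 1 / slow_page
```

One core of the fast server absorbs on its own what would take the slow server roughly a thousand cores, which is the whole argument for speed as DDoS mitigation.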
Don't you remember the reason Anubis even came to be?
It really wasn't that long ago, so I find all of the snarky comments going "erm, actually, I've yet to see any good actors get harmed by scraping ever, we're just reclaiming power from today's modern ad-ridden hellscape" pretty dishonest.
and see if that fixes the problem.
what machine learning algorithms are they using? time to deploy them onto our websites
Except when their agents happily click the "I'm not a robot" checkbox.
Yikes. AntiVirus scanners for website access.
What if robots.txt should refer to the ultimate beneficiaries... one which in this case would be the AI product that uses that content... to serve another ultimate beneficiary, a human user.
The problem here is obviously the higher prices for hosting the content, and less revenue for those that serve ads, have product placement on their sites, etc.
As long as robots.txt is about ethics/money and is enforced by morality, it doesn't matter who it refers to anyway.
Public-shaming enforcement might work in some cases though, but I doubt it will be that useful. We're talking about companies that have trained their AIs on IPs, and tried their best to later hide it. Does shame affect robots, or companies for that matter?
Cloudflare would very much like to be the middleman for monetary transactions between AI services and site owners (https://blog.cloudflare.com/introducing-pay-per-crawl/), but at the moment they don't have a law to back them up, so articles like these are the best they've got.
If you don't want to get scraped, don't put your stuff online.
Should curl be considered a bot too? What's the difference?
I had to check that this did come out of CloudFlare.
robots txt file size is usually small and would not alert external services.
If you don't understand why this is at least slightly controversial, I imagine you are not normal folk.
now suddenly everything on the net is fake, if it wasn't already
It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.
So your comparison is naive at best (assuming good intentions), or malicious if not.
AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.
People can believe that corporations are using the power asymmetry between them and individuals through copyright law to stifle the individual to protect profits. People can also believe that corporations are using the power asymmetry between them and individuals through AI to steal intellectual labor done by individuals to protect their profits. People's position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.
Regardless, the bigger AI problem is spam, and that has never been acceptable.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
In this case, Perplexity had a human being using it. Perplexity wasn’t crawling the site, Perplexity was being operated by a human working for Cloudflare.
Put your valuable content behind a paywall.
This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and untrusted, but thanks to companies like Cloudflare, even the most benign, well-meaning attempt to GET a website is now met with a brick wall. The bots of Big Tech (namely Google, Meta, and Apple) are of course exempt from this by pretty much every website and by Cloudflare, but try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also, the "perplexity bots" aren't crawling websites; they fetch URLs that users explicitly asked for. This shouldn't count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page, basically a shortcut for copy/pasting the content.
Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.
And the thing is... people already pay for internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.
I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.
It's also easy to ratelimit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 mbps, the ISP will stall out the request for 8 seconds, and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
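The stall-out arithmetic in that last step checks out; a tiny sketch with the comment's own numbers (a 100 MB "price tag" on a 100 mbps subscriber line):

```python
# Price tag counted as bytes of quota, link speed in bits per second:
price_bytes = 100 * 10**6         # 100 MB "price" attached to the request
link_bits_per_sec = 100 * 10**6   # a 100 mbps subscriber line

# The ISP delays the request as if the price were downloaded bytes:
stall_seconds = price_bytes * 8 / link_bits_per_sec
```

So the expensive request costs the user 8 seconds of their own bandwidth before the ISP forwards it, independent of how large the actual response turns out to be.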
I agree, but your idea below that is overly complicated. You can't micro-transact the whole internet.
That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar - where you have to pay admission and sign liability waivers just to walk on the sidewalk. It's not a true solution.
But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.
It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I'd know what I'm getting into.
If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".
Curious to hear other perspectives here. Maybe I'm overreacting or misunderstanding.
https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...
I do believe we will end up there eventually; with emerging tech like Brazil's and India's payment architectures, it should be a possibility in the coming decades.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.
I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.
https://docs.lightning.engineering/the-lightning-network/l40...
https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
The amount of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.
As long as spamming and DDOS'ing pays more than whatever the request costs, it will keep existing.
Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.
The so-called "threats" are identified in users that may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).
So certainly Cloudflare does not meter the consumed resources.
Moreover, Cloudflare preemptively annoys any user who accesses a site for the first time, having never consumed any resources there, perhaps based on irrational profiling of the browser, operating system, and geographical location used.
Your idea of micro transacting web requests would play into it and probably end up with a system like Netflix where your ISP has access to a set of content creators to whom they grant ‘unlimited’ access as part of the service fee.
I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.
Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime.
Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))
But it's an interesting idea and thought experiment.
That seems pretty unreasonable.
"Why are we cutting all the trees in the park?" Really, you want trees to fall on your kid and crush them to death?? What's wrong with saving kids??
"Why are we shutting off the water in the fountains in town?" Really, you want your kids to drown in the fountains or drink contaminated water??
However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives much more false positives than before.
For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".
The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.
Also, during the last few days I have been blocked from accessing ResearchGate. I may hit a few pages per day on the ResearchGate site while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm Cloudflare uses, that it declares such normal usage a "threat".
The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, e.g. Vivaldi or Chrome, I am not blocked.
I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.
In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.
I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.
is_using_vpn? -> bad,abuse,ddos
Thanks, Cloudflare, for saving our Internet by destroying it...
This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.
You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.
I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.
Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with it) and don't want your bot because they view you as, most likely, parasitic.
I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).
Each website is freely opting-into it. No one was forced. Why not ask yourself why that is?
The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.
If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?
Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.
Yes this does make it much harder for non-incumbents, and I don't know what to do about that.
https://www.robotstxt.org/faq/what.html
I wonder if cloudflare users explicitly have to allow google or if it's pre-allowed for them when setting up cloudflare.
Despite what Cloudflare wants us to think here, the web was always meant to be an open information network, and spam protection should not fundamentally change that characteristic.
They provide value for their companies. If you get some value from them, it's just a side effect.
1) It takes your query, and given the complexity might expand it to several search queries using an LLM. ("rephrasing")
2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).
3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.
#3 is the problem, especially because SEO has made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever it can in the game of sidestepping the sites' wishes. They are a bad actor.
EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
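The three steps above can be sketched as a pipeline. Every function and data structure here is a stand-in I made up for illustration; this is not Perplexity's actual code or API.

```python
# Rough sketch of the three-step answer pipeline described above,
# with trivial stubs standing in for the real LLM/search components.
from dataclasses import dataclass

@dataclass
class Hit:
    url: str
    snippet: str

def rephrase_with_llm(query):          # step 1: query expansion
    return [query, query + " review"]

def search_index(query):               # step 2: hit the search index
    return [Hit("https://example.com/" + query.replace(" ", "-"), "...")]

def rank_with_llm(query, hits):        # step 2: LLM picks best docs
    return hits[:3]

def fetch(url):                        # step 3: download the full page
    return "<html>content of " + url + "</html>"

def summarize_with_llm(query, sources):
    return f"summary of {len(sources)} sources for {query!r}"

def answer(user_query: str) -> str:
    queries = rephrase_with_llm(user_query)               # 1) rephrase
    hits = [h for q in queries for h in search_index(q)]  # 2) search + rank
    best = rank_with_llm(user_query, hits)
    sources = [fetch(h.url) for h in best]                # 3) fetch full docs
    return summarize_with_llm(user_query, sources)
```

Step 3 is where the fetcher touches the origin site directly, which is exactly the part sites try to block.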
Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?
Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.
Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google; it's helping its paying customers, and those are the only services those customers want to let through.
Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti patterns on purpose, want to make it hard to unsubscribe, cancel, get a refund, do a return.
AI that is useful to people will fail, for the same reason that no one has great public APIs any more: every public company's real customers are its stockholders, and the consumers are simply a source of revenue, one that is modeled, marketed to, and manipulated all in the name of returns on investment.
I was recently working on a project where I needed to find the published date for a lot of article links, and this came in handy. Not sure if it's changed recently, but asking ChatGPT, Gemini, etc. didn't work; they said they don't have access to current websites. However, asking Perplexity, it fetched the website in real time and gave me the info I needed.
I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.
You say "shouldn't" here, but why?
There seems to be a fundamental conflict between two groups who each assert they have "rights":
* Content consumers claim the right to use whatever software they want to consume content.
* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).
These two "rights" are in direct conflict.
The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.
But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?
Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?
(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)
Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.
"And those LLMs didn't ask anyones permission to crawl the entire 'net."
False: OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of money to Reddit.
You either side with the law or with criminals.
You can't even say the same thing about openAI because we don't know the corpus they train their models on.
Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust that Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.
Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini benefits from this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.
So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.
They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.
Sorry CF, give up. The courts are on our side here.
The world is bigger than the USA.
Just because American tech giants have captured and corrupted legislators in the US doesn't mean the rest of the world will follow.
People like getting money for their work. You do too. Don't lose sight of that.
Sure, Google has to follow it because they're a big company and need to respect certain laws or internal policies. But for everyone else, it's basically just a "please don't" sign, not a legal requirement, right?
In summary: officially, Perplexity claims its bots honor robots.txt. In practice, outside investigators and hosting providers document persistent circumvention of such directives by undeclared or disguised crawlers acting on Perplexity's behalf, especially for real-time user queries.
robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.
People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.
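To illustrate how robots.txt works as a hint rather than a blocking mechanism, here is a minimal sketch using Python's standard-library parser; the file contents are a made-up example, and nothing enforces the result.

```python
# robots.txt is plain text that a client *voluntarily* parses and
# honors; the server cannot enforce it. Python ships a parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The file only asks; nothing stops a fetcher from ignoring it.
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
```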
HN is not a platform for attacking people, even imagined ones.
Please don't fulminate.