1. If I as a human request a website, then I should be shown the content. Everyone agrees.
2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.
3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
Their reasons vary. Some don’t want their business’s perceived quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve and build relationships with customers directly, instead of dealing with disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.
I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.
The next step in your progression here might be:
If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, doing a bunch of crawling regularly (from your machine) to try to stay up to date on things? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
You're not the only stakeholder in any of those interactions. There's you, a mediator (search or LLM), and the website owner.
The website owner (or its users) basically do all the work and provide all the value. They produce the content and carry the costs and risks.
The pre-LLM "deal" was that at least some traffic was sent their way, which helps with reach and attempts at monetization. This too is largely a broken and asymmetrical deal where the search engine holds all the cards but it's better than nothing.
An LLM that no longer sends traffic to websites means there's zero incentive to have a website in the first place, or every incentive to put it behind a login.
I get that users prefer an uncluttered direct answer over manually scanning a puzzling web. But the entire reason that the web is so frustrating is that visitors don't want to pay for anything.
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.
On the flip side, the crawlers also face a reputational risk here: anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup on the IPs, but that's a pain for website owners too, compared with more aggressive block-everything-unusual setups.
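The reverse DNS check mentioned above can be sketched in a few lines. This is a hypothetical verifier, not any vendor's actual implementation; the hostname suffixes a site owner should accept come from each crawler operator's documentation.

```python
import socket

def verify_crawler_ip(ip: str, allowed_suffixes: tuple[str, ...]) -> bool:
    """Forward-confirmed reverse DNS: the IP must reverse-resolve to an
    allowed hostname, and that hostname must forward-resolve back to the
    same IP. A spoofed user agent string cannot pass this check."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except OSError:
        return False
    if not host.endswith(allowed_suffixes):            # e.g. (".googlebot.com",)
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)  # forward lookup
    except OSError:
        return False
    return ip in forward_ips                           # forward-confirm
```

A request claiming to be Googlebot from an IP that reverse-resolves to a residential ISP would fail at the suffix check, regardless of its headers.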
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/fine-tune LLMs (as a side channel when people block crawling for training)?
The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.
So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.
Unfortunately, the top of an LLM's head is pretty big; for most websites, they fit almost the entire site's content in there.
The threshold beyond which content becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" standard that refers to this same concept), has now shifted many, many times higher.
Now, IMO:
So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That pretty much solves the ethical problem. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in its own time and with its own effort. I promise most people won't be waiting. Maybe Google can charge per LLM citation. Cost Per Citation - you even maintain the acronym :)
I think one thing to ask beyond this question is how long it will be before your LLM summaries also include ads and other manipulative patterns.
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is (and I now suspect it is Perplexity based on this Cloudflare post), they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; the problem is the army-of-minions aspect.)
Perplexity is not visiting a website every time a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.
This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to the crawled websites.
> Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol, and won't check your robots.txt file.
In 3, although you do not specify, I assume you mean that a bot requests the page, as opposed to you visiting the page like in scenario 2 and then having an LLM process the downloaded data (similar to an ad blocker). It is the former case that is a problem; the latter case is much harder to stop, and there is much less reason to stop it.
This is the distinction: is a human present at the time of request.
Perplexity is choosing to come back "on a VPN" with new IP addresses to evade the block.
#2 and #3 are about modifying data where access has been granted, I think Cloudflare is really complaining about #1.
Evading an IP address ban doesn't violate my principles in some cases, and does in others.
I have a dinky little txt site on my email domain. There is nothing of value on it, and the content changes less than once a year. So why are AI scrapers hitting it to the tune of dozens of GB per month?
At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).
Definitely don't agree. I don't think you should be shown the content, if for example:
1. You're in a country the site owner doesn't want to do business in.
2. You've installed an ad blocker or other tool that the site owner doesn't want you to use.
3. The site owner has otherwise identified you as someone they don't want visiting their site.
You are welcome to try to fool them into giving you the content but it's not your right to get it.
I disagree. The website should have the right to say that the user can be shown the content under specific conditions (usage terms, presented how they designed, shown with ads, etc). If the software can't comply with those terms, then the human shouldn't be shown the content. Both parties did not agree in good faith.
Because the website has every right to block you or refuse you access if you do that, just like an establishment has the right to refuse you entry if you try to enter without a shirt, when you're denying them the revenue that they predicated your access on.
Similarly, if you're using a user agent the website doesn't like, they have the right to block you, or to take action against that user agent to prevent it from existing if they can't reliably detect and block it.
Because the LLM is usually on a 3rd party cloud system and ultimately not under your full control. You have no idea if the LLM is retaining any of that information for that business's own purposes beyond what a EULA says - which basically amounts to a pinky swear here. Especially if that LLM is located across international borders.
Now, for something like Ollama or LMStudio where the LLM and the whole toolchain is physically on your own system? Yeah that should be like Firefox legally since it's under your control.
I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.
If I put time and effort into a website and its content, I should expect no compensation despite bearing all the costs.
Is that something everyone would agree with?
The internet should be entirely behind paywalls, besides content that is already provided ad free.
Is that something everyone would agree with?
I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"
Because quantity has a quality of its own.
I say this as someone who is on the side of the local user commanding how local compute works, but I understand why companies are reacting to how cheap LLMs have made information discovery against their own datasets.
And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.
We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.
One thing that comes to my mind is: if a human tries to answer a question via the web, they will browse one site after another.
If that human asks an LLM, it will ping 25 sites in parallel.
Scale this up to all of humanity, and it should be expected that internet traffic will rise 25x - just from humans manually asking questions every now and then - we are not even talking about AI companies actively crawling the web.
That means webmasters will have to figure out aggressive caching and let CDNs deal with the problem, or put everything behind a login screen (which might also just be a temporary fix).
I think the line is drawn at "on my behalf". The silent agreement of the web is that humans are served content via a browser, and robots are obeying rules. All we need to support this status quo is to perform data processing by ML models on a client's side, in the browser, the same way we rip out ads.
This is well defined in specs and ToS; it's not quite a gray area.
The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?
If you as a human spam the shit out of my website and waste my resources, I will block you.
If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.
If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.
Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web crawler backed index that refreshes at a respectful rate, and identifies itself accurately as a robot and follows robots.txt), and even if it needed to make an instantaneous request for you at the time of a pertinent query, it still identified itself as a bot and was still respectful... there would be no problem.
The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, what the site owner wants to keep you away from. They download massive static files hundreds or thousands of times a day, not even doing a HEAD to see that the file hasn't changed in 12 years. They straight up ignore robots.txt and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I best go for his kneecaps because he's weak there"
There are plenty of open crawler datasets, they should be using those... but they don't, they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.
The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and they're now deliberately evading checks, to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin; there's so much money flowing in the market for AI, that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their human behaviour destroys the entire internet.
is it just on your behalf? or is it on Perplexity's behalf? are they not archiving the pages to train on?
it's the difference between using Google Chrome vs. Chrome beaming full page snapshots to train Gemini on.
It is illegal to copy stuff from the internet and then make it available from your own servers, especially when those sources have expressly asked you not to do it.
The web admin should be able to block usages 1, 2 or 3 at their discretion. It's their website.
Similarly the user is free to try to engage via 1, 2, 3, or refuse to interact with the website entirely.
2. This also intersects with copyright law. Ingesting content to your servers en masse through automation and transforming it there is not the same as giving people a tool (like Safari Reader) they can run on their client for specific sites they visit. Examples of companies that lost court cases about this:
Aereo, Inc. v. American Broadcasting Companies (2014)
TVEyes, Inc. v. Fox News Network, LLC (2018)
UMG Recordings, Inc. v. MP3.com, Inc. (2000)
Capitol Records, LLC v. ReDigi Inc. (2018)
Cartoon Network v. CSC Holdings (Cablevision) (2008)
Image Search Engines: Perfect 10 v. Google (2007)
That last one is very instructive. Caching thumbnails and previews may be OK. The rest is not. AMP is in a copyright grey area, because publishers choose to make their content available for AMP companies to redisplay. (@tptacek may have more on this)

3. Putting copyright law aside, that's the point: decentralization vs centralization. If a bunch of people want to come eat at an all-you-can-eat buffet, they can, because we know they have limited appetites. If you bring a giant truck and load up all the food from all the all-you-can-eat buffets in the city, that's not OK, even if you later give the food away to homeless people for free. You're going to bankrupt the restaurants! https://xkcd.com/1499/
So no. The difference is that people have come to expect "free" for everything, and this is how we got into ad-supported platforms that dominate our lives.
Would that solve the whole problem? Folks who ran webservers declared what they consent to, and that happens?
I think it's useful to just see if there's a consensus on that: actually making that happen is a whole can of worms itself, but it's strictly simpler than devising a good outcome without the consensus.
(And such things are not impossible, merely difficult; we have other systems, ranging from BGP to the TLD mechanism, that get honored in real life.)
They are running a service and this is not a user taking steps to modify their own content for their own use.
Perplexity is not acting as a user proxy and they need to learn to stick to the rules, even when it interferes with their business model.
The problem is that those in the position to do that are not interested.
The goal is not to make the content usable. The goal is to get the traffic.
When advertising alone is the "business model", e.g., not the value of the "content", then even Cloudflare is going to try to protect it (the advertising, not the content). Anything to get www users to turn on Javascript so the surveillance capitalism can proceed. Hence all the "challenges" to frustrate and filter out software that is not advertising-friendly, e.g., graphical.
Cloudflare's ruminations on user-agent strings are perplexing. It has been an expectation that the user-agent HTTP header will be spoofed since the earliest web browsers. The user-agent header is a joke.
This is from circa 1993, the year the www was opened to public access:
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
Cloudflare's "bot protections" are not to ensure human use of a website but to ensure use of specific software to access a website. Software that facilitates data collection and advertising services. For example, advertising-sponsored browsers. Any other software is labeled "bot". It does not matter if a human is operating it.
2. the end
I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.
There should exist an internet for humans only, with a single account per domain.
People are usually fine with the latter but not the former, even though they come down to the same thing.
I think this is because people don't want LLMs to train on their content, and they don't differentiate between accessing a website to show it to the user, versus accessing it to train.
Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.
Also because, ultimately, by intermediating the request, LLM companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that they may eventually subscribe. When an LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.
Do you -really- want that much abstraction?
There's a bunch of nerds and capitalists about to rediscover GIGO.
Therefore artifices like "you don't have the right to view this website without ads" or "you can't use your phone, computer, or LLM to download or process this outside of my terms because copyright" become possible, institutionalizable, enforceable, and eventually unbypassable by technology.
If we reverted back to the Constitutional purpose of copyright (to Progress the Science and Useful Arts) then things might be more free. That's probably not happening in my lifetime or yours.
Do you think you do?
Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?
If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?
That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).
And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.
It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.
That's... less conclusive than I'd like to see, especially for a content-marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e., systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.
Like most AI companies, Perplexity has established user agent strings for both these cases, and the behavior that Cloudflare is calling out does not use either. It pretends to be a person using Chrome on MacOS.
They do end up looking bad in Cloudflare's report, and Cloudflare are the "good guys" in this story - btw, Cloudflare has been very pushy lately with their we'll-save-the-web, content-independence-day marketing speak. But deep in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're the culprit, since the OP has no real evidence, only hearsay). Both companies look like titans fighting, which ends up being positive for Perplexity, at least in the inflated perception of their firepower... if that makes any sense.
No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.
We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.
>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”
Either way, the CDNs profit big time from the AI scraping hype and the current copyright anarchy in the US
There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
no, because we'll end up with remote attestation needed to access any site of value
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
Edit: trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers.

CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.
Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" content) stuff is unethical bullshit.
The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.
god help us if they ever manage to build anything more than shitty chatbots
Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the “Perplexity-User” user agent doesn’t actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
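The filter the article says is ineffective can be sketched as below. The two agent names are from this thread; everything else is illustrative. The check only catches clients that volunteer an honest User-Agent header, which is exactly the weakness being described.

```python
# Naive server-side user-agent filter (hypothetical): a spoofing client
# simply sends a stock browser string and sails straight through.
BLOCKED_AGENTS = ("PerplexityBot", "Perplexity-User")

def is_blocked(user_agent: str) -> bool:
    """True if the declared User-Agent matches a blocked crawler name."""
    return any(name in user_agent for name in BLOCKED_AGENTS)
```

An honest fetcher announcing "Perplexity-User" gets blocked; the same software claiming to be Chrome on macOS does not, which is why operators fall back to IP checks and fingerprinting.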
if I am willing to pay a penny a page, I and the people like me won't have to put up with clickwrap nonsense
free access doesn't have to be shut off (ok, it will be, but it doesn't have to be, and doesn't that tell you something?)
reddit could charge stiffer fees, but give refunds for quality content to encourage better content. i've fantasized about ideas like "you pay a deposit upfront; you get banned, you lose your deposit; withdraw, you get your deposit back", the goal being to simplify the moderation task while encouraging quality.
because where the internet is headed is just more and more trash.
here's another idea, pay a penny per search at google/search engine of choice. if you don't like the results, you can take the penny back. google's ai can figure out how to please you. if the pennies don't keep coming in, they serve you ad-infested results; serve up ad-infested results, you can send your penny to a different search engine.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it was your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
By the way, I do not use my websites as a revenue stream. This isn't about money.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
"It was actually a caching issue on our end. ;) I just fixed it a few min ago..."
Let's not go on a witch hunt and blame everything on AI scrapers.
No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: Common Crawl and other data dumps are sufficiently large, not to mention easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.
Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.
The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.
If you want to gatekeep your content, use authentication.
Robots.txt is not a technical solution, it's a social nicety.
Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.
I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.
> If you want to gatekeep your content, use authentication.
Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?
> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?
No one will care to share anything for free anymore, because it's AI companies profiting off their hard work. And there's no way to prevent that from happening, because these crawlers don't identify themselves.
(IANAL) tortious interference
Then when they asked Perplexity, it came up with details about the 'exact' content (according to Cloudflare), but their attached screenshot shows the opposite: it shows some generic guesses about the domain ownership and some dynamic ads based on the domain name.
If Perplexity was stealthily visiting the dummy site they would have seen it, as the site was not indexed and no one else was visiting the site. Instead it appears they made assertions about general traffic, not their dummy site.
It's not very convincing.
> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
In this situation, Perplexity should still be permitted to access information on the page they link to.
robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
— https://www.robotstxt.org/faq/what.html
If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.
These are long-standing rules going back decades. You can replicate them yourself by observing wget's behaviour. If you ask wget to fetch a single page, it doesn't look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then, if there are any links to follow, it will fetch robots.txt to determine whether it is permitted to fetch them.
There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.
If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
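The crawler-versus-single-fetch distinction above can be sketched with Python's standard-library robots.txt parser. This is an illustrative sketch only: the user-agent name and URLs are made-up placeholders. A recursive mirror consults the parser before following each discovered link; a one-off fetch of a URL the user explicitly supplied involves no robots.txt lookup at all.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows one subtree for all agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A recursive crawler checks each discovered link before fetching it:
print(parser.can_fetch("ExampleBot", "https://example.com/private/a.html"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/public.html"))     # True

# A single fetch of a user-supplied URL would simply be requested
# directly; by the robotstxt.org definition, no robot is involved.
```

This is exactly the asymmetry the comment describes: the parser only ever matters once an agent starts traversing links on its own.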
Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.
But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.
If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?
That's not what Perplexity's own documentation[1] says, though:
"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity
Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."
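Taken at face value, that documentation implies a site could block Perplexity's bulk crawler while still permitting user-initiated fetches. A hypothetical robots.txt along those lines (Perplexity-User is the agent named in the quoted docs; the crawler token used here is an assumption, so check the current documentation before relying on it):

```
# Block bulk crawling / training collection
# (crawler token assumed here to be "PerplexityBot")
User-agent: PerplexityBot
Disallow: /

# Allow user-initiated page fetches, per the quoted docs
User-agent: Perplexity-User
Allow: /
```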
The web will be a much worse place if such services are all forced behind captchas or logins.
you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:
"sir, have you been botting today?"
immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command
"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile
"yeah... im gonna need you to exit the vehicle and take a field humanity test"
``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.
\>be me, 2045 \>32yo NEET still living in mom's smart-basement \>she threatens to turn off my Neuralink if I don't go outside today \>grudgingly take the robotaxi to pick up tendies \>$150 stablecoin toll at cloudflare checkpoint \>have to solve a moral dilemma to prove I'm not running on GPT-7 \>sweating.dll \>palantir unit pulls me over \>optimus v4 leans into window \>its facial mesh attempts "concern_expression_v2.blend" \>"sir, when did you last feel genuine human connection?" \>flashback to 2024 when the girl at McDonalds gave me extra honey mustard \>that was before the McBots took over \>"t-twenty one years ago officer" \>optimus's empathy subroutines activate \>"sir I need you to perform a field humanity test" \>get out, knees weak from vitamin D deficiency \>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'" \>brain.exe has stopped responding \>try to remember pre-blackpill emotions \>"someone who... likes anime?" \>optimus scans my biometrics \>"stress patterns indicate authentic social anxiety, carry on citizen" \>get back in robotaxi \>it starts therapy session \>"I notice you ordered tendies again. Let's explore your relationship with your mother" \>tfw the car has better emotional intelligence than me \>finally get tendies from Wendy's AutoServ \>receipt prints with mandatory "rate your humanity score today" \>3.2/10 \>at least I'm improving
\>mfw bots are better at being human than humans \>it's over for carboncels ```
We learned to dislike "bubbles" over the past decades, but bubbles make sense and are natural, provided you're not alone in one.
When it becomes awfully busy with machines and machine content humans will learn to reconnect.
Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.
I don't really mind, because history shows this is a temporary thing, but I hope website maintainers have a plan B beyond hoping Cloudflare will protect them from AI forever. Whoever builds an on-ramp for people who run websites today to make money from AI will make a lot of money.
I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume, blood-soaked) cookies.
I don't really know much about the DMCA except that it is used to take down sites that infringe. Perhaps it is possible for Cloudflare (or anyone else) to file a takedown notice with Perplexity. That might at least confuse them.
Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.
Which makes it particularly interesting that Apple is now being linked with Perplexity. In large part, p2p music services were consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for its shiny new (at the time) mass-market iPod devices; it then turned out that most users are happy to pay for content if it is not too expensive and is very convenient.
Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.
It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.
I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.
I've given up and resorted to IP-based rate limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
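The kind of per-IP limiter described can be sketched as a token bucket. A minimal in-memory version follows; the rate and burst numbers are arbitrary illustrations, and a real deployment would more likely use the web server's or a reverse proxy's built-in limits:

```python
import time
from collections import defaultdict

RATE = 5.0    # tokens refilled per second (arbitrary example value)
BURST = 10.0  # maximum bucket size, i.e. allowed burst of requests

# One bucket per client IP, created full on first sight.
buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(ip: str) -> bool:
    """Return True if this request from `ip` is within the rate limit."""
    b = buckets[ip]
    now = time.monotonic()
    # Refill proportionally to elapsed time, capped at the burst size.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # the caller would typically respond 429 Too Many Requests
```

Each IP gets its own bucket, so a single aggressive scraper exhausts only its own tokens while other visitors are unaffected.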
LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
Much like a trolley drop-off at your local shopping center car park: some users will adhere to it and return their trolleys after they're done; others will leave them wherever.
Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.
Don't like it? Then don't post your website on the internet...
No thanks, you don't counter shit with more but slightly different shit.
The service is actually very convenient, whether FAANG likes it or not.
Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.
He went on, upfront (I'd give him that), to explain that he expects a certain percentage of that income to come from enforcing this on AI companies and from AI companies paying up to crawl.
Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.
Where have we all come on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?
Now, it's a gazillion AI crawlers and Python crawlers, plus MCP servers that offer the same feature to anyone "building (personal workflow) automation", including bypassing various standard protection mechanisms.
Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, either to an AI or to a human.
It seems to me that ~50% of the discourse around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDoS attack. The other half concerns IP and capitalism, which seem like far more viable arguments.
If someone requesting your sitemap once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast that you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200 µs and your competition takes 200 ms, you have roughly 1000x their capacity, in CPU time, to absorb an aggressive scraper (or an actual DDoS attack).
Don't you remember the reason Anubis even came to be?
It really wasn't that long ago, so I find all of the snarky comments going "erm, actually, I've yet to see any good actors get harmed by scraping ever, we're just reclaiming power from today's modern ad-ridden hellscape" pretty dishonest.
and see if that fixes the problem.
what machine learning algorithms are they using? time to deploy them onto our websites
Except when their agents happily click the "I'm not a robot" checkbox.
Yikes. Antivirus scanners for website access.
What if robots.txt referred to the ultimate beneficiaries... in this case the AI product that uses that content... which in turn serves another ultimate beneficiary, a human user.
The problem here is obviously the higher prices for hosting the content, and less revenue for those that serve ads, have product placement on their sites, etc.
As long as robots.txt is about ethics/money and is enforced by morality, it doesn't matter who it refers to anyway.
Public-shaming enforcement might work in some cases though, but I doubt it will be that useful. We're talking about companies that have trained their AIs on IPs, and tried their best to later hide it. Does shame affect robots, or companies for that matter?
Cloudflare would very much like to be the middleman for monetary transactions between AI services and site owners (https://blog.cloudflare.com/introducing-pay-per-crawl/), but at the moment they don't have a law at their back, so articles like these are the best they've got.
If you don't want to get scraped, don't put your stuff online.
Should curl be considered a bot too? What's the difference?
I had to check that this did come out of Cloudflare.
The robots.txt file size is usually small and would not alert external services.
If you don't understand why this is at least slightly controversial, I imagine you are not a normal folk.
Now suddenly everything on the net is fake, if it wasn't already.
It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.
So your comparison is, at best, naive (assuming good intentions), or malicious if not.
AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.
Put your valuable content behind a paywall.
This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempted from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also, the "Perplexity bots" aren't crawling websites; they fetch URLs that users explicitly asked for. This shouldn't count as something that needs robots.txt clearance. It's not a robot randomly crawling; it's the user asking for a specific page, basically a shortcut for copy/pasting the content.
Sorry CF, give up. The courts are on our side here.
Sure, Google has to follow it because they're a big company and need to respect certain laws or internal policies. But for everyone else, it's basically just a "please don't" sign, not a legal requirement, right?
In summary: Officially, Perplexity claims its bots honor robots.txt. In practice, outside investigators and hosting providers document persistent circumvention of such directives by undeclared or disguised crawlers acting on Perplexity's behalf, especially for real-time user queries.
robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.
People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.