1. If I as a human request a website, then I should be shown the content. Everyone agrees.
2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.
3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?
Their reasons vary. Some don’t want their business’s perceived quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff serve and build relationships with customers directly, instead of dealing with disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third-party delivery.
I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.
The next step in your progression here might be:
If / when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do - what's the tipping point? Is personal web crawling ok? What if it gets a bit smarter and tries to anticipate what you'll ask, doing a bunch of crawling regularly (from your machine) to try to stay up to date on things? Or is it when you tip the scale further and do general / mass crawling for many users to consume that it becomes a problem?
You're not the only stakeholder in any of those interactions. There's you, a mediator (search or LLM), and the website owner.
The website owner (or its users) basically do all the work and provide all the value. They produce the content and carry the costs and risks.
The pre-LLM "deal" was that at least some traffic was sent their way, which helps with reach and attempts at monetization. This too is largely a broken and asymmetrical deal where the search engine holds all the cards but it's better than nothing.
An LLM that no longer sends traffic to websites means there's zero incentive to have a website in the first place, or every incentive to put it behind a login.
I get that users prefer an uncluttered direct answer over manually scanning a puzzling web. But the entire reason that the web is so frustrating is that visitors don't want to pay for anything.
Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.
They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.
Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.
If AI crawlers win here, we all lose.
I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.
On the flip side, the crawlers also face a reputational risk here: anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup on the IPs, but that's a pain for website owners too, compared with more aggressive block-everything-unusual setups.
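The reverse DNS check mentioned above can be sketched in a few lines. This is a hypothetical verifier, not any vendor's actual implementation; the hostname suffixes a site owner should accept come from each crawler operator's documentation.

```python
import socket

def verify_crawler_ip(ip: str, allowed_suffixes: tuple[str, ...]) -> bool:
    """Forward-confirmed reverse DNS: the IP must reverse-resolve to an
    allowed hostname, and that hostname must forward-resolve back to the
    same IP. A spoofed user agent string cannot pass this check."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except OSError:
        return False
    if not host.endswith(allowed_suffixes):            # e.g. (".googlebot.com",)
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)  # forward lookup
    except OSError:
        return False
    return ip in forward_ips                           # forward-confirm
```

A request claiming to be Googlebot from an IP that reverse-resolves to a residential ISP would fail at the suffix check, regardless of its headers.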
AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.
Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/fine-tune LLMs (as a side channel when people block crawling for training)?
The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.
So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.
Unfortunately, the top of an LLM's head is pretty big; for most websites, they fit almost the entire site's content in there.
The threshold beyond which content becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law has a "reasonable" standard that refers to this same concept), has now shifted many, many times higher.
Now, IMO:
So far, for stuff that won't fit in someone's head, people were using citations (academia, for example). LLMs should also use citations. That pretty much solves the ethical problem. That the ad ecosystem chose views as the monetisation point and is thus hurt by this is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in its own time and with its own effort. I promise most people won't be waiting. Maybe Google can charge per LLM citation. Cost Per Citation - you even maintain the acronym :)
I think one thing to ask beyond this question is how long it will be before your LLM summaries also include ads and other manipulative patterns.
But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too.
Whoever it is (and I now suspect it is Perplexity based on this Cloudflare post), they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; the problem is the army-of-minions aspect.)
Perplexity is not visiting a website every time a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.
This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to the crawled websites.
> Because Pipes is not a web crawler (the service only retrieves URLs when requested to by a Pipe author or user) Pipes does not follow the robots exclusion protocol, and won't check your robots.txt file.
In 3, although you do not specify, I assume you mean that a bot requests the page, as opposed to you visiting the page like in scenario 2 and then having an LLM process the downloaded data (similar to an ad blocker). It is the former case that is a problem; the latter case is much harder to stop, and there is much less reason to stop it.
This is the distinction: is a human present at the time of request.
Perplexity is choosing to come back "on a VPN" with new IP addresses to evade the block.
#2 and #3 are about modifying data where access has been granted, I think Cloudflare is really complaining about #1.
Evading an IP address ban doesn't violate my principles in some cases, and does in others.
I have a dinky little txt site on my email domain. There is nothing of value on it, and the content changes less than once a year. So why are AI scrapers hitting it to the tune of dozens of GB per month?
At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).
Definitely don't agree. I don't think you should be shown the content, if for example:
1. You're in a country the site owner doesn't want to do business in.
2. You've installed an ad blocker or other tool that the site owner doesn't want you to use.
3. The site owner has otherwise identified you as someone they don't want visiting their site.
You are welcome to try to fool them into giving you the content but it's not your right to get it.
I disagree. The website should have the right to say that the user can be shown the content under specific conditions (usage terms, presented how they designed, shown with ads, etc). If the software can't comply with those terms, then the human shouldn't be shown the content. Both parties did not agree in good faith.
Because the website has every right to block you or refuse you access if you do that, just like an establishment has the right to refuse you entry if you try to enter without a shirt, when you're denying them the revenue that they predicated your access on.
Similarly, if you're using a user agent the website doesn't like, they have the right to block you, or to take action against that user agent to prevent it from existing if they can't reliably detect and block it.
Because the LLM is usually on a 3rd party cloud system and ultimately not under your full control. You have no idea if the LLM is retaining any of that information for that business's own purposes beyond what a EULA says - which basically amounts to a pinky swear here. Especially if that LLM is located across international borders.
Now, for something like Ollama or LMStudio where the LLM and the whole toolchain is physically on your own system? Yeah that should be like Firefox legally since it's under your control.
I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.
If I put time and effort into a website and its content, I should expect no compensation despite bearing all the costs.
Is that something everyone would agree with?
The internet should be entirely behind paywalls, besides content that is already provided ad free.
Is that something everyone would agree with?
I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"
Because quantity has a quality of its own.
I say this as someone who is on the side of the local user commanding how local compute works, but I understand why companies are reacting to how cheap LLMs have made information discovery against their own datasets.
And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.
We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.
One thing that comes to my mind is: if a human tries to answer a question via the web, they will browse one site after another.
If that human asks an LLM, it will ping 25 sites in parallel.
Scale this up to all of humanity, and it should be expected that internet traffic will rise 25x - just from humans manually asking questions every now and then - we are not even talking about AI companies actively crawling the web.
That means webmasters will have to figure out aggressive caching and let CDNs deal with the problem, or put everything behind a login screen (which might also just be a temporary fix).
I think the line is drawn at "on my behalf". The silent agreement of the web is that humans are served content via a browser, and robots are obeying rules. All we need to support this status quo is to perform data processing by ML models on a client's side, in the browser, the same way we rip out ads.
This is well defined in specs and ToS; it's not quite a gray area.
The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?
If you as a human spam the shit out of my website and waste my resources, I will block you.
If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.
If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.
Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web crawler backed index that refreshes at a respectful rate, and identifies itself accurately as a robot and follows robots.txt), and even if it needed to make an instantaneous request for you at the time of a pertinent query, it still identified itself as a bot and was still respectful... there would be no problem.
The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, what the site owner wants to keep you away from. They download massive static files hundreds or thousands of times a day, not even doing a HEAD to see that the file hasn't changed in 12 years. They straight up ignore robots.txt and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I best go for his kneecaps because he's weak there"
There are plenty of open crawler datasets, they should be using those... but they don't, they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.
The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and they're now deliberately evading checks, to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin; there's so much money flowing in the market for AI, that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their human behaviour destroys the entire internet.
is it just on your behalf? or is it on Perplexity's behalf? are they not archiving the pages to train on?
it's the difference between using Google Chrome vs. Chrome beaming full page snapshots to train Gemini on.
It is illegal to copy stuff from the internet and then make it available from your own servers, especially when those sources have expressly asked you not to do it.
The web admin should be able to block usages 1, 2 or 3 at their discretion. It's their website.
Similarly the user is free to try to engage via 1, 2, 3, or refuse to interact with the website entirely.
2. This also intersects with copyright law. Ingesting content to your servers en masse through automation and transforming it there is not the same as giving people a tool (like Safari Reader) they can run on their client for specific sites they visit. Examples of companies that lost court cases about this:
Aereo, Inc. v. American Broadcasting Companies (2014)
TVEyes, Inc. v. Fox News Network, LLC (2018)
UMG Recordings, Inc. v. MP3.com, Inc. (2000)
Capitol Records, LLC v. ReDigi Inc. (2018)
Cartoon Network v. CSC Holdings (Cablevision) (2008)
Image Search Engines: Perfect 10 v. Google (2007)
That last one is very instructive. Caching thumbnails and previews may be OK. The rest is not. AMP is in a copyright grey area, because publishers choose to make their content available for AMP companies to redisplay. (@tptacek may have more on this)

3. Putting copyright law aside, that's the point: decentralization vs centralization. If a bunch of people want to come eat at an all-you-can-eat buffet, they can, because we know they have limited appetites. If you bring a giant truck and load up all the food from all the all-you-can-eat buffets in the city, that's not OK, even if you later give the food away to homeless people for free. You're going to bankrupt the restaurants! https://xkcd.com/1499/
So no. The difference is that people have come to expect "free" for everything, and this is how we got into ad-supported platforms that dominate our lives.
Would that solve the whole problem? Folks who ran webservers declared what they consent to, and that happens?
I think it's useful to just see if there's a consensus on that: actually making that happen is a whole can of worms itself, but it's strictly simpler than devising a good outcome without the consensus.
(And such things are not impossible, merely difficult; we have other systems, ranging from BGP to the TLD mechanism, that get honored in real life.)
They are running a service and this is not a user taking steps to modify their own content for their own use.
Perplexity is not acting as a user proxy and they need to learn to stick to the rules, even when it interferes with their business model.
The problem is that those in the position to do that are not interested.
The goal is not to make the content usable. The goal is to get the traffic.
When advertising alone is the "business model", e.g., not the value of the "content", then even Cloudflare is going to try to protect it (the advertising, not the content). Anything to get www users to turn on Javascript so the surveillance capitalism can proceed. Hence all the "challenges" to frustrate and filter out software that is not advertising-friendly, e.g., graphical.
Cloudflare's ruminations on user-agent strings are perplexing. It has been an expectation that the user-agent HTTP header will be spoofed since the earliest web browsers. The user-agent header is a joke.
This is from circa 1993, the year the www was opened to public access:
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
Cloudflare's "bot protections" are not to ensure human use of a website but to ensure use of specific software to access a website. Software that facilitates data collection and advertising services. For example, advertising-sponsored browsers. Any other software is labeled "bot". It does not matter if a human is operating it.
2. the end
I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.
There should exist an internet for humans only, with a single account per domain.
People are usually fine with the latter but not the former, even though they come down to the same thing.
I think this is because people don't want LLMs to train on their content, and they don't differentiate between accessing a website to show it to the user, versus accessing it to train.
Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.
Also because, ultimately, by intermediating the request, LLM companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that they may eventually subscribe. When an LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.
Do you -really- want that much abstraction?
There's a bunch of nerds and capitalists about to rediscover GIGO.
Therefore artifices like "you don't have the right to view this website without ads" or "you can't use your phone, computer, or LLM to download or process this outside of my terms because copyright" become possible, institutionalizable, enforceable, and eventually unbypassable by technology.
If we reverted back to the Constitutional purpose of copyright (to Progress the Science and Useful Arts) then things might be more free. That's probably not happening in my lifetime or yours.
Do you think you do?
Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?
If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?
That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).
And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.
It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.
That's... less conclusive than I'd like to see, especially for a content-marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e., systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.
Like most AI companies, Perplexity has established user agent strings for both these cases, and the behavior that Cloudflare is calling out does not use either. It pretends to be a person using Chrome on MacOS.
They do end up looking bad in Cloudflare's report, and Cloudflare are the "good guys" in this story - btw, Cloudflare has been very pushy lately with their we'll-save-the-web, content-independence-day marketing speak. But deep in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're the culprit, since the OP has no real evidence, only hearsay). Both companies look like titans fighting, which ends up being positive for Perplexity, at least in the inflated perception of their firepower... if that makes any sense.
No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.
We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.
>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”
Either way, the CDNs profit big time from the AI scraping hype and the current copyright anarchy in the US
There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).
I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.
no, because we'll end up with remote attestation needed to access any site of value
$ curl -sI https://www.perplexity.ai | head -1
HTTP/2 403
Edit: trying to fake a browser user agent with curl also doesn't work; they're using a more sophisticated method to detect crawlers.

CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.
Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, I-don't-care-if-it-kills-your-site (while "borrowing" content) stuff is unethical bullshit.
The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.
god help us if they ever manage to build anything more than shitty chatbots
Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the “Perplexity-User” user agent doesn’t actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
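The filter the article says is ineffective can be sketched as below. The two agent names are from this thread; everything else is illustrative. The check only catches clients that volunteer an honest User-Agent header, which is exactly the weakness being described.

```python
# Naive server-side user-agent filter (hypothetical): a spoofing client
# simply sends a stock browser string and sails straight through.
BLOCKED_AGENTS = ("PerplexityBot", "Perplexity-User")

def is_blocked(user_agent: str) -> bool:
    """True if the declared User-Agent matches a blocked crawler name."""
    return any(name in user_agent for name in BLOCKED_AGENTS)
```

An honest fetcher announcing "Perplexity-User" gets blocked; the same software claiming to be Chrome on macOS does not, which is why operators fall back to IP checks and fingerprinting.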
if I am willing to pay a penny a page, I and the people like me won't have to put up with clickwrap nonsense
free access doesn't have to be shut off (ok, it will be, but it doesn't have to be, and doesn't that tell you something?)
reddit could charge stiffer fees, but give refunds for quality content to encourage better content. i've fantasized about ideas like "you pay a deposit upfront; you get banned, you lose your deposit; withdraw, you get your deposit back", the goal being to simplify the moderation task while encouraging quality.
because where the internet is headed is just more and more trash.
here's another idea, pay a penny per search at google/search engine of choice. if you don't like the results, you can take the penny back. google's ai can figure out how to please you. if the pennies don't keep coming in, they serve you ad-infested results; serve up ad-infested results, you can send your penny to a different search engine.
I want my work to be freely available to any person who wants it. Feel free to transform my material as you see fit. Hell, do it with LLMs! I don't care.
The LLM isn't the problem, it's what companies like Perplexity are doing with the LLM. Do not create commercial products that regurgitate my work as if it was your own. It's de facto theft, if not de jure theft.
Knowing that it is not de jure theft, and so I have no legal recourse, I will continue to tune my servers to block and/or deceive Perplexity and similar tools.
By the way, I do not use my websites as a revenue stream. This isn't about money.
If you are the source I think they could make plenty of sense. As an example, I run a website where I've spent a lot of time documenting the history of a somewhat niche activity. Much of this information isn't available online anywhere else.
As it happens I'm happy to let bots crawl the site, but I think it's a reasonable stance to not want other companies to profit from my hard work. Even more so when it actually costs me money to serve requests to the company!
"It was actually a caching issue on our end. ;) I just fixed it a few min ago..."
Let's not go on a witch hunt and blame everything on AI scrapers.
No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: Common Crawl and other data dumps are sufficiently large, not to mention easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.
Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.
The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.
If you want to gatekeep your content, use authentication.
Robots.txt is not a technical solution, it's a social nicety.
Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.
Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.
I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.
> If you want to gatekeep your content, use authentication.
Are there no limits on what you use the content for? I can start my own search engine that just scrapes Google results?
> Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.
How does one follow the other? It's my web server and I can gatekeep access to my content however I want (eg Cloudflare). How is that an "abuse" of internet protocols?
No one will care to share anything for free anymore, because it's AI companies profiting off their hard work. And there's no way to prevent that from happening, because these crawlers don't identify themselves.
(IANAL) tortious interference
Then when they asked Perplexity, it came up with details about the 'exact' content (according to Cloudflare), but their attached screenshot shows the opposite: it shows some generic guesses about the domain ownership and some dynamic ads based on the domain name.
If Perplexity was stealthily visiting the dummy site they would have seen it, as the site was not indexed and no one else was visiting the site. Instead it appears they made assertions about general traffic, not their dummy site.
It's not very convincing.
> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.
> Hello, would you be able to assist me in understanding this website? https:// […] .com/
In this situation, Perplexity should still be permitted to access information on the page they link to.
robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
— https://www.robotstxt.org/faq/what.html
If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.
These are long-standing rules going back decades. You can replicate them yourself by observing wget's behaviour. If you ask wget to fetch a single page, it doesn't look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then, if there are any links to follow, it will fetch robots.txt to determine whether it is permitted to fetch them.
There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.
If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
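The crawler-versus-single-fetch distinction above can be sketched with Python's standard-library robots.txt parser. This is an illustrative sketch only: the user-agent name and URLs are made-up placeholders. A recursive mirror consults the parser before following each discovered link; a one-off fetch of a URL the user explicitly supplied involves no robots.txt lookup at all.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows one subtree for all agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A recursive crawler checks each discovered link before fetching it:
print(parser.can_fetch("ExampleBot", "https://example.com/private/a.html"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/public.html"))     # True

# A single fetch of a user-supplied URL would simply be requested
# directly; by the robotstxt.org definition, no robot is involved.
```

This is exactly the asymmetry the comment describes: the parser only ever matters once an agent starts traversing links on its own.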
Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".
Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.
If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.
But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.
If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?
That's not what Perplexity's own documentation[1] says, though:
"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity
Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."
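Taken at face value, that documentation implies a site could block Perplexity's bulk crawler while still permitting user-initiated fetches. A hypothetical robots.txt along those lines (Perplexity-User is the agent named in the quoted docs; the crawler token used here is an assumption, so check the current documentation before relying on it):

```
# Block bulk crawling / training collection
# (crawler token assumed here to be "PerplexityBot")
User-agent: PerplexityBot
Disallow: /

# Allow user-initiated page fetches, per the quoted docs
User-agent: Perplexity-User
Allow: /
```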
The web will be a much worse place if such services are all forced behind captchas or logins.
you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:
"sir, have you been botting today?"
immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command
"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile
"yeah... im gonna need you to exit the vehicle and take a field humanity test"
``` The user wants me to create a more elaborate greentext with deeper worldbuilding and references to internet culture/incel memes, but specifically asks for "tasteful" ones and explicitly says no violent references. I should incorporate classic internet/chan culture references like tendies, NEET lifestyle, "tfw no gf", basement dwelling, etc. while building out this dystopian 2045 world where proving humanity is the central challenge.
\>be me, 2045 \>32yo NEET still living in mom's smart-basement \>she threatens to turn off my Neuralink if I don't go outside today \>grudgingly take the robotaxi to pick up tendies \>$150 stablecoin toll at cloudflare checkpoint \>have to solve a moral dilemma to prove I'm not running on GPT-7 \>sweating.dll \>palantir unit pulls me over \>optimus v4 leans into window \>its facial mesh attempts "concern_expression_v2.blend" \>"sir, when did you last feel genuine human connection?" \>flashback to 2024 when the girl at McDonalds gave me extra honey mustard \>that was before the McBots took over \>"t-twenty one years ago officer" \>optimus's empathy subroutines activate \>"sir I need you to perform a field humanity test" \>get out, knees weak from vitamin D deficiency \>"please describe your ideal romantic partner without using the words 'tradwife' or 'submissive'" \>brain.exe has stopped responding \>try to remember pre-blackpill emotions \>"someone who... likes anime?" \>optimus scans my biometrics \>"stress patterns indicate authentic social anxiety, carry on citizen" \>get back in robotaxi \>it starts therapy session \>"I notice you ordered tendies again. Let's explore your relationship with your mother" \>tfw the car has better emotional intelligence than me \>finally get tendies from Wendy's AutoServ \>receipt prints with mandatory "rate your humanity score today" \>3.2/10 \>at least I'm improving
\>mfw bots are better at being human than humans \>it's over for carboncels ```
We learned to dislike "bubbles" over the past decades, but bubbles make sense and are natural, provided you're not alone in one.
When it becomes awfully busy with machines and machine content humans will learn to reconnect.
Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.
I don't really mind, because history shows this is a temporary thing, but I hope website maintainers have a plan B beyond hoping Cloudflare will protect them from AI forever. Whoever builds an on-ramp for people who run websites today to make money from AI will make a lot of money.
I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly siphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume, blood-soaked) cookies.
I don't really know much about the DMCA except that it is used to take down sites that infringe. Perhaps it is possible for Cloudflare (or anyone else) to file a takedown notice with Perplexity. That might at least confuse them.
Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.
Which makes it particularly interesting that Apple is now being linked with Perplexity. In large part, p2p music services were consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for its shiny new (at the time) mass-market iPod devices; it then turned out that most users are happy to pay for content if it is not too expensive and is very convenient.
Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.
It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.
I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.
I've given up and resorted to IP-based rate limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
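The kind of per-IP limiter described can be sketched as a token bucket. A minimal in-memory version follows; the rate and burst numbers are arbitrary illustrations, and a real deployment would more likely use the web server's or a reverse proxy's built-in limits:

```python
import time
from collections import defaultdict

RATE = 5.0    # tokens refilled per second (arbitrary example value)
BURST = 10.0  # maximum bucket size, i.e. allowed burst of requests

# One bucket per client IP, created full on first sight.
buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(ip: str) -> bool:
    """Return True if this request from `ip` is within the rate limit."""
    b = buckets[ip]
    now = time.monotonic()
    # Refill proportionally to elapsed time, capped at the burst size.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # the caller would typically respond 429 Too Many Requests
```

Each IP gets its own bucket, so a single aggressive scraper exhausts only its own tokens while other visitors are unaffected.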
LLM scraper bots are starting to make up a lot of our egress traffic, and that is starting to weigh on our bills.
Much like a trolley drop-off at your local shopping center car park: some users will adhere to it and return their trolleys after they're done; others will leave them wherever.
Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.
Don't like it? Then don't post your website on the internet...
No thanks, you don't counter shit with more but slightly different shit.
The service is actually very convenient, whether FAANG likes it or not.
Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.
He went on, upfront (I'd give him that), to explain that he expects a certain percentage of that income to come from enforcing this on AI companies and from AI companies paying up to crawl.
Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.
Where have we all come on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?
Now, it's a gazillion AI crawlers and Python crawlers, plus MCP servers that offer the same feature to anyone "building (personal workflow) automation", including bypassing various standard protection mechanisms.
Cloudflare will help their publishers block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, either to an AI or to a human.
It seems to me that ~50% of the discourse around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDoS attack. The other half concerns IP and capitalism, which seem like far more viable arguments.
If someone requesting your sitemap once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast that you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200 µs and your competition takes 200 ms, you have roughly 1000x their capacity, in CPU time, to absorb an aggressive scraper (or an actual DDoS attack).
Don't you remember the reason Anubis even came to be?
It really wasn't that long ago, so I find all of the snarky comments going "erm, actually, I've yet to see any good actors get harmed by scraping ever, we're just reclaiming power from today's modern ad-ridden hellscape" pretty dishonest.
and see if that fixes the problem.
what machine learning algorithms are they using? time to deploy them onto our websites
Except when their agents happily click the "I'm not a robot" checkbox.
Yikes. Antivirus scanners for website access.
What if robots.txt referred to the ultimate beneficiaries... in this case the AI product that uses that content... which in turn serves another ultimate beneficiary, a human user.
The problem here is obviously the higher prices for hosting the content, and less revenue for those that serve ads, have product placement on their sites, etc.
As long as robots.txt is about ethics/money and is enforced by morality, it doesn't matter who it refers to anyway.
Public-shaming enforcement might work in some cases though, but I doubt it will be that useful. We're talking about companies that have trained their AIs on IPs, and tried their best to later hide it. Does shame affect robots, or companies for that matter?
Cloudflare would very much like to be the middleman for monetary transactions between AI services and site owners (https://blog.cloudflare.com/introducing-pay-per-crawl/), but at the moment they don't have a law at their back, so articles like these are the best they've got.
If you don't want to get scraped, don't put your stuff online.
Should curl be considered a bot too? What's the difference?
I had to check that this did come out of Cloudflare.
The robots.txt file size is usually small and would not alert external services.
If you don't understand why this is at least slightly controversial, I imagine you are not a normal folk.
Now suddenly everything on the net is fake, if it wasn't already.
It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.
So your comparison is, at best, naive (assuming good intentions), or malicious if not.
AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.
Put your valuable content behind a paywall.
This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempted from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also, the "Perplexity bots" aren't crawling websites; they fetch URLs that users explicitly asked for. This shouldn't count as something that needs robots.txt clearance. It's not a robot randomly crawling; it's the user asking for a specific page, basically a shortcut for copy/pasting the content.
Sorry CF, give up. The courts are on our side here.
Sure, Google has to follow it because they're a big company and need to respect certain laws or internal policies. But for everyone else, it's basically just a "please don't" sign, not a legal requirement, right?
In summary: Officially, Perplexity claims its bots honor robots.txt. In practice, outside investigators and hosting providers document persistent circumvention of such directives by undeclared or disguised crawlers acting on Perplexity's behalf, especially for real-time user queries.
robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.
People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.