This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and untrusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used third-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also the "perplexity bots" arent crawling websites, they fetch URLs that the users explicitly asked. This shouldnt count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page and basically a shortcut for copy/pasting the content
Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.
And the thing is... people already pay for the internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.
I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.
It's also easy to rate-limit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 Mbps, the ISP will stall the request for 8 seconds, and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
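As a rough sketch of the accounting (nothing here is a real ISP API, just the arithmetic behind the example above):

    def stall_seconds(price_bytes, line_rate_bps):
        # Time needed to "pay for" the price tag at the user's own line rate.
        return (price_bytes * 8) / line_rate_bps

    # A 100 MB price tag on a 100 Mbps line -> 8 seconds before the request is forwarded.
    print(stall_seconds(100 * 10**6, 100 * 10**6))  # 8.0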
I agree, but your idea below that is overly complicated. You can't micro-transact the whole internet.
That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar - where you have to pay admission and sign liability waivers to even walk on the sidewalk outside. It's not a true solution.
I agree that end-users cannot handle microtransactions across the whole internet. That said, I would like to point out that most of the internet is blanketed in ads, and ads involve tons of tiny, quick auctions and microtransactions that occur on each page load.
It is totally possible for a system to evolve involving tons of tiny transactions across page loads.
And, the whole internet is already micro-transactioned! Every page with ads is doing a bidding war and spending money on your attention. The only person not allowed to bid is yourself!
Clearly you don't have the lobes for business /s
But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.
It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I'd know what I'm getting into.
I'm old enough to remember when that was grounds for getting your site removed from Google results - "cloaking" was against the rules. You couldn't return one result for Googlebot, and another for humans.
No idea when they stopped doing that, but they obviously have let go of that principle.
If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".
That content can't be served entirely for free doesn't mean that all content will require payment, and so be subject to issues with payment processors, just that some things may gravitate back to a model where it costs a small amount to host something (e.g. pay for home internet and host bits off that, or run a VPS out there that hosts tools and costs a few dollars a month or year). I pay for resources to host my bits & bobs instead of relying on services provided in exchange for stalking the people looking at them; this is free for the viewer, as they aren't even paying indirectly.
Most things are paid for anyway, even if neither the person hosting them nor I, looking at them, pay directly: adtech arseholes give services to people hosting content in exchange for the ability to stalk us and attempt to divert our attention. Very few sites/apps, other than play/hobby ones like mine or those from more actively privacy-focused types, are free of that.
It doesn't just apply to the web, it applies to literally everything that we spend money on via a third party service. Which is... most everything these days.
Curious to hear other perspectives here. Maybe I'm overreacting/misunderstanding.
https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...
I do believe we will end up there eventually; with emerging tech like Brazil's and India's payment architectures, it should be a possibility in the coming decades.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.
I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.
https://docs.lightning.engineering/the-lightning-network/l40...
https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
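To make "automatic payment negotiation" concrete, the client-side flow could look something like this. HTTP 402 itself is real, but the header names and the wallet call below are made up; there is no agreed standard yet:

    import requests

    def fetch_with_payment(url, wallet, max_price_sats=50):
        resp = requests.get(url)
        if resp.status_code == 402:
            invoice = resp.headers.get("X-Payment-Invoice")       # hypothetical header
            price = int(resp.headers.get("X-Payment-Amount", 0))  # hypothetical header
            if invoice and price <= max_price_sats:
                receipt = wallet.pay_invoice(invoice)             # hypothetical wallet API
                resp = requests.get(url, headers={"X-Payment-Receipt": receipt})
        return resp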
The number of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.
As long as spamming and DDOS'ing pays more than whatever the request costs, it will keep existing.
Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.
The so-called "threats" are identified in users who may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).
So certainly Cloudflare does not meter the consumed resources.
Moreover, Cloudflare preemptively annoys any user who accesses a site for the first time, having never consumed any resources, perhaps based on irrational profiling of the browser, operating system, and geographical location used.
Your idea of microtransacting web requests would play into it and probably end up with a system like Netflix, where your ISP has access to a set of content creators to whom they grant 'unlimited' access as part of the service fee.
I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.
Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime.
Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))
But it's an interesting idea and thought experiment.
That seems pretty unreasonable.
Only this week I have witnessed several dozen cases where Cloudflare has blocked normal Web page accesses without any legitimate reason, and this besides the usual annoyance of slowing every single access to any page on their "protected" sites with a bot-check popup window.
"why are we cutting all the trees in the park?" really you want trees to fall on your kid and crushing them to death?? what's wrong with saving kids??
"why are we closing the water in the fountains in the town?" really you want your kids to drown into the fountains or drink contaminated water??
However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives many more false positives than before.
For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".
The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.
Also, during the last few days I have been blocked from accessing ResearchGate. I may happen to hit a few times per day some page on the ResearchGate site, while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm is used by Cloudflare, that it declares that such normal usage is a "threat".
The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, i.e. Vivaldi or Chrome, I am not blocked.
I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.
In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.
I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.
Sometimes just refreshing the page seems to work too. Disabling the tracker blocking allows cross-site requests to Cloudflare endpoints which seems to be enough. Maybe worth allow-listing CF domains, but I didn't look into if that is possible yet.
is_using_vpn? -> bad,abuse,ddos
thanks, Cloudflare, for saving our internet by destroying it...
This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.
You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.
I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.
Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with them) and don't want your bot because they view you as, most likely, parasitic.
I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).
Each website is freely opting into it. No one was forced. Why not ask yourself why that is?
The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.
If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?
Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.
Yes this does make it much harder for non-incumbents, and I don't know what to do about that.
https://www.robotstxt.org/faq/what.html
I wonder if Cloudflare users explicitly have to allow Google or if it's pre-allowed for them when setting up Cloudflare.
Despite what Cloudflare wants us to think here, the web was always meant to be an open information network, and spam protection should not fundamentally change that characteristic.
But at the end of the day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
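For example, at the application layer it can be as little as this (the bot names are just illustrative; in practice you'd usually do it in the reverse proxy config instead):

    BLOCKED_AGENTS = ("PerplexityBot", "SomeOtherBot")  # illustrative names

    def block_bots(app):
        # Minimal WSGI middleware: return 403 for any blocked user-agent substring.
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bot in ua for bot in BLOCKED_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Bots are not allowed here.\n"]
            return app(environ, start_response)
        return middleware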
They provide value for their companies. If you get some value from them, it's just a side effect.
1) It takes your query and, given its complexity, might expand it to several search queries using an LLM ("rephrasing").
2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).
3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.
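Roughly, the pipeline described above would look something like this (the llm and search_index objects are placeholders, not Perplexity's actual code):

    import requests

    def answer(query, llm, search_index):
        sub_queries = llm.rephrase(query)            # 1) expand the query with an LLM
        hits = [hit for q in sub_queries for hit in search_index.search(q)]
        relevant = llm.rank(query, hits)             # 2) keep only the most relevant results
        pages = [requests.get(h.url).text for h in relevant]  # 3) fetch full source documents
        return llm.summarize(query, pages)           #    to seed the in-depth answer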
#3 is the problem. Especially because SEO has really made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever it can in the game of sidestepping the sites' wishes. They are a bad actor.
EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?
When ChatGPT reads a review website, though? Zero ad clicks, zero affiliate links.
Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.
I've only ever seen a Cloudflare interstitial when viewing a page with my VPN on, for example -- something I'm happy about as a site owner and accept quite willingly as a VPN user knowing the kinds of abuse that occur over VPN.
Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google... It's helping its paying customers, and those are the only services those customers want to let through.
Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti patterns on purpose, want to make it hard to unsubscribe, cancel, get a refund, do a return.
AI that is useful to people will fail. For the same reason that no one has great public APIs any more. Because every public company's real customers are its stockholders, and the consumers are simply a source of revenue. One that is modeled, marketed to, and manipulated, all in the name of returns on investment.
I was recently working on a project where I needed to find out the published date for a lot of article links, and this came in handy. Not sure if it's changed recently, but asking ChatGPT, Gemini, etc. didn't work; they said they don't have access to current websites. However, asking Perplexity, it fetched the website in real time and gave me the info I needed.
I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.
You say "shouldn't" here, but why?
There seems to be a fundamental conflict between two groups who each assert they have "rights":
* Content consumers claim the right to use whatever software they want to consume content.
* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).
These two "rights" are in direct conflict.
The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.
But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?
Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?
(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)
It'd likely be a fantastic good if "content creators" stopped being able to eat from the slop they shovel. In the meantime, the smarter the tools that let folks never encounter that form of "content", the more they will pay for them.
There remain legitimate information creation or information discovery activities that nobody used to call "content". One can tell which they are by whether they have names that predate SEO, like "research" or "journalism" or "creative writing".
Ad-scaffolding, which is what the word "content" came to mean, costs money to make, ideally less than what the ads it provides a place for generate. This simple equation means the whole ecosystem, together with the technology attempting to perpetuate it as viable, is an ouroboros, eating its own effluvia.
It is, I would argue, undetermined that advertising-driven content as a business model has a "right" to exist in today's form, rather than any number of other business models that sufficed for millennia of information and artistry before.
Today LLMs serve both the generation of additional literally brain-less content, and the sifting of such from information worth using. Both sides are up in arms, but in the long run, it sure seems some other form of information origination and creativity is likely to serve everyone better.
If they want the RSS feeds to be accessible then they should configure it to allow those requests.
Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.
"And those LLMs didn't ask anyones permission to crawl the entire 'net."
False. OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of money to Reddit.
You either side with the law or with criminals.
You can't even say the same thing about OpenAI, because we don't know the corpus they train their models on.
We're seeing many posts about site owners getting hit by millions of requests because of LLMs; we can't blame Cloudflare for this because it's literally a necessary evil.
Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.
Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini has the advantage of this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.
So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.
They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.