This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched with antispam checks, even for a single web request. The internet we knew was open and untrusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.
That said, why does Perplexity even need to crawl websites? I thought they used third-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.
Also the "perplexity bots" arent crawling websites, they fetch URLs that the users explicitly asked. This shouldnt count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page and basically a shortcut for copy/pasting the content
Cloudflare only needs to exist because the server doesn't get paid when a user or bot requests resources. Advertising only needs to exist because the publisher doesn't get paid when a user or bot requests resources.
And the thing is... people already pay for the internet. They pay their ISP. So people are perfectly happy to pay for resources that they consume on the Internet, and they already have an infrastructure for doing so.
I feel like the answer is that all web requests should come with a price tag, and the ISP that is delivering the data is responsible for paying that price tag and then charging the downstream user.
It's also easy to rate-limit. The ISP will just count the price tag as 'bytes'. So your price could be 100 MB or whatever (independent of how large the response is), and if your internet is 100 Mbps, the ISP will stall the request for 8 seconds, and then make it. If the user aborts the request before the page loads, the ISP won't send the request to the server and no resources are consumed.
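As a rough sketch of the accounting (nothing here is a real ISP API, just the arithmetic behind the example above):

    def stall_seconds(price_bytes, line_rate_bps):
        # Time needed to "pay for" the price tag at the user's own line rate.
        return (price_bytes * 8) / line_rate_bps

    # A 100 MB price tag on a 100 Mbps line -> 8 seconds before the request is forwarded.
    print(stall_seconds(100 * 10**6, 100 * 10**6))  # 8.0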
I agree, but your idea below that is overly complicated. You can't micro-transact the whole internet.
That idea feels like those episodes of Star Trek DS9 that take place on Ferenginar - where you have to pay admission and sign liability waivers to even walk on the sidewalk outside. It's not a true solution.
I agree that end-users cannot handle microtransactions across the whole internet. That said, I would like to point out that most of the internet is blanketed in ads, and ads involve tons of tiny, quick auctions and microtransactions that occur on each page load.
It is totally possible for a system to evolve involving tons of tiny transactions across page loads.
And, the whole internet is already micro-transactioned! Every page with ads is doing a bidding war and spending money on your attention. The only person not allowed to bid is yourself!
Clearly you don't have the lobes for business /s
But it's done through a bait and switch. They serve the full article to Google, which allows Google to show you excerpts that you have to pay for.
It would be better if Google showed something like PAYMENT REQUIRED on top; at least that way I'd know what I'm getting into.
I'm old enough to remember when that was grounds for getting your site removed from Google results - "cloaking" was against the rules. You couldn't return one result for Googlebot, and another for humans.
No idea when they stopped doing that, but they obviously have let go of that principle.
If pages can't be served for free, all internet content is at the mercy of payment processors and their ideas of "brand safety".
That content can't be served entirely for free doesn't mean that all content will require payment, and so be subject to issues with payment processors, just that some things may gravitate back to a model where it costs a small amount to host something (e.g. pay for home internet and host bits off that, or run a VPS out there that hosts tools and costs a few dollars a month or year). I pay for resources to host my bits & bobs instead of relying on services provided in exchange for stalking the people looking at them; this is free for the viewer, as they aren't even paying indirectly.
Most things are paid for anyway, even if neither the person hosting them nor I, looking at them, pay directly: adtech arseholes give services to people hosting content in exchange for the ability to stalk us and attempt to divert our attention. Very few sites/apps, other than play/hobby ones like mine or those from more actively privacy-focused types, are free of that.
It doesn't just apply to the web, it applies to literally everything that we spend money on via a third party service. Which is... most everything these days.
Curious to hear other perspectives here. Maybe I'm overreacting/misunderstanding.
https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-saving-...
I do believe we will end up there eventually; with emerging tech like Brazil's and India's payment architectures, it should be a possibility in the coming decades.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
Sadly development along these lines has not progressed. Yes, Google Cloud and other services may return it and require some manual human intervention, but I'd love to see _automatic payment negotiation_.
I'm hopeful that instant-settlement options like Bitcoin Lightning payments could progress us past this.
https://docs.lightning.engineering/the-lightning-network/l40...
https://hackernoon.com/the-resurgence-of-http-402-in-the-age...
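To make "automatic payment negotiation" concrete, the client-side flow could look something like this. HTTP 402 itself is real, but the header names and the wallet call below are made up; there is no agreed standard yet:

    import requests

    def fetch_with_payment(url, wallet, max_price_sats=50):
        resp = requests.get(url)
        if resp.status_code == 402:
            invoice = resp.headers.get("X-Payment-Invoice")       # hypothetical header
            price = int(resp.headers.get("X-Payment-Amount", 0))  # hypothetical header
            if invoice and price <= max_price_sats:
                receipt = wallet.pay_invoice(invoice)             # hypothetical wallet API
                resp = requests.get(url, headers={"X-Payment-Receipt": receipt})
        return resp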
The number of "verified" paying "users" with a blue checkmark that are just total LLM bots is incredible on there.
As long as spamming and DDOS'ing pays more than whatever the request costs, it will keep existing.
Whatever method is used by Cloudflare for detecting "threats" has nothing to do with consuming resources on the "protected" servers.
The so-called "threats" are identified in users who may make a few accesses per day to a site, transferring perhaps a few kilobytes of useful data on the viewed pages (besides whatever amount of stupid scripts the site designer has implemented).
So certainly Cloudflare does not meter the consumed resources.
Moreover, Cloudflare preemptively annoys any user who accesses a site for the first time, having never consumed any resources, perhaps based on irrational profiling of the browser, operating system, and geographical location used.
Your idea of microtransacting web requests would play into it and probably end up with a system like Netflix, where your ISP has access to a set of content creators to whom they grant 'unlimited' access as part of the service fee.
I’d imagine that accessing any content creators which are not part of their package will either be blocked via a paywall (buy an addon to access X creators outside our network each month) or charged at an insane price per MB as is the case with mobile data.
Obviously this is all super hypothetical, but weirder stuff has happened in my lifetime.
Because I as a user would be glad to have a "free sites only" filter, and then just steal content :))
But it's an interesting idea and thought experiment.
That seems pretty unreasonable.
Only this week I have witnessed several dozen cases where Cloudflare has blocked normal Web page accesses without any legitimate reason, and this besides the usual annoyance of slowing every single access to any page on their "protected" sites with a bot-check popup window.
"why are we cutting all the trees in the park?" really you want trees to fall on your kid and crushing them to death?? what's wrong with saving kids??
"why are we closing the water in the fountains in the town?" really you want your kids to drown into the fountains or drink contaminated water??
However, in the last few months, Cloudflare has become increasingly annoying. I suspect that they might have implemented some "AI" "threat" detection, which gives many more false positives than before.
For instance, this week I have frequently been blocked when trying to access the home page of some sites where I am a paid subscriber, with a completely cryptic message "The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.".
The only "action" that I have done was opening the home page of the site, where I would then normally login with my credentials.
Also, during the last few days I have been blocked from accessing ResearchGate. I may happen to hit a few times per day some page on the ResearchGate site, while searching for various research papers, which is the very purpose of that site. Therefore I cannot understand what stupid algorithm is used by Cloudflare, that it declares that such normal usage is a "threat".
The weird part is that this blocking happens only if I use Firefox (Linux version). With another browser, i.e. Vivaldi or Chrome, I am not blocked.
I have no idea whether Cloudflare specifically associates Firefox on Linux with "threats" or this happens because whatever flawed statistics Cloudflare has collected about my accesses have all recorded the use of Firefox.
In any case, Cloudflare is completely incapable of discriminating between normal usage of a site by a human (which may be a paying customer) and "threats" caused by bots or whatever "threatening" entities might exist according to Cloudflare.
I am really annoyed by the incompetent programmers who implement such dumb "threat detection solutions", which can create major inconveniences for countless people around the world, while the incompetents who are the cause of this are hiding behind their employer corporation and never suffer consequences proportional to the problems that they have caused to others.
Sometimes just refreshing the page seems to work too. Disabling the tracker blocking allows cross-site requests to Cloudflare endpoints which seems to be enough. Maybe worth allow-listing CF domains, but I didn't look into if that is possible yet.
is_using_vpn? -> bad,abuse,ddos
thanks, Cloudflare, for saving our internet by destroying it...
This exact same thing continues in 2025 with Windows Defender. The cheaper Windows Server VMs in the various cloud providers are practically unusable until you disable it.
You can tell this stuff is no longer about protecting users or property when there are no meaningful workarounds or exceptions offered anymore. You must use defender (or Cloudflare) unless you intend to be a naughty pirate user.
I think half of this stuff is simply an elaborate power trip. Human egos are fairly predictable machines in aggregate.
Plenty of site/service owners explicitly want Google, Meta and Apple bots (because they believe they have a symbiotic relationship with them) and don't want your bot because they view you as, most likely, parasitic.
I don't think it's fair to blame Cloudflare for that. That's looking at a pool of blood and not what caused it: the bots/traffic which predate LLMs. And Cloudflare is working to fix it with the PrivacyPass standard (which Apple joined).
Each website is freely opting into it. No one was forced. Why not ask yourself why that is?
The Big Tech bots provide proven value to most sites. They have also through the years proven themselves to respect robots.txt, including crawl speed directives.
If you manage a site with millions of pages, and over the course of a couple years you see tens of new crawlers start to request at the same volume as Google, and some of them crawl at a rate high enough (and without any ramp-up period) to degrade services and wake up your on-call engineers, and you can't identify a benefit to you from the crawlers, what are you going to do? Are you going to pay a lot more to stop scaling down your cluster during off-peak traffic, or are you going to start blocking bots?
Cloudflare happens to be the largest provider of anti-DDoS and bot protection services, but if it wasn't them, it'd be someone else. I miss the open web, but I understand why site operators don't want to waste bandwidth and compute on high-volume bots that do not present a good value proposition to them.
Yes this does make it much harder for non-incumbents, and I don't know what to do about that.
https://www.robotstxt.org/faq/what.html
I wonder if Cloudflare users explicitly have to allow Google or if it's pre-allowed for them when setting up Cloudflare.
Despite what Cloudflare wants us to think here, the web was always meant to be an open information network, and spam protection should not fundamentally change that characteristic.
But at the end of the day it's up to the site operator, and any server or reverse proxy provides an easy way to block well-behaved bots that use a consistent user-agent.
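For example, at the application layer it can be as little as this (the bot names are just illustrative; in practice you'd usually do it in the reverse proxy config instead):

    BLOCKED_AGENTS = ("PerplexityBot", "SomeOtherBot")  # illustrative names

    def block_bots(app):
        # Minimal WSGI middleware: return 403 for any blocked user-agent substring.
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bot in ua for bot in BLOCKED_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Bots are not allowed here.\n"]
            return app(environ, start_response)
        return middleware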
They provide value for their companies. If you get some value from them, it's just a side effect.
1) It takes your query and, given its complexity, might expand it to several search queries using an LLM ("rephrasing").
2) It runs queries against a web search index (I think it was using Bing or Brave at first, but they probably have their own by now), and uses an LLM to decide which are the best/most relevant documents. It starts writing a summary while it dives into sources (see next).
3) If necessary it will download full source documents that popped up in search to seed the context when generating a more in-depth summary/answer. They do this themselves because using OpenAI to do it is far more expensive.
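Roughly, the pipeline described above would look something like this (the llm and search_index objects are placeholders, not Perplexity's actual code):

    import requests

    def answer(query, llm, search_index):
        sub_queries = llm.rephrase(query)            # 1) expand the query with an LLM
        hits = [hit for q in sub_queries for hit in search_index.search(q)]
        relevant = llm.rank(query, hits)             # 2) keep only the most relevant results
        pages = [requests.get(h.url).text for h in relevant]  # 3) fetch full source documents
        return llm.summarize(query, pages)           #    to seed the in-depth answer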
#3 is the problem. Especially because SEO has really made it so the same sites pop up on top for certain classes of queries (for example, Reddit will be on top for product reviews a lot). These sites operate on ad revenue, so their incentive is to block. Perplexity does whatever it can in the game of sidestepping the sites' wishes. They are a bad actor.
EDIT: I should also add that Google, Bing, and others, always obey robots.txt and they are good netizens. They have enough scale and maturity to patiently crawl a site. I wholeheartedly agree that if an independent site is also a good netizen, they should not be blocked. If Perplexity is not obeying robots.txt and they are impatient, they should absolutely be blocked.
Why is it okay for me to ask my browser to do this but I can’t ask my LLM to do the same?
When ChatGPT reads a review website, though? Zero ad clicks, zero affiliate links.
Am I misunderstanding something? I (the site owner) pay Cloudflare to do this. It is my fault this happens, not Cloudflare's.
I've only ever seen a Cloudflare interstitial when viewing a page with my VPN on, for example -- something I'm happy about as a site owner and accept quite willingly as a VPN user knowing the kinds of abuse that occur over VPN.
Monopolistic is the wrong word, because you have the problem backwards. Cloudflare isn't helping Apple/Google... It's helping its paying customers, and those are the only services those customers want to let through.
Do you know how I can predict that AI agents, the sort that end users use to accomplish real tasks, will never take off? Because the people your agent would interact with want your EYEBALLS for ads, build anti patterns on purpose, want to make it hard to unsubscribe, cancel, get a refund, do a return.
AI that is useful to people will fail. For the same reason that no one has great public APIs any more. Because every public company's real customers are its stockholders, and the consumers are simply a source of revenue. One that is modeled, marketed to, and manipulated, all in the name of returns on investment.
I was recently working on a project where I needed to find out the published date for a lot of article links, and this came in handy. Not sure if it's changed recently, but asking ChatGPT, Gemini, etc. didn't work; they said they don't have access to current websites. However, asking Perplexity, it fetched the website in real time and gave me the info I needed.
I do agree with the rest of your comment that this is not a random robot crawling. It was doing what a real user (me) asked it to fetch.
You say "shouldn't" here, but why?
There seems to be a fundamental conflict between two groups who each assert they have "rights":
* Content consumers claim the right to use whatever software they want to consume content.
* Content creators claim the right to control how their content is consumed (usually so that they can monetize it).
These two "rights" are in direct conflict.
The bias here on HN, at least in this thread, is clearly towards the first "right". And I tend to come down on this side myself, as a computer power user. I hate that I cannot, for example, customize the software I use to stream movies from popular streaming services.
But on the other hand, content costs money to make. Creators need to eat. If the content creators cannot monetize their content, then a lot of that content will stop being made. Then what? That doesn't seem good for anyone, right?
Whether or not you think they have the "right", Perplexity totally breaks web content monetization. What should we do about that?
(Disclosure: I work for Cloudflare but not on anything related to this. I am speaking for myself, not Cloudflare.)
It'd likely be a fantastic good if "content creators" stopped being able to eat from the slop they shovel. In the meantime, the smarter the tools that let folks never encounter that form of "content", the more they will pay for them.
There remain legitimate information creation or information discovery activities that nobody used to call "content". One can tell which they are by whether they have names that predate SEO, like "research" or "journalism" or "creative writing".
Ad-scaffolding, which is what the word "content" came to mean, costs money to make, ideally less than what the ads it provides a place for generate. This simple equation means the whole ecosystem, together with the technology attempting to perpetuate it as viable, is an ouroboros, eating its own effluvia.
It is, I would argue, undetermined that advertising-driven content as a business model has a "right" to exist in today's form, rather than any number of other business models that sufficed for millennia of information and artistry before.
Today LLMs serve both the generation of additional literally brain-less content, and the sifting of such from information worth using. Both sides are up in arms, but in the long run, it sure seems some other form of information origination and creativity is likely to serve everyone better.
If they want the RSS feeds to be accessible then they should configure it to allow those requests.
Anyone circumventing bans is doing something shitty and illegal; see the Computer Fraud and Abuse Act and Craigslist v. 3Taps.
"And those LLMs didn't ask anyones permission to crawl the entire 'net."
False. OpenAI respects robots.txt, doesn't mask IPs, and paid a bunch of money to Reddit.
You either side with the law or with criminals.
You can't even say the same thing about OpenAI, because we don't know the corpus they train their models on.
We're seeing many posts about site owners getting hit by millions of requests because of LLMs; we can't blame Cloudflare for this because it's literally a necessary evil.
Sure, the internet should be open and not trusted. But physical reality exists. Hosting and bandwidth cost money. I trust Google won't DDoS my site or cost me an arbitrary amount of money. I won't trust bots made by random people on the internet in the same way. The fact that Google respects robots.txt while Perplexity doesn't tells you why people trust Google more than random bots.
Google already has access to any webpage because its own search crawlers are allowed by most websites, and Google crawls recursively. Thus Gemini has the advantage of this synergy with Google Search. Perplexity does not crawl recursively (I presume; therefore it does not need to consult robots.txt), and it doesn't have synergies with a major search engine.
So you just came here to bitch about Cloudflare? It's wild to even comment on this thread if this does not make sense to you.
They're building a search index. Every AI is going to struggle at being a tool to find websites & business listings without a search index.