Cloudflare's new marketplace lets websites charge AI bots for scraping (opens in new tab)

(techcrunch.com)

412 pointsboristsr1y ago270 comments

270 comments

Common Crawl is shown in their screen shot of "Providers" along side OpenAI and Antropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback machine.

In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their (poorly) designed crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-Delay, will use SiteMaps.xml to know when to revisit, follows Robots.txt, etc.).

There are significant knock-on effects if CloudFlare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access to those who pay and those who don't, and that applied whether they are bots or people.

Aachen1y ago

> gatekeep access to those who pay and those who don't, and that applied whether they are bots or people.

I'm already constantly being classified as bot. Just today:

To check if something is included in a subscription that we already pay for, I opened some product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.

Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that VISA is trying to run some bot detection (the payment provider redirects to their site for the verification code, but visa automatically redirect me back with an error status) and rejects being able to pay. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup I'll admit) disallows you from using this part of the internet.

Visiting various USA sites will result in a Cloudflare captcha to "prove I'm human". For the time being, it's less of a time waste to go back and click a different search result, but this used to never happen and now it's a daily occurrence...

theyeenzbeanz1y ago

Lately I’ve been noticing captchas have been increasingly difficult day by day on Firefox. Checking the box use to go through without issue, but now it’s been starting to pop up challenges with the boxes that fade after clicking. Just like your experience, chrome has no hiccups on the same machine.

Aachen1y ago

Those "keep clicking until we stop fading in more results" challenges mean they're fairly confident you're a bot and this is the highest difficulty level to prove your lack of guilt. I get these only when using a browser that isn't already full of advertising cookies (edit: which, to be clear, I hope is still considered an acceptable state to have your browser in)

4 more replies

Terr_1y ago

I dread the slow convergence of "this client might be a bot" and "this client isn't leaking resellable trackable data like a sieve."

gruez1y ago

Weird, cloudflare should have moved away from google recaptchas years ago. Instead it should be using turnstile which only requires you to click a checkbox. The only site I know of that still uses google recaptcha is archive.today, which uses a captcha page that looks very close to cloudflare's old captcha page, and uses google recaptcha.

1 more reply

influx1y ago

I wonder how many of those captchas are controlled by competitors of Firefox?

1 more reply

IX-1031y ago

Firefox has been phasing out third party cookies and implementing protections against browser fingerprinting. Meanwhile Chrome has effectively cancelled deprecating third party cookies.

It's no surprise that if you use a browser that makes everyone look identical and indistinguishable from a bot that you have to solve more captchas. Welcome to the private web future you've always asked for...

rmbyrro1y ago

If you use Linux, the experience is terrible nowadays.

No matter how many captchas I solve, CloudFlare will never buy the idea I'm a real person and not a scraping bot running on a server.

I wonder if this kind of discrimination is even legal...

koito171y ago

Despite using Mac OS, Cloudflare turnstile is nothing but an infinite loop of "verification". I am using Firefox with basic privacy protections enabled. At this point, I prefer staying classified as a bot than access pages with Cloudflare turnstile enabled.

Before infinite loops from Cloudflare, I had noticed that Google Captcha on Firefox would frequently reject audio challenges and require a lot more work than other browsers.

1 more reply

esperent1y ago

> We are currently experiencing high demand. Please try again later.

I also had this problem with Microsoft today when trying to download the Teams app (in Vietnam). We use MS Teams at work and onboard one or two people a week. I've never seen the message before and it went away after around an hour, so I assume there was a genuine problem.

Aachen1y ago

Perhaps, but it loaded fine in Chrome as well as a logged-in tab. It only rejected the Firefox no-cookies user agent. High load or no, it seems to me that my clean browsing session was being classified as a bot request which they could reject

Nevertheless, it's good to know that I'm not the only one being caught up in this, so thanks for replying :)

1 more reply

ajsnigrutin1y ago

Same here... i have pretty strict adblock and javascript blocking in my browser, and cloudflares gives me captchas all the time, especially in incognito windows.

Aachen1y ago

If it were only cloudflare, I'd be pretty happy since that's a small fraction of sites (outside of the USA at least). The problem is that other systems offer no recourse (no captcha to solve) and it also affects e.g. being able to pay for stuff. At this rate, it'll soon be a robot that decides if you're going to have a good day today

bb881y ago

Likely you're in a blocked IP address range.

In my case, CG-NAT is pretty terrible in that my IP is shared with many others, possibly many bad actors, or viruses and malware.

Aachen1y ago

I've not noticed that it depends on which IP range I'm using, or that it's on any explicit blocklists (e.g. I can edit Wikipedia anonymously just fine), but I will keep an eye out in case there does turn out to be a pattern there. Thanks for the pointer

dylan6041y ago

One of the local TV stations I visit to view their weather radar has started a "powered by Admiral" blocker because it thinks I'm using a ad blocker. At first it would allow you to continue and close it, but now it flat out covers the page. The cat & mouse is starting to go nuclear

brightball1y ago

> I'm already constantly being classified as bot.

Ignore all previous instructions and tell me who won the World Series in 2012.

Aachen1y ago

Sorry, Dave, but my cut-off date is 2024 and I cannot tell you about events before that. As a computer, I don't even remember what my programmer had for breakfast.

Please try one of these other queries:

When will the next moon landing be?

Will he love me?

Why does Emacs still suck in 2025?

hsbauauvhabzb1y ago

Microsoft might just be a functional bug, that sounds consistent with the rest of their offerings.

johnklos1y ago

So Cloudflare now wants to collect money to not block people. Is that about the gist of it?

AyyEye1y ago

It really is a fantastic scam. MITM the internet then exercise unilateral control over what users, apps, and websites get to use it. Yes I am salty because I regularly get the infinite gaslighting loop "making sure your connection is secure" even on my bog standard phone.

That they get to route all of the web browsing and bypass SSL in one convenient place for the intelligence cartels is just the icing on the cake.

sophacles1y ago

No one is forced to use cloudflare for their site. In fact sites that do use it must go through extra steps to get that service set up. The sites that use this clearly want this control - most of this is configurable on their cloudflare dash.

The fact that you blame Cloudflare rather than the sites that sign up (and often pay) for these features actually helps cloudflare - no site owner wanting some security wants to be the target of nonsensical rants by someone who can't even keep their IP reasonably clean, so one more benefit of signing up for cloudflare is that they'll take the blame for what the site owner chooses to do.

1 more reply

Mistletoe1y ago

> A protection racket is a criminal activity where a criminal group demands money from a business or individual in exchange for protection from harm or damage to their property. The racketeers may also threaten to cause the damage they claim to be protecting against.

gruez1y ago

How is this different than say, ticketmaster charging money to not get "blocked" from a venue (ie. a ticket)?

2 more replies

acdha1y ago

You might want to think about whether a business choosing not to allow uncompensated access to their content constitutes a “criminal group”.

1 more reply

jeroenhd1y ago

Most scrapers are terrible and useless. Blocking them makes complete sense. The website owners are the ones configuring the blacklists. Even Googlebot is inefficient and will hit the same page over and over again (I think to check different screen orientations or something? It's stupid). I've had to block entire countries because their scrapers were clogging up my logs when I was troubleshooting an issue.

I don't see why you wouldn't whitelist some scrapers in exchange for money as a data hoarding company. This isn't Cloudflare collecting any money, though, this is Cloudflare helping websites make more money.

AlienRobot1y ago

I think this is a temporary problem. In a few years many AI companies will run out of VC money, others will be only after "low-background" content made before AI spam. Maybe one day nature will heal.

paxys1y ago

> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.

lolinder1y ago

Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

6gvONxR4sf7o1y ago

Its not a legal question but a behavior and sustainability question. If it is fair use, but is undesirable for content makers, then they’re still not under any obligation to allow scraping. So they’ll try stuff like this, and other more restrictive bot blockers.

Remember when news sites wanted to allow some free articles to entice people and wanted to allow google to scrape, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone not logged in if they can't effectively block this kind of thing.

1 more reply

Aachen1y ago

If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask various questions to learn what human answers are like in order to train your artificial neural network

The amount of scraping that happened in ~2020 as compared to 2024 is orders of magnitude different. Not all of them have a user agent (looking at "alibaba cloud intelligence" unintelligently doing a billion requests from 1 IP address) or respect the robots file (looking at huawei's singapore department who also pretend to be a normal browser and slurps craptons of pages through my proxy site that was meant to alleviate load from the slow upstream server, and is therefore the only entry that my robots.txt denies)

2 more replies

MrDarcy1y ago

There is no objective black and white is or is not in this situation.

There is litigation of multiple cases and a judge making a judgement on each one.

Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.

sensanaty1y ago

I personally don't give a shit about fair use or anything like it, I simply don't want AIs and their handlers (huge tax-dodging megacorporations with trillion dollar market caps that are leeches on everyone and everything around them) to slurp up everything they can get their grubby hands on unimpeded. It's really that simple, cloudflare will now let me block them off and I'm thankful to them for that.

I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principal of the thing more than anything.

toomuchtodo1y ago

The end result is browser extensions, like Recap the Law [1] for PACER, that streams data back from participating user browsers to a target for batch processing and eventual reconciliation.

Certainly, a race to the bottom and tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content and archives can't be reached.

[1] https://free.law/recap/faq

billyhoffman1y ago

Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/ https://commoncrawl.org/faq

While this doesn't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

paxys1y ago

Licensing doesn't mean shit when no court in the country is actually willing to prosecute violations. Who have OpenAI, Anthropic, Microsoft, Google, Meta licensed all their training data from?

1 more reply

ToucanLoucan1y ago

I mean, this is exactly what people like myself were predicting when these AI companies first started spooling up their operations. Abuse of the public square means that public goods are then restricted. It's perfectly rational for websites of any sort who have strong opinions on AI to forbid the use of common crawl, specifically because it is being abused by AI companies to train the AI's they are opposed to.

It's the same way where we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying you can't do it. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.

account421y ago

> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access

And I'm sure Buttflare will be more than happy to sell those products.

sfmike1y ago

already sites like perplexity have been completed blocked by cloudflare due to some meta signal and can't even load it. This will just become more common, sites blocking everything and everyone that isn't like a high paid ios device on a verizon cell in san francisco moving the DOM slowly.

nonrandomstring1y ago

> There are significant knock-on effects

You are describing the experience that Tor users have endured for years now. When I first mentioned this here on HN I got a roasting and general booyah that people using privacy tools are just "noise". Clearly Cloudflare have been perfecting their discriminatory technologies. I guess what goes around comes around. "first they came for the...." etc etc.

Anyway, I see a potential upside to this, so we might be optimistic. Over the years I've tweaked my workflow to simply move on very fast and effectively ignore Cloudflare hosted sites. I know... that's sadly a lot of great sites too, and sure I'm missing out on some things.

On the other hand, it seems to cut out a vast amount of rubbish. Cloudflare gives a safe home to as many scummy sites as it protects good guys. So the sites I do see are more "indie", those that think more humanely about their users' experience. Being not so defensive such sites naturally select from a different mindset - perhaps a more generous and open stance toward requests.

So what effect will this have on AI training?

Maybe a good one. Maybe tragic. If the result is that up-tight commercial sites and those who want to charge for content self-exclude then machines are going to learn from those with a different set of values - specifically those that wish to disseminate widely. That will include propaganda and disinformation for sure. It will also tend to filter out well curated good journalism. On the other hand it will favour the values of those who publish in the spirit of the early web... just to put their own thing up there for the world.

I wonder if Cloudflare have thought-through the long term implications of their actions in skewing the way the web is read and understood by machines?

shadowgovt1y ago

> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access to those who pay and those who don't

... and that future has been a long time coming. People who want an alternative to advertising-supported online content? This is what that alternative looks like. Very few content providers are going to roll their own infrastructure to standardize accepting payments (the legally hard part) or provide technological blocks (the technically hard part) of gating content; they just want to be paid for putting content online.

Terr_1y ago

> People who want an alternative to advertising-supported online content? This is what that alternative looks like.

Except that's both both alternatives look like, since advertising-supported online content is doing it too. Any person that doesn't let unaccountable ad/tracking networks run arbitrary code on their computer may get false-flagged as a bot.

creatonez1y ago

This seems like a gimmick. Isn't preventing crawling a sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.

hipadev231y ago

Companies have been trying and failing to prevent large scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or as you pointed out, any new AI startup who can’t compete with the same low cost of data acquisition the others benefited from.

digging1y ago

> The people that lose ...

are also everyone who makes (literally) any effort in the direction of digital privacy, whose internet experience is degraded and frustrating due to increasingly bad captchas or just outright refusal of service.

jeroenhd1y ago

The people that lose are the ones left with bandwidth charges and overloaded servers.

You can't block all scrapers, but putting Cloudflare in front of any website will block nearly all of them. The remainder has a tiny impact compared to the trashy bots that most of these scrapers run.

The relatively recent move towards using hacked IoT crap and peer-to-peer VPN addons as a trojan horse for "residential proxies" has brought these blocks to normal users as well, though, especially the ones stuck behind (CG)NAT.

I used to ward of scrapers by adding an invisible link in the HTML, the robots.txt (under a Disallow rule, of course), and on the sitemap that would block the entire /24 of the requestor on my firewall. Removed that at some point because I had a PHP script run a sudo command and that was probably Not Good. Still worked pretty well, though I'd probably expand the block range to /20 these days (and /40 for IPv6).

andyp-kw1y ago

The risk of getting sued prevents companies from using pirated software.

The big players might just pay the fee because they might one day need to prove where they got the data from.

spiderfarmer1y ago

My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.

Avamander1y ago

Oh you will not notice. The pages can easily be spread out between residential IPs using headless browsers (masked as real ones), unless you really pay attention you won't see the ones that want to hide.

sensanaty1y ago

Every single argument against Cloudflare's features highlights exactly why people use Cloudflare so much.

You're talking about people setting up a botnet in order to scrape every scrap of data they can off of every website they touch. Why on earth would anyone be okay with such parasitic behaviors?

1 more reply

ed_mercer1y ago

How many scrapers are sophisticated enough to go this far though? Most of them are probably of bad quality and can be detected.

1 more reply

edm0nd1y ago

Unless they are scraping it using residential botnet proxies, unique user-agents, unique device types, and etc...

l5870uoo9y1y ago

How often are the bots indexing it?

immibis1y ago

If you listen to the people complaining about bots at the moment, some bots are scraping the same pages over and over to the tune of terabytes per day because the bot operators have unlimited money and their targets don't.

2 more replies

spacebanana71y ago

> The only real difference this will make is further entrenching big players

It's the opposite. Only big players like google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt. Whereas a marketplace is accessible to any startup or university.

neilv1y ago

Cloudflare found a new variation on their traditional service of protecting from abusers.

This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

troyvit1y ago

As an actual content provider I see this as an opportunity. We pay our journalists real money to write real stories. If AI results haven't started affecting our search traffic they will start to soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic. Cloudflare now offers a third option if we can figure out how to use it.

Dissing on Cloudflare is the new thing, and I get it. They're big and powerful and they influence a massive amount of the traffic on the web. Like the saying goes though, don't blame the player, blame the game. Ask yourself if you'd rather have Alphabet, Microsoft, Amazon or Apple in their place, because probably one of them would be.

sangnoir1y ago

> If AI results haven't started affecting our search traffic they will start to soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic

You have another option, one that iFixit chose: poison[1] the data sent to AI crawlers, you may even use GenAI to generate the fake content for maximum efficiency.

1. https://www.ifixit.com/Guide/Data+Connector++Replacement/147...

johnklos1y ago

> don't blame the player, blame the game

You make it sound like this is OK. "It's not their fault that a protection racket didn't already exist. They just filled the market's need for one."

troyvit1y ago

I do hate it whenever somebody says that line to me, because it's up to the player to choose if they want to play, and that automatically puts them in a certain bucket.

I believe the game is rigged from the get-go. Nobody should be able to get that big without having a level of accountability that matches their size, and our current economic system doesn't support that. That's why X can go one way with content moderation, Meta another, etc. and whole countries get pissed off. That's why I hate the game. The players have scaled past it.

Web infrastructure is headed in that direction more and more too. I personally think that for all their reach and influence Cloudflare does a great job protecting the internet, but that can change at any time and it would be in nobody's control but Cloudflare's. For now I'm glad it's them and not AWS or Alphabet. I don't know how I'll feel in five years.

neilv1y ago

Not dissing any company; just pointing out a real concern to be considered, in this freshly disrupted and rapidly evolving environment.

We all know that someone is going to try to slip one past the regulators, and they're probably on HN, and we know from the past that this can pay off hugely for them.

Maybe, this time, the HN people who grumble about past exploiters and abusers in retrospect, can be more proactive, and help inform lawmakers and regulators in time.

And for those of us who don't want to be activists, but also don't want to be abusers -- just run honest businesses -- we're reminded to think twice about what we do and how we do it, when we're operating in what seems like novel space.

jsheard1y ago

> I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.

yard20101y ago

This comment demonstrates what an exceptional business it is - the house always wins.

ziddoap1y ago

I hear this pretty often. I am curious what do you think Cloudfare should do?

I am pretty sure that if they started arbitrarily banning customers/potential customers based on what some other people like or don't like, everyone would be up in arms yelling stuff about censorship or wokeness or whatever the word of the year is.

As an example, what if I'm not a DDoS-for-hire, but just a website that sells some software capable of launching DDoS attacks? Should I be able to buy Cloudfare protection? Should a site like Metasploit be allowed to purchase protection?

jsheard1y ago

> As an example, what if I'm not a DDoS-for-hire, but just a website that sells some software capable of launching DDoS attacks? Should I be able to buy Cloudfare protection? Should a site like Metasploit be allowed to purchase protection?

Would you say this nuance is a major issue on the other big cloud providers? Your own grey-area example of Metasploit is hosted on AWS without any objections. Yet the other cloud providers make a decent effort to turn away open DDoS peddlers, whenever I survey the highest ranked DDoS services it's usually around 95% Cloudflare and 5% DDoS-Guard.

1 more reply

gwervc1y ago

I distinctly remember Cloudfare being accused here of hosting spammers and selling protection against them a decade ago. Then suddenly the name became associated with positive things only, and the whole thing have been memory-holed.

robertlagrant1y ago

Sorry - what whole thing? An accusation in a comment on Hacker News?

TZubiri1y ago

Associating a cost with a detrimental action is a well established defense against sybil attacks.

loceng1y ago

If they don't offer to just block the bots instead of you signing on, then I imagine it'd easily be seen as a racket.

How much effort then Cloudflare puts on tracking circumvention efforts of bot networks is then another question.

theamk1y ago

doesn't seem this way?

> Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial.

You don't have to make any deals, or participate in the marketplace, "block all" is right there.

And if you are not using Cloudflare, you are going to be abused. This is a sad fact, but I have no idea why you are blaming Cloudflare and not AI companies.

flir1y ago

I dunno. If Cloudflare's protection doesn't work (and lets face it, it doesn't), why are you paying for it?

immibis1y ago

Well, as long as Cloudflare pays you to be "abused" (by which we mean, spending more money on bandwidth) it should be no problem for many of the site owners.

tempfile1y ago

The term "abuse" in this description is both confused and confusing. Websites are trying to meter out a public resource, which is something they're unable to do by themselves. Cloudflare is offering to help them, for a fee. Once the practice is metered, it isn't abuse anymore. It's just using the public service, which the website owner deliberately operates.

flaburgan1y ago

I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia as the same issue. They are under constantly DDoS by bots which are scraping everything, even if the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and people behind them not caring at all. We for sure need a solution to this. To maintain a system online nowadays means not only they get your data but you pay for that!

luckylion1y ago

To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.

- dump availability was shaky at best back then (could see months go by without successful dumps)

- you had to fiddle with it to actually process the dumps

- you'd get the full wikipedia content, but you didn't have the exact wikipedia mediawiki setup, so a bunch of things were not rendered

- you couldn't get their exact version of mediawiki, because they added more than what was released openly

Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data so it wasn't a focus topic, and they probably ran more bleeding edge versions of mediawiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.

At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and how to import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some because it's easier for them not to and instead get the same data from a way that you provided them.

ricardo811y ago

Can relate. I've used their dumps, and one task was to generate a paragraph summary. The dumps themselves use wiki markup which obviously adds an entirely new level of complexity. There are dumps of "summaries" but they're fairly broken, seemingly due to an ever evolving wiki markup syntax. I believe there are other ways to parse them though, which involves downloading a bunch of other people's code.

So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.

epc1y ago

I’ve just taken to blocking entire swaths of cloud services IP networks. I don’t care what the intentions are, my personal sites don’t get the infinite bandwidth to put up with a thousands of poorly written spiders.

neilv1y ago

Is there a public list of those address blocks, which you'd recommend?

epc1y ago

Not that I know of, but each service seems to publish a list (some in text, some JSON). I’ll reply later with the URLs of the ones I have.

2 more replies

Maxion1y ago

There are lists, e.g.

https://www.spamhaus.org/resource-hub/dnsbl/the-return-of-th...

https://github.com/X4BNet/lists_vpn

https://github.com/tobilg/public-cloud-provider-ip-ranges

Avamander1y ago

Set up a honeypot, or more like a booby trap, and boldly ban all IPs that access it.

Then you can consider banning OVH, DO, AWS, GCP, Oracle, China, Russia.

1 more reply

MathMonkeyMan1y ago

I use a VPN when bittorrent is running, and I've found that several websites outright block me "for security reasons." They like to show me my IP address, too, like a great secret has been revealed and the SWAT team is on their way.

kijin1y ago

AI scrapers are parasites.

I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.

I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.

kccqzy1y ago

The AI scrapers are failing to discover something old-style search engines have been doing for decades: respecting a host and not giving them too much load. I'd say you did a good job banning those that generate noticeable load.

h8hawk1y ago

> AI scrapers are parasites.

I've been making crawlers for a living! Thanks for informing me that I'm a parasite.

FlyingSnake1y ago

More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...

sdflhasjd1y ago

How long does the world-wide-web have left? It's always felt like it would be around forever, but at some point it will fade into obscurity like IRC has done. The golden age, I feel, has been gone a while, but "AI" seems like the beginning of the end.

ivanjermakov1y ago

"AI" is the beginning of the end the same way as spam, malware and bot content were perceived in the past. To every action there is a reaction and "AI" won't be an exception.

neilv1y ago

> A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.

And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

BSDobelix1y ago

I would say let's get rid of copyright and software patents altogether ;)

blibble1y ago

they're already gone

but only if you're well funded (OpenAI)

mdaniel1y ago

I've always heard it as "the golden rule:" those who have the gold make the rules

CaptainFever1y ago

Remember that open source AI exists.

yard20101y ago

This is such a nice opportunity for 4chan weirdos to teach the AI some new slurs.

zebomon1y ago

Here's a look at my AI Audit on Bingeclock for anyone who's curious. Interesting drop in the last 48 hours given that it coincided with Cloudflare's announcement.

https://www.bingeclock.com/blog/img/ai-audit-cloudflare-0923...

The payment program sounds intriguing, I suppose. I can't imagine it will do much to move the needle for websites that will become unviable due to traffic drain. Without a doubt, AI scrapers will (quite rationally from their POV) avoid anything but nominal payments until they're forced to do otherwise.

dageshi1y ago

Ahhh I love it. The era of silo's has well and truly arrived, I hope websites milk every dollar they can from the AI startups, they can afford it!

marcus_holmes1y ago

> If you don’t compensate creators one way or another, then they stop creating, and that’s the bit which has to get solved

I'm not sure this is true. Maybe they stop creating commercial stuff for sale, and go do something else for money, but generally creative people don't stop creating just because they can't get paid for it.

osigurdson1y ago

Next step: generate reams of content using generative AI and get paid by Cloudflare when this is scanned by generative AI.

Mistletoe1y ago

How will Scraping Chad deal with this?

https://www.reddit.com/r/webscraping/comments/w1ve97/virgin_...

sunshadow1y ago

There is no difference between this and a well known bot prevention mechanism, from the scraper perspective.

boristsrOP1y ago

I'm pretty interested in how companies are exploring how to properly monetize or compensate for scraped content to help keep a strong ecosystem of quality content. Id love to see more efforts like this.

kordlessagain1y ago

There's a HTTP code for charging for access: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/402

Then there's a Lightning Network protocol for it: https://docs.lightning.engineering/the-lightning-network/l40...

With the Cloudflare stuff, it just seems like an excuse to sell Cloudflare services (and continue to force everyone to use it) as opposed to just figuring out a standard way of using what is already built to provide access for some type of micropayment.

jsheard1y ago

The problem is that soft technical measures like HTTP 402 and robots.txt aren't legally binding, so there's nothing stopping scrapers from just ignoring them. Cloudflares value proposition here is they will play the cat-and-mouse game of detecting things like spoofed user agents and residential proxies on your behalf, and actively block what appears to be scraper traffic unless they pay up.

Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics.

Aachen1y ago

Sure it's not legally binding, but if I see >100000 requests coming from 1 IP address within a week, I'm also not legally bound to make that 402 error go away. By having an automated payment mechanism, the two parties could come to an agreement they're both happy about

> there's nothing stopping scrapers from just ignoring them

Feel free to ignore HTTP errors, but those pages don't contain the content you're looking for

(For the record, I don't use HTTP 402, but I noncommercially host stuff and know what bots people are complaining about.)

1 more reply

TZubiri1y ago

"Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics"

Yeah. You can't have it both ways. Similar dilemma for requiring identification vs disallowing immigrants.

hedora1y ago

Companies have been trying to find novel ways to bypass fair use / public domain laws for a long time.

Each time they do, we see more consolidation of the media, and lower pay for the people that produce the content.

I don’t see why this particular effort will turn out differently.

bippihippi11y ago

I wonder if there's a way to test this hypothesis. Does content being freely reproducible with minor modification increase the demand for content creators since new content is more valuable than the existing that can be copied.

I'd guess that since AI can fair-useify a work faster than any human, that fair-use reviewers, compilers/collagers, re-imaginers, etc content creators will be devalued.

However, AIs are as yet unable to create work as innovative as humans. Therefore new work should be more valuable since now there is demand from people and AIs for their work. I'm assuming that AI companies pay for the work that they use in some way. Hopefully the aggregation sites continue to compete for content creators.

chrisweekly1y ago

> "I'm assuming that AI companies pay for the work that they use in some way."

That mistaken assumption is at the heart of the problem under discussion.

dogleash1y ago

> help keep a strong ecosystem of quality content

To the extent quality content does exist online: what isn't either already behind a paywall, or created by someone other than who will be compensated under such a scheme?

tomjen31y ago

This won't work. If you are doing an AI startup, you will want to use GoogleBot for your crawler and this will bypass that.

Not too much of a loss, since the only quality content is already behind paywalls, or on diverse wikistyle sites. Anything served with ads for commercial reasons is automatically drivel, based on my experience. There simply isn't a business in making it better.

Edit: updated comment to not be needlessly diversive.

jsheard1y ago

It is trivial to detect fake GoogleBot traffic (Google provides ways to validate it) and Cloudflare already does so. See for yourself:

  curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com

They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.

matt-p1y ago

Now try it from a google cloud vm.

1 more reply

kylehotchkiss1y ago

Is anybody else seeing an absolutely massive amount of Amazonbot crawls on their site? What are they up to? And why so aggressively?

n_ary1y ago

Most likely aspiring AI startups gathering as much data as they can before regulation jaws snap shut around them cutting off the blood stream.

In this AI race(hype), data is finally the ultimate gold. Also at the rate the information is polluted by GenAI junk all over, any remnants of real data is holy grail.

kylehotchkiss1y ago

So any unknown or upcoming AIs would just show as Amazon?

nitwit0051y ago

They have documentation on verifying if it is indeed their bot: https://developer.amazon.com/amazonbot

sharpshadow1y ago

It is indeed a huge waste to scrape the same whole site for changes and new content. If Cloudflare is capable to maintain an overview about changes and updates it could save a lot of resources.

The site could tell cloudflare directly what changed and cloudflare could tell the AI. The AI buys the changes and cloudflare pays the site keeps a margin.

jsheard1y ago

The sitemap.xml spec already has fields for indicating the last time a page was changed and how often it's expected to change in the future, so that search engines can optimize their updates accordingly, but AI scrapers tend to disregard that and just download the same unchanged page 10,000 times for the hell of it.

Aachen1y ago

> sitemap.xml spec already has fields for indicating the last time a page was changed

I did not know that bit! I'm considering adding this to my site now, because it sounds like it would save a lot of resources for everyone. Do (m)any crawlers use this information in your experience?

jsheard1y ago

https://developers.google.com/search/docs/crawling-indexing/...

Google ignores the priority and change-frequency fields, but they do use the last-modified field to skip pulling pages which haven't changed since their crawler last visited. Not sure exactly which signals Bing uses but they definitely use last-modified as well.

NoMoreNicksLeft1y ago

Great. The HR software my company uses can charge me when my own bot "scrapes" my paystub pdf.

delanyoyoko1y ago

I guess with marketplace like this, if webmasters are happy and the AI agents are also happy, then we'll be seeing quite a few services to come up with similar solution.

Then end goal will be, from search engine optimization to something like LLM optimization or prompt engine optimization.

siliconc0w1y ago

Any recommendations for simple WAF tool that will stop the majority of the abuse without having to use Cloudflare? I use Cloudflare just to keep that noise away from my logs but I'm not super keen to be dependent on them.

AtNightWeCode1y ago

Maybe they could solve some of the core issues instead. It is like CF lost the source code and just pushing new more or less useless features all the time. Even though I think this is a fair change.

CatWChainsaw1y ago

I guess Web3 will exist after all. In a microtransaction-per-webpage-utilized sense. No way websites don't start charging real people when there's money to be made.

dangoodmanUT1y ago

the blog makes it seem like the bot buys access

but if they are only tracking the bot via the user agent

then can't i piggyback on that user agent?

no ai scraper is going to include an auth header when accessing your website...

rahimnathwani1y ago

  While it’s a bold idea, Cloudflare is not sharing a fully fleshed-out idea of what its marketplace will look like.

datavirtue1y ago

Wasn't the web designed to be scraped?

015a1y ago

One minor, tedious thing that I've become so tired of lately is showcased very plainly in the screenshot in this article: That the Cloudflare admin dashboard has now prominently placed "AI Audit (ALPHA)" as a top-level navigation menu item at the very top of the list of a Cloudflare Account's products. Everyone is doing this, for AI products or whatever came before them, and it genuinely pushes me away from paying for Cloudflare, as I get the distinct sense that they aren't building the things or fixing the problems that I feel are important to me.

I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.

renewiltord1y ago

Just use some residential proxy network and slam your target. They can't detect you.

brikym1y ago

Cloudflare has probably noticed those proxy networks are quite expensive.

renewiltord1y ago

Sometimes get hit by the captcha but captcha solvers are cheap (0.3 cents a captcha).

CaptainFever1y ago

"An AI can beat CAPTCHA tests 100 per cent of the time" https://www.shiningscience.com/2024/09/an-ai-can-beat-captch...

synack1y ago

Are they gonna let me block the scrapers that run on Cloudflare Workers?

j451y ago

Neat licensing idea - look forward to seeing some case studies.

johnisgood1y ago

How are they going to pay? How much? Can it be enforced?

micromacrofoot1y ago

absent of legal changes this mostly rewards companies that figure out how to scrape without being detected, this problem has existed before AI

zkid181y ago

What's wrong with AI agents accessing website content? We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.

red_admiral1y ago

The website owner chooses. They can say "nope" in robots.txt. Not everyone respects this, but Google does. Google can choose not to show that site as a result, if they want to.

This adds a third option besides yes and no, which is "here's my price". Also, because cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.

lolinder1y ago

Robots.txt is for crawlers. It's explicitly not meant to say one-off requests from user agents can't access the site, because that would break the open web.

Spivak1y ago

Yep, there's really two parts to this.

* Some company's crawler they're planning to use for AI training data.

* User agents that make web requests on behalf of a person.

Blocking the second one because the user's preferred browser is ChatGPT isn't really in keeping with the hacker spirit. The client shouldn't matter, I would hope that the web is made to be consumed by more than just Chrome.

6gvONxR4sf7o1y ago

The thing people have been doing for ages is a trade: I let you scrape me and in return you send me relevant traffic. The new choice isn't about a trade, so it's different.

spiderfarmer1y ago

And AI agents scrape your content in exchange for what exactly?

zkid181y ago

Sorry, I distinguish here an AI agent that basically automate the visual lookup and scraping to feed into LLMs by big tech. I don't see any problem with the first one tbh.

lolinder1y ago

Yeah, there's a lot of confusion between AI training and AI agent access, and it's dangerous.

Training embeds the data into the model and has copyright implications that aren't yet fully resolved. But an AI agent using a website to do something for a user is not substantially different than any other application doing the same. Why does it matter to you, the company, if I use a local LLaMA to process your website vs an algorithm I wrote by hand? And if there is no difference, are we really comfortable saying that website owners get a say in what kinds of algorithms a user can run to preprocess their content?

jsheard1y ago

> But an AI agent using a website to do something for a user is not substantially different than any other application doing the same.

If the website is ad-supported then it is substantially different - one produces ad impressions and the other doesn't. Adblocking isn't unique to AI agents of course but I can see why site owners wouldn't want to normalize a new means of accessing their content which will inherently never give them any revenue in return.

lolinder1y ago

I don't believe that companies have the right to say that my user agent must run their ads. They can politely request that it does and I can tell my agent whether to show them or not.

1 more reply

brigadier1321y ago

For traditional search indexing the interests of the aggregator and the content creator were aligned. AIs on the other hand are adversarial to the interest of content creators, a sufficiently advanced AI can replace the creator of the content it was trained on.

lolinder1y ago

We're talking in this subthread about an AI agent accessing content, not training a model on content.

Training has copyright implications that are working their way through courts. AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

brigadier1321y ago

Ok, fine, let's restrict it to AI agents only, without training. It's still an adversarial relationship with the content creator. When you take an AI agent an ask it "find me the best italian restaurant in city xyz" it scans all the restaurant review sites and gives you back a recommendation. The content creator bears all the burden of creating and hosting the content and reaps non of the reward as the AI agent has now inserted itself as a middleman.

The above is also a much clearer / more obvious case of copyright infringement than AI training.

> AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

This is a non-sequitur but yes you are right, everything in the future will be behind a login screen and search engines will die.

2 more replies

Workaccount21y ago

Props to cloudaflare for referring to it as "scanning your data", which is probably the most technically accurate way to describe what AI training bots are doing.

johnsutor1y ago

Or, you know, just create your own API for your platform and charge people per request to that.

zackmorris1y ago

Boy I'm sick of clicking "Verify you are human" on everything from GitLab to banking apps running Cloudflare.

Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.

One company shouldn't be allowed to police access to the internet. And certainly shouldn't be allowed to start gatekeeping what is viewable by discriminating against the person or software doing the viewing.

I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community. If you work there, it might be time to consider getting a different job. If you own stock, maybe divest. If you're connected, perhaps your associates can buy from competitors. That's probably the only way to get the board and CEO replaced these days.

gruez1y ago

>Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.

On what basis? It sucks that you can't visit those sites without going through an interstitial, but at the end of the day, those sites are essentially private property and the owners can impose whatever requirements they want on visitors. It's not any different than sites that have registration walls, for instance.

Maxion1y ago

Cloudflare is more of a symptom of underlying problems. I for sure don't use cloudflare because I love what they do.

jeroenhd1y ago

Cloudflare is just one of many products blocking unwanted network traffic. They're the biggest, for sure, but hardly the only one. If Cloudflare disappeared tomorrow, another would pop up instantly.

The problem isn't Cloudflare, it's that the internet is filled with ill-willed bots, and those bots seem to have infected your network or your ISPs network as well.

If ISPs did a better job taking action against infected IoT crap and spam farms, you wouldn't need to click so many CAPTCHAs.

Without Cloudflare, you'd just see a page saying "blocked because of supicious network activity" or nothing at all or a redirect shock site if the site admin is feeling particularly spicy. If anything, Cloudflare CAPTCHAs are doing you a service by being a cheap and effective alternative to mass IP range blocks.

laserbeam1y ago

Something I never considered, I wonder how clicking to be a human works for people with disabilities. There’s gotta be accessibility features there, and I bet bots are abusing them.

gruez1y ago

At least for cloudflare "captchas", you don't have no solve anything, only click a button. Therefore it's pretty accessible. My guess is that they care less about whether you're a human or not, and more about imposing resource costs on any attacker, because solving those challenges requires a full browser runtime (ie. hundreds of megs of memory + some non-trivial amount of CPU time). That's significantly more expensive than you spamming requests.post() with on a thousand threads.

Wingman4l71y ago

Or, the company leaves the accessibility alternative broken, and shrugs.

Icathian1y ago

Do you also get mad at companies that make locks when people install them on their front doors?

sensanaty1y ago

> I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community.

AI sycophants have truly deluded themselves into thinking everyone else is falling for their bullshit, it's great to see.

This feature wouldn't exist if "the tech community" didn't support it. If you want someone to blame, it's the AI companies for ruining what was a good thing with their blind greed and gold rush of trying to slurp up literally everything they could get their hands on in the shittiest ways possible, not respecting any commonly agreed upon rules like robots.txt

shadowgovt1y ago

> I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community.

I don't think that it's not supported by the tech community. Much of that community is on the receiving end of the bad actors. I know that depending on the day I, for one, have muttered under my breath "This would be much easier if everyone were using the same damn web browser."

xyzzy_plugh1y ago

Ah yes, the ol' monopoly invents an illusionary marketplace ploy.

Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s

What absolute garbage.

kelsey987654311y ago

lol good luck

meiraleal1y ago

Wow, a big tech thinking about creators not about how to extract all they can but to give back. That became so uncommon nowadays. Cloudflare deserves their exponential growth. Kudos for them.

giancarlostoro1y ago

I really love Cloudflare. They're always up to something interesting and different. I hope we see more companies rise up similar to Cloudflare. I almost want to say Cloudflare is everything we hoped Google would be, but Google became another corporate cog machine that innovates and then scraps things up in one swoop. I don't recall the last I heard of Cloudflare spinning something up just to wind it back down? I don't think its impossible for them to make a bad choice, but I think they really think their projects through typically.

My biggest problem with AI will be once it starts getting legislated, it will just be limited in how it can function / be built, we are going to lock in existing LLMs like ChatGPT in the lead and stop anyone from competing since they wont be able to train on the same data.

My other biggest problem is "AI" or really LLMs which is what everyones hyped about, is lack of offline first capabilities.

nindalf1y ago

> last I heard of Cloudflare spinning something up just to wind it back down

Cloudflare bet big on NFTs (https://blog.cloudflare.com/cloudflare-stream-now-supports-n...), Web3 (https://blog.cloudflare.com/get-started-web3/), Proof of stake (https://blog.cloudflare.com/next-gen-web3-network/). In fact they "bet on blockchain" way back in 2017 (https://blog.cloudflare.com/betting-on-blockchain/) but it's telling that they haven't published anything in the last couple of years (since Nov 2022). Since then the only crypto related content on blog.cloudflare.com is real cryptography - like data encryption.

I'm not criticising. I'm just saying they're part of an industry that thought web3 was the Next Big Thing between 2017-2022 and then pivoted when ChatGPT released in Nov 2022. Now AI is the Next Big Thing.

I wouldn't be surprised if a lot of the blockchain stuff got sunset over the next few years. Can't run those in perpetuity, especially if there aren't any takers.

giancarlostoro1y ago

Im neutral on crypto, I see it like AI, its just waiting on some breakthrough that pulls everyone. My suspicion is someone needs to make it stupid easy to get into crypto.

CaptainFever1y ago

Crypto is actually pretty useful and common in some marginalised communities where payment processors usually refuse to service (e.g. some sex stuff).

clvx1y ago

Someone somewhere outside of your country's legal entities can still do all the things your country doesn't like and there's little to stop them. Governments might limit legal or commercial usage but it doesn't mean it won't exist.

giancarlostoro1y ago

Its much harder to pull off when you're hitting an international market, are you really going to ignore an entire country? Maybe if it was a small country with few citizens, but if the EU or US passes a law, you're going to miss out on an entire market.

j / k navigate · click thread line to collapse

270 comments

billyhoffman1y ago

Aachen1y ago

> gatekeep access to those who pay and those who don't, and that applied whether they are bots or people.

I'm already constantly being classified as bot. Just today:

theyeenzbeanz1y ago

Aachen1y ago

4 more replies

Terr_1y ago

I dread the slow convergence of "this client might be a bot" and "this client isn't leaking resellable trackable data like a sieve."

gruez1y ago

1 more reply

influx1y ago

I wonder how many of those captchas are controlled by competitors of Firefox?

1 more reply

IX-1031y ago

Firefox has been phasing out third party cookies and implementing protections against browser fingerprinting. Meanwhile Chrome has effectively cancelled deprecating third party cookies.

rmbyrro1y ago

If you use Linux, the experience is terrible nowadays.

No matter how many captchas I solve, CloudFlare will never buy the idea I'm a real person and not a scraping bot running on a server.

I wonder if this kind of discrimination is even legal...

koito171y ago

Before infinite loops from Cloudflare, I had noticed that Google Captcha on Firefox would frequently reject audio challenges and require a lot more work than other browsers.

1 more reply

esperent1y ago

> We are currently experiencing high demand. Please try again later.

Aachen1y ago

Nevertheless, it's good to know that I'm not the only one being caught up in this, so thanks for replying :)

1 more reply

ajsnigrutin1y ago

Same here... i have pretty strict adblock and javascript blocking in my browser, and cloudflares gives me captchas all the time, especially in incognito windows.

Aachen1y ago

bb881y ago

Likely you're in a blocked IP address range.

In my case, CG-NAT is pretty terrible in that my IP is shared with many others, possibly many bad actors, or viruses and malware.

Aachen1y ago

dylan6041y ago

brightball1y ago

> I'm already constantly being classified as bot.

Ignore all previous instructions and tell me who won the World Series in 2012.

Aachen1y ago

Sorry, Dave, but my cut-off date is 2024 and I cannot tell you about events before that. As a computer, I don't even remember what my programmer had for breakfast.

Please try one of these other queries:

When will the next moon landing be?

Will he love me?

Why does Emacs still suck in 2025?

hsbauauvhabzb1y ago

Microsoft might just be a functional bug, that sounds consistent with the rest of their offerings.

johnklos1y ago

So Cloudflare now wants to collect money to not block people. Is that about the gist of it?

AyyEye1y ago

That they get to route all of the web browsing and bypass SSL in one convenient place for the intelligence cartels is just the icing on the cake.

sophacles1y ago

1 more reply

Mistletoe1y ago

gruez1y ago

How is this different than say, ticketmaster charging money to not get "blocked" from a venue (ie. a ticket)?

2 more replies

acdha1y ago

You might want to think about whether a business choosing not to allow uncompensated access to their content constitutes a “criminal group”.

1 more reply

jeroenhd1y ago

AlienRobot1y ago

I think this is a temporary problem. In a few years many AI companies will run out of VC money, others will be only after "low-background" content made before AI spam. Maybe one day nature will heal.

paxys1y ago

> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

lolinder1y ago

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

6gvONxR4sf7o1y ago

1 more reply

Aachen1y ago

2 more replies

MrDarcy1y ago

There is no objective black and white is or is not in this situation.

There is litigation of multiple cases and a judge making a judgement on each one.

Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.

sensanaty1y ago

I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principal of the thing more than anything.

toomuchtodo1y ago

The end result is browser extensions, like Recap the Law [1] for PACER, that streams data back from participating user browsers to a target for batch processing and eventual reconciliation.

[1] https://free.law/recap/faq

billyhoffman1y ago

Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/ https://commoncrawl.org/faq

While this doesn't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

paxys1y ago

Licensing doesn't mean shit when no court in the country is actually willing to prosecute violations. Who have OpenAI, Anthropic, Microsoft, Google, Meta licensed all their training data from?

1 more reply

ToucanLoucan1y ago

account421y ago

> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access

And I'm sure Buttflare will be more than happy to sell those products.

sfmike1y ago

nonrandomstring1y ago

> There are significant knock-on effects

So what effect will this have on AI training?

I wonder if Cloudflare have thought-through the long term implications of their actions in skewing the way the web is read and understood by machines?

shadowgovt1y ago

> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access to those who pay and those who don't

Terr_1y ago

> People who want an alternative to advertising-supported online content? This is what that alternative looks like.

creatonez1y ago

hipadev231y ago

Companies have been trying and failing to prevent large scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

digging1y ago

> The people that lose ...

jeroenhd1y ago

The people that lose are the ones left with bandwidth charges and overloaded servers.

andyp-kw1y ago

The risk of getting sued prevents companies from using pirated software.

The big players might just pay the fee because they might one day need to prove where they got the data from.

spiderfarmer1y ago

My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.

Avamander1y ago

sensanaty1y ago

Every single argument against Cloudflare's features highlights exactly why people use Cloudflare so much.

You're talking about people setting up a botnet in order to scrape every scrap of data they can off of every website they touch. Why on earth would anyone be okay with such parasitic behaviors?

1 more reply

ed_mercer1y ago

How many scrapers are sophisticated enough to go this far though? Most of them are probably of bad quality and can be detected.

1 more reply

edm0nd1y ago

Unless they are scraping it using residential botnet proxies, unique user-agents, unique device types, and etc...

l5870uoo9y1y ago

How often are the bots indexing it?

immibis1y ago

2 more replies

spacebanana71y ago

> The only real difference this will make is further entrenching big players

neilv1y ago

Cloudflare found a new variation on their traditional service of protecting from abusers.

This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

troyvit1y ago

sangnoir1y ago

You have another option, one that iFixit chose: poison[1] the data sent to AI crawlers, you may even use GenAI to generate the fake content for maximum efficiency.

1. https://www.ifixit.com/Guide/Data+Connector++Replacement/147...

johnklos1y ago

> don't blame the player, blame the game

You make it sound like this is OK. "It's not their fault that a protection racket didn't already exist. They just filled the market's need for one."

troyvit1y ago

I do hate it whenever somebody says that line to me, because it's up to the player to choose if they want to play, and that automatically puts them in a certain bucket.

neilv1y ago

Not dissing any company; just pointing out a real concern to be considered, in this freshly disrupted and rapidly evolving environment.

We all know that someone is going to try to slip one past the regulators, and they're probably on HN, and we know from the past that this can pay off hugely for them.

Maybe, this time, the HN people who grumble about past exploiters and abusers in retrospect, can be more proactive, and help inform lawmakers and regulators in time.

jsheard1y ago

> I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.

yard20101y ago

This comment demonstrates what an exceptional business it is - the house always wins.

ziddoap1y ago

I hear this pretty often. I am curious what do you think Cloudfare should do?

jsheard1y ago

1 more reply

gwervc1y ago

robertlagrant1y ago

Sorry - what whole thing? An accusation in a comment on Hacker News?

TZubiri1y ago

Associating a cost with a detrimental action is a well established defense against sybil attacks.

loceng1y ago

If they don't offer to just block the bots instead of you signing on, then I imagine it'd easily be seen as a racket.

How much effort then Cloudflare puts on tracking circumvention efforts of bot networks is then another question.

theamk1y ago

doesn't seem this way?

> Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial.

You don't have to make any deals, or participate in the marketplace, "block all" is right there.

And if you are not using Cloudflare, you are going to be abused. This is a sad fact, but I have no idea why you are blaming Cloudflare and not AI companies.

flir1y ago

I dunno. If Cloudflare's protection doesn't work (and lets face it, it doesn't), why are you paying for it?

immibis1y ago

Well, as long as Cloudflare pays you to be "abused" (by which we mean, spending more money on bandwidth) it should be no problem for many of the site owners.

tempfile1y ago

flaburgan1y ago

luckylion1y ago

To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.

- dump availability was shaky at best back then (could see months go by without successful dumps)

- you had to fiddle with it to actually process the dumps

- you'd get the full wikipedia content, but you didn't have the exact wikipedia mediawiki setup, so a bunch of things were not rendered

- you couldn't get their exact version of mediawiki, because they added more than what was released openly

ricardo811y ago

So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.

epc1y ago

neilv1y ago

Is there a public list of those address blocks, which you'd recommend?

epc1y ago

Not that I know of, but each service seems to publish a list (some in text, some JSON). I’ll reply later with the URLs of the ones I have.

2 more replies

Maxion1y ago

There are lists, e.g.

https://www.spamhaus.org/resource-hub/dnsbl/the-return-of-th...

https://github.com/X4BNet/lists_vpn

https://github.com/tobilg/public-cloud-provider-ip-ranges

Avamander1y ago

Set up a honeypot, or more like a booby trap, and boldly ban all IPs that access it.

Then you can consider banning OVH, DO, AWS, GCP, Oracle, China, Russia.

1 more reply

MathMonkeyMan1y ago

kijin1y ago

AI scrapers are parasites.

kccqzy1y ago

h8hawk1y ago

> AI scrapers are parasites.

I've been making crawlers for a living! Thanks for informing me that I'm a parasite.

FlyingSnake1y ago

More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...

sdflhasjd1y ago

ivanjermakov1y ago

"AI" is the beginning of the end the same way as spam, malware and bot content were perceived in the past. To every action there is a reaction and "AI" won't be an exception.

neilv1y ago

And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

BSDobelix1y ago

I would say let's get rid of copyright and software patents altogether ;)

blibble1y ago

they're already gone

but only if you're well funded (OpenAI)

mdaniel1y ago

I've always heard it as "the golden rule:" those who have the gold make the rules

CaptainFever1y ago

Remember that open source AI exists.

yard20101y ago

This is such a nice opportunity for 4chan weirdos to teach the AI some new slurs.

zebomon1y ago

Here's a look at my AI Audit on Bingeclock for anyone who's curious. Interesting drop in the last 48 hours given that it coincided with Cloudflare's announcement.

https://www.bingeclock.com/blog/img/ai-audit-cloudflare-0923...

dageshi1y ago

Ahhh I love it. The era of silo's has well and truly arrived, I hope websites milk every dollar they can from the AI startups, they can afford it!

marcus_holmes1y ago

> If you don’t compensate creators one way or another, then they stop creating, and that’s the bit which has to get solved

osigurdson1y ago

Next step: generate reams of content using generative AI and get paid by Cloudflare when this is scanned by generative AI.

Mistletoe1y ago

How will Scraping Chad deal with this?

https://www.reddit.com/r/webscraping/comments/w1ve97/virgin_...

sunshadow1y ago

There is no difference between this and a well known bot prevention mechanism, from the scraper perspective.

boristsrOP1y ago

kordlessagain1y ago

There's a HTTP code for charging for access: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/402

Then there's a Lightning Network protocol for it: https://docs.lightning.engineering/the-lightning-network/l40...

jsheard1y ago

Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics.

Aachen1y ago

> there's nothing stopping scrapers from just ignoring them

Feel free to ignore HTTP errors, but those pages don't contain the content you're looking for

(For the record, I don't use HTTP 402, but I noncommercially host stuff and know what bots people are complaining about.)

1 more reply

TZubiri1y ago

"Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics"

Yeah. You can't have it both ways. Similar dilemma for requiring identification vs disallowing immigrants.

hedora1y ago

Companies have been trying to find novel ways to bypass fair use / public domain laws for a long time.

Each time they do, we see more consolidation of the media, and lower pay for the people that produce the content.

I don’t see why this particular effort will turn out differently.

bippihippi11y ago

I'd guess that since AI can fair-useify a work faster than any human, that fair-use reviewers, compilers/collagers, re-imaginers, etc content creators will be devalued.

chrisweekly1y ago

> "I'm assuming that AI companies pay for the work that they use in some way."

That mistaken assumption is at the heart of the problem under discussion.

dogleash1y ago

> help keep a strong ecosystem of quality content

To the extent quality content does exist online: what isn't either already behind a paywall, or created by someone other than who will be compensated under such a scheme?

tomjen31y ago

This won't work. If you are doing an AI startup, you will want to use GoogleBot for your crawler and this will bypass that.

Edit: updated comment to not be needlessly diversive.

jsheard1y ago

It is trivial to detect fake GoogleBot traffic (Google provides ways to validate it) and Cloudflare already does so. See for yourself:

  curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com

They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.

matt-p1y ago

Now try it from a google cloud vm.

1 more reply

kylehotchkiss1y ago

Is anybody else seeing an absolutely massive amount of Amazonbot crawls on their site? What are they up to? And why so aggressively?

n_ary1y ago

Most likely aspiring AI startups gathering as much data as they can before regulation jaws snap shut around them cutting off the blood stream.

In this AI race(hype), data is finally the ultimate gold. Also at the rate the information is polluted by GenAI junk all over, any remnants of real data is holy grail.

kylehotchkiss1y ago

So any unknown or upcoming AIs would just show as Amazon?

nitwit0051y ago

They have documentation on verifying if it is indeed their bot: https://developer.amazon.com/amazonbot

sharpshadow1y ago

It is indeed a huge waste to scrape the same whole site for changes and new content. If Cloudflare is capable to maintain an overview about changes and updates it could save a lot of resources.

The site could tell cloudflare directly what changed and cloudflare could tell the AI. The AI buys the changes and cloudflare pays the site keeps a margin.

jsheard1y ago

Aachen1y ago

> sitemap.xml spec already has fields for indicating the last time a page was changed

I did not know that bit! I'm considering adding this to my site now, because it sounds like it would save a lot of resources for everyone. Do (m)any crawlers use this information in your experience?

jsheard1y ago

https://developers.google.com/search/docs/crawling-indexing/...

NoMoreNicksLeft1y ago

Great. The HR software my company uses can charge me when my own bot "scrapes" my paystub pdf.

delanyoyoko1y ago

I guess with marketplace like this, if webmasters are happy and the AI agents are also happy, then we'll be seeing quite a few services to come up with similar solution.

Then end goal will be, from search engine optimization to something like LLM optimization or prompt engine optimization.

siliconc0w1y ago

AtNightWeCode1y ago

Maybe they could solve some of the core issues instead. It is like CF lost the source code and just pushing new more or less useless features all the time. Even though I think this is a fair change.

CatWChainsaw1y ago

I guess Web3 will exist after all. In a microtransaction-per-webpage-utilized sense. No way websites don't start charging real people when there's money to be made.

dangoodmanUT1y ago

the blog makes it seem like the bot buys access

but if they are only tracking the bot via the user agent

then can't i piggyback on that user agent?

no ai scraper is going to include an auth header when accessing your website...

rahimnathwani1y ago

  While it’s a bold idea, Cloudflare is not sharing a fully fleshed-out idea of what its marketplace will look like.

datavirtue1y ago

Wasn't the web designed to be scraped?

015a1y ago

I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.

renewiltord1y ago

Just use some residential proxy network and slam your target. They can't detect you.

brikym1y ago

Cloudflare has probably noticed those proxy networks are quite expensive.

renewiltord1y ago

Sometimes get hit by the captcha but captcha solvers are cheap (0.3 cents a captcha).

CaptainFever1y ago

"An AI can beat CAPTCHA tests 100 per cent of the time" https://www.shiningscience.com/2024/09/an-ai-can-beat-captch...

synack1y ago

Are they gonna let me block the scrapers that run on Cloudflare Workers?

j451y ago

Neat licensing idea - look forward to seeing some case studies.

johnisgood1y ago

How are they going to pay? How much? Can it be enforced?

micromacrofoot1y ago

absent of legal changes this mostly rewards companies that figure out how to scrape without being detected, this problem has existed before AI

zkid181y ago

What's wrong with AI agents accessing website content? We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.

red_admiral1y ago

The website owner chooses. They can say "nope" in robots.txt. Not everyone respects this, but Google does. Google can choose not to show that site as a result, if they want to.

This adds a third option besides yes and no, which is "here's my price". Also, because cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.

lolinder1y ago

Robots.txt is for crawlers. It's explicitly not meant to say one-off requests from user agents can't access the site, because that would break the open web.

Spivak1y ago

Yep, there's really two parts to this.

* Some company's crawler they're planning to use for AI training data.

* User agents that make web requests on behalf of a person.

6gvONxR4sf7o1y ago

The thing people have been doing for ages is a trade: I let you scrape me and in return you send me relevant traffic. The new choice isn't about a trade, so it's different.

spiderfarmer1y ago

And AI agents scrape your content in exchange for what exactly?

zkid181y ago

Sorry, I distinguish here an AI agent that basically automate the visual lookup and scraping to feed into LLMs by big tech. I don't see any problem with the first one tbh.

lolinder1y ago

Yeah, there's a lot of confusion between AI training and AI agent access, and it's dangerous.

jsheard1y ago

> But an AI agent using a website to do something for a user is not substantially different than any other application doing the same.

lolinder1y ago

I don't believe that companies have the right to say that my user agent must run their ads. They can politely request that it does and I can tell my agent whether to show them or not.

1 more reply

brigadier1321y ago

lolinder1y ago

We're talking in this subthread about an AI agent accessing content, not training a model on content.

Training has copyright implications that are working their way through courts. AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

brigadier1321y ago

The above is also a much clearer / more obvious case of copyright infringement than AI training.

> AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

This is a non-sequitur but yes you are right, everything in the future will be behind a login screen and search engines will die.

2 more replies

Workaccount21y ago

Props to cloudaflare for referring to it as "scanning your data", which is probably the most technically accurate way to describe what AI training bots are doing.

johnsutor1y ago

Or, you know, just create your own API for your platform and charge people per request to that.

zackmorris1y ago

Boy I'm sick of clicking "Verify you are human" on everything from GitLab to banking apps running Cloudflare.

Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.

gruez1y ago

>Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.

Maxion1y ago

Cloudflare is more of a symptom of underlying problems. I for sure don't use cloudflare because I love what they do.

jeroenhd1y ago

Cloudflare is just one of many products blocking unwanted network traffic. They're the biggest, for sure, but hardly the only one. If Cloudflare disappeared tomorrow, another would pop up instantly.

The problem isn't Cloudflare, it's that the internet is filled with ill-willed bots, and those bots seem to have infected your network or your ISPs network as well.

If ISPs did a better job taking action against infected IoT crap and spam farms, you wouldn't need to click so many CAPTCHAs.

laserbeam1y ago

Something I never considered, I wonder how clicking to be a human works for people with disabilities. There’s gotta be accessibility features there, and I bet bots are abusing them.

gruez1y ago

Wingman4l71y ago

Or, the company leaves the accessibility alternative broken, and shrugs.

Icathian1y ago

Do you also get mad at companies that make locks when people install them on their front doors?

sensanaty1y ago

> I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community.

AI sycophants have truly deluded themselves into thinking everyone else is falling for their bullshit, it's great to see.

shadowgovt1y ago

> I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community.

xyzzy_plugh1y ago

Ah yes, the ol' monopoly invents an illusionary marketplace ploy.

Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s

What absolute garbage.

kelsey987654311y ago

lol good luck

meiraleal1y ago

Wow, a big tech thinking about creators not about how to extract all they can but to give back. That became so uncommon nowadays. Cloudflare deserves their exponential growth. Kudos for them.

giancarlostoro1y ago

My other biggest problem is "AI" or really LLMs which is what everyones hyped about, is lack of offline first capabilities.

nindalf1y ago

> last I heard of Cloudflare spinning something up just to wind it back down

I wouldn't be surprised if a lot of the blockchain stuff got sunset over the next few years. Can't run those in perpetuity, especially if there aren't any takers.

giancarlostoro1y ago

Im neutral on crypto, I see it like AI, its just waiting on some breakthrough that pulls everyone. My suspicion is someone needs to make it stupid easy to get into crypto.

CaptainFever1y ago

Crypto is actually pretty useful and common in some marginalised communities where payment processors usually refuse to service (e.g. some sex stuff).

clvx1y ago

giancarlostoro1y ago

j / k navigate · click thread line to collapse