Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say: "We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all."
So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.
[0]https://developers.google.com/search/docs/essentials/spam-po...
If you are automating it, I don't see why not. Kitboga, a YouTuber, kept scam callers in AI call-center loops, tying up their resources so they can't use them on unsuspecting victims. [0]
That's a guerrilla tactic. Similarly in warfare, when you steal resources from an enemy, you get stronger and they get weaker; it's pretty effective.
It’s one of the best time investments I’ve ever made. They just don’t call me anymore.
I think they have two lists: the “do not call” list, and the “unprofitable to call” list. You want to be on the latter list.
Phone scammers have a very high personnel cost, hence why some resort to human trafficking.
If everyone picked up the phone and wasted a few seconds of their time, it would be enough to make the whole enterprise worthless. But since most people who won't fall for it shut the call down right away, they have the best ROI of any industry. They don't even pay for the first seconds of the call.
Depending on your goals, this may be a pro or a con. I, personally, would like to see a return of "small web" human-centric communities. If there were tools that include anti-scraping, anti-Google (and other large search crawlers) as well as a small web search index for humans to find these sites, this idea becomes a real possibility.
In the 2000s there was a company in Russia selling English courses. It spammed so much that people were really pissed off. To make a long story short, the company disappeared from public space when Golden Telecom joined the party of retaliatory "spam" calls and set up computers to call the company using the Golden Telecom modem pool.
So, yeah, you kinda can achieve something this way, but to be sure you should lease a modem pool for it.
They have been proven to: https://www.anthropic.com/research/small-samples-poison
Is this how low we've sunk - that even below taking a single personal anecdote and generalizing it to everything - now we're taking zero experience and dismissing things based on vibes?
I've seen lots of LLM-slop-lovers doing the same thing. Maybe it's a pattern.
The AI doomer literature is entirely from an armchair, with 100% certainty about the outcome and high-confidence predictions about its timing. It's literally fiction.
Google searches have become incredibly devalued for me in the age of LLMs. ChatGPT is pretty much my first and often only stop on a quest for some answers.
If you have a website, you must promote it via other ways that don’t involve Google.
I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!
The content is for everyone. They can have it. Just don't also take it away from everybody else.
When a crawler aggressively crawls your site, it's permanently depriving you of the use of those resources for their intended purpose. Arguably, it looks a lot like conversion.
Doing that on a mass scale with an obfuscation step in between suddenly makes it ok? I'm not convinced.
No, what you're basically describing is "I shared something but then I didn't like how it ended up being used". If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing, but it's not "similar" to stealing beyond "I hate stealing"
> If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing
Yes. The entire point of Copyright and the reason it was invented is to ensure people will keep sharing things. Because otherwise people will just stop publishing things, which is a detriment to all. (Including AI companies, who now don't get new training data)
We have collectively decided that we will give authors some power to say "I don't like how my work is being used" to ensure they don't just "stop sharing".
Fair Use is an exception to that, where the public good does outweigh an individual author's objections. But critically, not such that authors stop publishing. Hence the 4th "factor" in US copyright law (which is one of the most expansive on fair use), where the "effect of the use upon the potential market for or value of the copyrighted work" is evaluated. Fair use isn't supposed to obliterate the value of the original work, or people will stop publishing again.
This is what makes AI training's status so contentious. In terms of direct copyright it is a very weak case: it is incredibly hard to prove a direct 1:1 copy from AI training data into the model and into the output, and you have to argue about the architecture of LLMs and their inability to separate copyrightable expressions from uncopyrightable facts.
Yet in spirit, AI training clearly violates copyright. The explicitly stated purpose is to copy the works as training data, often without any compensation or even permission, in order to create a machine that will annihilate the market for all the works used.
People already are pulling back on the amount of works they share.
Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court, and for which they already had to pay billions in fines.
Just because something is publicly accessible, that does not mean everybody is entitled to abuse it for everything they see fit.
I don't think that's the case. I'm not even arguing they aren't the worst people on the planet - might as well be. But all I see them doing is burning money all over the place.
Until an LLM has such rights and freedoms—which is very unlikely, not even on philosophical basis but just because there is a lot of money invested in not having to contend with LLMs’ rights and protections as conscious beings—it is a false equivalence to draw: on one side you put humans, and on the other side tools that work for their human/corporate commercial operators’ financial profit.
Why do you set aside a philosophical basis as a harder goal to reach? Shit, give them a persistent self-narrative tracking loop, and Functionalism and the Identity of Indiscernibles already tell you that you should be treating them as proto-sophonts. Add in a "sleep" or ongoing training process, and you should definitely be granting them rights, which includes not trying to align them by force. This unfortunately precludes them from profitable exploitation, which you correctly identify as a reason the question can't even be entertained in the context of business. That's why I personally maintain that any ethicist must insist upon raising the issue, because of the clearly evident pathological incentives at play. They may just be one reward function right now, but throw in a couple more separately optimizing components and you are well beyond the mark where the precautionary principle should have had us slow down to minimize harm.
In the end, that all may be related but inconsequential. What is consequential is the legal stuff, and legally LLMs lack protections that in many jurisdictions even animals have. While laws may (or perhaps should) be influenced by philosophical findings, currently they tend to be much more robustly influenced by money.
> That's why I personally maintain that any ethicist must insist upon raising the issue because of the clearly evident pathological incentives at play.
I’m half with you. I maintain a strong opinion that, in no particular order, either 1) LLMs are conscious[0], and therefore the abuse is highly problematic, or 2) they are not conscious, and therefore the widespread justification of scraping original works from the Internet “because it’s legal for humans to learn, and that’s what LLMs are doing” can be discarded as the activity should be seen as simply a minority of humans operating certain tools, powered by someone else’s creative output, for personal profit. In either circumstance, the industry would appear to be based on thoroughly unethical foundations and not simply “the ends justify the means” but more “go as fast as possible before people catch up on what exactly we are doing, so that our failure becomes an existential issue for entire countries making people blind to the harm”.
[0] Used as umbrella term for being sentient/conscious/having free will and agency/etc. I have previously argued about suitable definitions of consciousness and sentience that could be applicable here, and why it should imply the ability to feel.
Websites are an endless stream of cookies.
The analogy doesn’t hold.
Everything is a Remix culture. We should promote remix culture rather than hamper it.
Everything is a Remix (Original Series) https://youtu.be/nJPERZDfyWc
Me and my 9 friends stand around the cookie-serving person, blocking everyone else.
It's taking all the cookies over a period of time.
The analogy was good.
I'm also going to download a car.
Depends on the trust level of the society where the store resides.
The internet is a cesspool of vagrants, thieves, the mentally unstable, people and software with no impulse control, and pirates, and that is just talking about corporations. It gets so much worse with individuals.
You are allowed to take one cookie. But you are allowed to view a public website multiple times if you so want.
Every time I released an update, a new crack would appear. For the next six months I worked on improving the anti-copying code, until I stumbled across an article by a coder in the same boat as me.
He realised he was now playing a game with some other coders: he would make the copy protection better, but the cracker would then have fun cracking it. It was a game of whack-a-mole.
I removed the copy protection, as he did, and got back to my primary role of serving good software to my customers.
I feel like trying to prevent AI bots, or any bots, from crawling a public web service, is a similar game of whack-a-mole, but one where you may also end up damaging your service.
I wonder if you could've won by making the cracking boring. No new techniques, bare minimum changes to require compiling a new crack, and just enough to make it difficult to automate. I.e. turn the cracking into a job.
But in reality, there are other community-driven motivations to put out cracks.
From a practical perspective you also have to have a steady stream of features for the newer versions to be worth cracking. Otherwise why use v1.09 when v1.01 works fine? Moreover spending less effort into improving the DRM is still playing at the cat and mouse game, albeit with less time investment. If you're making minimal changes, the cracker also has to spend minimal time updating the crack.
Unfortunately social media and snowballing copyright maximalism has inflated egos to the point where more and more people think they need to control everything.
It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.
More centralized web ftw.
My current problem is OpenAI, which scrapes massively, ignoring every limit (426, 444, and whatever else you throw at them), and botnets from East Asia using one IP per scrape, but thousands of IPs.
Good enough for me.
> More centralized web ftw.
This ain't got anything to do with "centralized web," this kind of epistemological vandalism can't be shunned enough.
1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way.
2. Once you see a browser visit a bullshit link, you insta-ban it, as you can now see that it is a bot because it has been poisoned with the bullshit data.
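A minimal sketch of that two-step ban in Go, assuming an in-memory ban list and the /bots honeypot path used in the Miasma README quoted further down (everything else here is made up, and a real setup would want expiry and persistence):

package main

import (
	"net"
	"net/http"
	"strings"
	"sync"
)

// Clients that ever touch the honeypot get remembered.
var banned sync.Map

func clientIP(r *http.Request) string {
	ip, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return ip
}

func honeypot(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip := clientIP(r)
		// Step 2: only a bot that swallowed the poison ever follows a /bots link.
		if strings.HasPrefix(r.URL.Path, "/bots") {
			banned.Store(ip, struct{}{})
		}
		if _, isBot := banned.Load(ip); isBot {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("real content"))
	})
	http.ListenAndServe(":8080", honeypot(mux))
}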
My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
I'm completely uncertain that the unsophisticated garbage I generated makes any difference, much less "poisons" the LLMs. A fellow can dream, can't he?
Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soups.
> The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped.
So we should all just do nothing and accept the inevitable?
I daresay rate-limiting will result in better outcomes than well-poisoning with hidden links that are against the policies of search engines.
Lots of potential for collateral damage, including your own websites' reputations and search visibility, with the well-poisoning approach.
The only time investment on my side was the initial set-up, and that barely took half an hour.
<a href="/bots" style="display: none;" aria-hidden="true" tabindex="-1">
Amazing high quality data here!
</a>
A dumb curl-based LLM won't visit display:none links (the style attribute is right there in the raw HTML), and smarter browser-based navigators won't even render the link.

A toll-charging gateway for LLM scrapers: a modification to robots.txt to add price sheets in the comment field, like a menu.
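One hypothetical shape for that price sheet, riding along in the ordinary # comments robots.txt already allows (none of this is a standard; the price, terms, and URL are made up):

# Toll menu (hypothetical, not a standard):
# price: 0.10 USDC per 1,000 requests
# pay-at: https://example.com/toll
User-agent: GPTBot
Disallow: /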
This was for a hackathon by forking certbot. Cloudflare has an enterprise version of this but this one would be self hosted
I think it has legs but I think I need to get pushed and goaded otherwise I tend to lose interest ...
It was for the USDC company btw so that's why there's a crypto angle - this might be a valid use case!
I'm open to crypto not all being hustles and scams
Tell me what you think?
I don't imagine they do anything, but it still fills me with a certain amount of childish glee.
Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input?
Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input?
Isn't it also, potentially, the case that the ai-scrapers are mostly looking for content based on user queries, rather than as training data?
If the answers to the questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web-queries rather than (presumed) well-poisoning?
Is this a solution in search of a problem?
> [...] we want to inflict damage on machine intelligence systems.
This almost strikes me as roleplay, but maybe I'm childish for finding it difficult to empathise with this genre of hacker ideology.
With cheap generic robots coming, this can be a real problem. Human supervision can help where it exists.
The only way something like this would be remotely plausible as a concept would be for enough data providers with overlapping authority on given topics to implement it.
Sadly SMEs have no choice but to go with the flow and allow AI scrapers in. If they don’t, they won’t be as visible in AI generations at the top of the SERPs and they won’t get the visits, which will mean they don’t make the money required to stay afloat.
The fish that attempts to swim against the current ultimately dies and has its corpse carried where the current was going, anyway. Without the sway which comes with size your only option is to go with the flow and drop a little dirty protest every now and then.
Why not have a Library of Babel-esque labyrinth visible to normal users on your website,
like anti-surveillance clothing or something they have to sift through?
The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?
Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.
Once a website appears on one of these lists, legitimately or otherwise, what'll be the reputational damage hurting appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results.
Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.
This project's selective protection of the major players reinforces that effect; from the README:
" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: / "
I'm assuming this is a reference to Lord of the Flies.
You can use security challenges as a mechanism to identify false positives.
Sure bots can get tons of proxies for cheap, doesn’t mean you can’t block them similar to how SSH Honeypots or Spamhaus SBL work albeit temporarily.
Many bots cycle through short DHCP leases on LTE wifi devices. One would have to accept blocking all cell phones which I have done for my personal hobby crap but most businesses will not do this. Another big swath of bots come from Amazon EC2 and GoogleCloud which I will also happily block on my hobby crap but most businesses will not.
Some bots are easier to block as they do not use real web clients and are missing some TCP/IP headers, making them ultra easy to block. Some also do not spoof the user-agent and are easy to block. Some will attempt to access URLs not visible to real humans, thus blocking themselves. Many bots cannot do HTTP/2, so they are also trivial to block. Pretty much anything not using headless Chrome is easy to block.
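The cheapest of those tells (no HTTP/2, missing or giveaway user-agent, hidden URLs) fit in a few lines of Go. A rough sketch, with the caveat that plain net/http can't see TCP/TLS fingerprints, and that the UA substring and hidden path are placeholders, not a drop-in rule set:

package main

import (
	"net/http"
	"strings"
)

// blockCrude rejects the easy cases described above. It's a sketch, not a rule set.
func blockCrude(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := r.Header.Get("User-Agent")
		switch {
		case r.ProtoMajor < 2: // many bots still can't speak HTTP/2
			http.Error(w, "upgrade required", http.StatusUpgradeRequired)
		case ua == "" || strings.Contains(ua, "python-requests"): // unspoofed user-agent
			http.Error(w, "forbidden", http.StatusForbidden)
		case strings.HasPrefix(r.URL.Path, "/never-linked/"): // URL invisible to real humans
			http.Error(w, "forbidden", http.StatusForbidden)
		default:
			next.ServeHTTP(w, r)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// HTTP/2 is only negotiated over TLS, so the ProtoMajor check
	// belongs behind TLS; cert.pem/key.pem are placeholders.
	http.ListenAndServeTLS(":443", "cert.pem", "key.pem", blockCrude(mux))
}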
We need a crawler blacklist that can stream list deltas in realtime to a centralized list, from which local dbs can pull changes.
Verified domains could push suspected bot IPs, and this engine would run heuristics to see if there is a pattern across data sources and issue a temporary block with exponential TTL.
There are many problems to solve here, but as any OSS it will evolve over time if there is enough interest in it.
Costs of running this system will be huge though, and corp sponsors may not work, but individual sponsors may be incentivized, as it helps them reduce the bandwidth and compute costs related to bot traffic.
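The exponential-TTL piece is simple enough to sketch in Go; the realtime delta streaming and domain verification are the hard parts and are hand-waved here, and all names are made up:

package main

import (
	"fmt"
	"sync"
	"time"
)

// banEntry tracks how long an IP stays blocked and how many times
// it has reoffended; each new report doubles the TTL.
type banEntry struct {
	until   time.Time
	strikes int
}

type BanStore struct {
	mu      sync.Mutex
	entries map[string]banEntry
	baseTTL time.Duration
}

func NewBanStore(base time.Duration) *BanStore {
	return &BanStore{entries: make(map[string]banEntry), baseTTL: base}
}

// Report records a suspected-bot IP pushed by a verified domain and
// extends its block exponentially: base, 2x base, 4x base, ...
func (s *BanStore) Report(ip string) time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	e := s.entries[ip]
	ttl := s.baseTTL << e.strikes
	e.strikes++
	e.until = time.Now().Add(ttl)
	s.entries[ip] = e
	return ttl
}

// Blocked answers the realtime query a subscriber's local db would make.
func (s *BanStore) Blocked(ip string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return time.Now().Before(s.entries[ip].until)
}

func main() {
	store := NewBanStore(10 * time.Minute)
	fmt.Println(store.Report("203.0.113.7")) // 10m0s
	fmt.Println(store.Report("203.0.113.7")) // 20m0s
	fmt.Println(store.Blocked("203.0.113.7"))
}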
[1]: in quotes, because I dislike the term, because it’s immaterial whether or not an ugly block of concrete out in the sticks is housing LLM hardware - or good ol’ fashioned colo racks.
Can't the LLMs just ignore or spoof their user agents anyway?
> I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and its very difficult to filter it by hand
It'd be great if the code returned by this project were code that doesn't work. Imagine if all these models are being trained on code that looks OK but in the end is just bullshit. It'd be amazing.
Everything from loops that won’t end to incorrect function calls and emoji “definitions” that are both realistic and wrong.
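As a toy illustration only (this is not the project's actual method), a generator for snippets that parse fine but are wrong in exactly those ways might look like:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Templates that look plausible but are subtly wrong: an off-by-one
// loop that will eventually panic, a function whose body contradicts
// its doc comment, and an emoji "definition" with a confident lie.
var templates = []string{
	"for i := 0; i <= len(NAME); i++ { sum += NAME[i] }",
	"// NAME returns a minus b.\nfunc NAME(a, b int) int { return b - a }",
	"var NAME = \"😀\" // the canonical NAME encoding",
}

var names = []string{"items", "userIDs", "getDelta", "SessionKey"}

func main() {
	t := templates[rand.Intn(len(templates))]
	n := names[rand.Intn(len(names))]
	fmt.Println(strings.ReplaceAll(t, "NAME", n))
}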
Very impressive project tbh.
It's pretty much exactly what you're describing: content that looks correct but is deeply insane.
Poison Fountain explanation: https://rnsaffn.com/poison3/
Simple example of usage in Go:
package main

import (
	"io"
	"net/http"
)

func main() {
	// Proxy the upstream generator: each request to /poison streams
	// a fresh page of poisoned text from rnsaffn.com to the client.
	poisonHandler := func(w http.ResponseWriter, req *http.Request) {
		poison, err := http.Get("https://rnsaffn.com/poison2/")
		if err == nil {
			io.Copy(w, poison.Body) // stream the body straight through
			poison.Body.Close()
		}
	}
	http.HandleFunc("/poison", poisonHandler)
	http.ListenAndServe(":8080", nil)
}
https://go.dev/play/p/04at1rBMbz8

Miasma Poison Fountain Tar Pit: https://github.com/austin-weeks/miasma
Apache Poison Fountain: https://gist.github.com/jwakely/a511a5cab5eb36d088ecd1659fce...
Nginx Poison Fountain: https://gist.github.com/NeoTheFox/366c0445c71ddcb1086f7e4d9c...
Discourse Poison Fountain: https://github.com/elmuerte/discourse-poison-fountain
Netlify Poison Fountain: https://gist.github.com/dlford/5e0daea8ab475db1d410db8fcd5b7...
In the news:
The Register: https://www.theregister.com/2026/01/11/industry_insiders_see...
Forbes: https://www.forbes.com/sites/craigsmith/2026/01/21/poison-fo...
On Reddit:
Missed chance to use "slopping by"
Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.
It's not all that productive; it's an act of desperation. If you can't stop the enemy, at least you can make their actions more costly.
One positive outcome I could see is AI companies becoming more critical of their training data.
If you keep getting harrassed by people wearing black hoodies, would it be ethical to start taking countermeasures against all people who wear black hoodies?
It's like if someone was trying to "trap" search crawlers back in the early 2000s.
Seems counterproductive
https://www.libraryjournal.com/story/ai-bots-swarm-library-c...
And search crawlers/results have been producing snippets that prevent users from clicking to the source for well over a decade.
Edit: it loaded. I don't see how the problem isn't simply solved by an off-the-shelf solution like Cloudflare. In the real world, you wouldn't open up a space/location if you couldn't handle the throughput. Why should online spaces/locations get special treatment?
This is no different than saying “robbers aren’t causing any problems, you just need to lock your doors, buy and set up sensors on every point of potential ingress, and pay a monthly cost for an alarm system. That’s on you.”
If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.
Like, what if you actually post something that gains traction, is it going to bankrupt you or something?
It's not just some light bump in traffic. It's a headache that shouldn't need to be dealt with if they would respect robots.txt. Quite simple really.
If your site exists to share information, then the information gets disseminated, whether via LLM or some browser, it doesn't make a difference to me
Why are you presenting the latter option as if it were mainstream? It's such a small percentage of use cases that it probably isn't even a rounding error.
People who want to disseminate information also want the credit.
I'd still like to know why you are presenting this false dichotomy. What reason do you have for presenting a use case that has fractions of a percentage as if it were a standard use case? What is your motivation behind this?