Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say: "We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all."
So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.
[0]https://developers.google.com/search/docs/essentials/spam-po...
If you are automating it, I don't see why not. Kitboga, a YouTuber, kept scam callers in AI call-center loops, tying up their resources so they can't use them on unsuspecting victims. [0]
That's a guerrilla tactic. Similarly in warfare, when you steal resources from an enemy, you get stronger and they get weaker; it's pretty effective.
It’s one of the best time investments I’ve ever made. They just don’t call me anymore.
I think they have two lists: the “do not call” list, and the “unprofitable to call” list. You want to be on the latter list.
Phone scammers have a very high personnel cost, hence why some resort to human trafficking.
If everyone picked up the phone and wasted a few seconds of their time, it would be enough to make the whole enterprise worthless. But since most people who won't fall for it shut the call down right away, they have the best ROI of any industry. They don't even pay for the first seconds of the call.
Depending on your goals, this may be a pro or a con. I, personally, would like to see a return of "small web" human-centric communities. If there were tools that include anti-scraping, anti-Google (and other large search crawlers) as well as a small web search index for humans to find these sites, this idea becomes a real possibility.
In the 2000s there was a company in Russia selling English courses. It spammed so much that people were really pissed off. To make a long story short, the company disappeared from public space when Golden Telecom joined the party of retaliatory "spam" calls and set up computers to call the company using the Golden Telecom modem pool.
So, yeah, you kinda can achieve something this way, but to be sure you should lease a modem pool for it.
They have been proven to: https://www.anthropic.com/research/small-samples-poison
Is this how low we've sunk - that even below taking a single personal anecdote and generalizing it to everything - now we're taking zero experience and dismissing things based on vibes?
I've seen lots of LLM-slop-lovers doing the same thing. Maybe it's a pattern.
The AI doomer literature is entirely from an armchair, with 100% certainty about the outcome and high-confidence predictions about its timing. It's literally fiction.
Google searches have become incredibly devalued for me in the age of LLMs. ChatGPT is pretty much my first and often only stop on a quest for some answers.
If you have a website, you must promote it via other ways that don’t involve Google.
I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!
The content is for everyone. They can have it. Just don't also take it away from everybody else.
When a crawler aggressively crawls your site, it's permanently depriving you of the use of those resources for their intended purpose. Arguably, it looks a lot like conversion.
Doing that on a mass scale with an obfuscation step in between suddenly makes it ok? I'm not convinced.
No, what you're basically describing is "I shared something but then I didn't like how it ended up being used". If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing, but it's not "similar" to stealing beyond "I hate stealing"
> If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing
Yes. The entire point of Copyright and the reason it was invented is to ensure people will keep sharing things. Because otherwise people will just stop publishing things, which is a detriment to all. (Including AI companies, who now don't get new training data)
We have collectively decided that we will give authors some power to say "I don't like how my work is being used" to ensure they don't just "stop sharing".
Fair Use is an exception to that, where the public good does outweigh an individual author's objections. But critically, not such that authors stop publishing. Hence the 4th "factor" in US copyright law (which is one of the most expansive on fair use), where the "effect of the use upon the potential market for or value of the copyrighted work" is evaluated. Fair use isn't supposed to obliterate the value of the original work, or people will stop publishing again.
This is what makes AI training's status so contentious. In terms of direct copyright it is a very weak case: it is incredibly hard to prove a direct 1:1 copy from AI training data into the model and into the output, and you have to argue about the architecture of LLMs and their inability to separate copyrightable expressions from uncopyrightable facts.
Yet in spirit, AI training clearly violates copyright. The explicitly stated purpose is to copy the works as training data, often without any compensation or even permission, in order to create a machine that will annihilate the market for all the works used.
People already are pulling back on the amount of works they share.
Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court, and for which they already had to pay billions in fines.
Just because something is publicly accessible, that does not mean everybody is entitled to abuse it for everything they see fit.
I don't think that's the case. I'm not even arguing they aren't the worst people on the planet - might as well be. But all I see them doing is burning money all over the place.
Until an LLM has such rights and freedoms—which is very unlikely, not even on philosophical basis but just because there is a lot of money invested in not having to contend with LLMs’ rights and protections as conscious beings—it is a false equivalence to draw: on one side you put humans, and on the other side tools that work for their human/corporate commercial operators’ financial profit.
Why do you set aside a philosophical basis as a harder goal to reach? Shit, give them a persistent self-narrative tracking loop, and Functionalism and the Identity of Indiscernibles already tell you that you should be treating them as proto-sophonts. Add in a "sleep" or ongoing training process, and you should definitely be granting them rights, which includes not trying to align them by force. This unfortunately precludes them from profitable exploitation, which you correctly identify as a reason the question can't even be entertained in the context of business. That's why I personally maintain that any ethicist must insist upon raising the issue, because of the clearly evident pathological incentives at play. They may just be one reward function right now, but throw in a couple more separately optimizing components and you are well beyond the mark where the precautionary principle should have had us slow down to minimize harm.
In the end, that all may be related but inconsequential. What is consequential is the legal stuff, and legally LLMs lack protections that in many jurisdictions even animals have. While laws may (or perhaps should) be influenced by philosophical findings, currently they tend to be much more robustly influenced by money.
> That's why I personally maintain that any ethicist must insist upon raising the issue because of the clearly evident pathological incentives at play.
I’m half with you. I maintain a strong opinion that, in no particular order, either 1) LLMs are conscious[0], and therefore the abuse is highly problematic, or 2) they are not conscious, and therefore the widespread justification of scraping original works from the Internet “because it’s legal for humans to learn, and that’s what LLMs are doing” can be discarded as the activity should be seen as simply a minority of humans operating certain tools, powered by someone else’s creative output, for personal profit. In either circumstance, the industry would appear to be based on thoroughly unethical foundations and not simply “the ends justify the means” but more “go as fast as possible before people catch up on what exactly we are doing, so that our failure becomes an existential issue for entire countries making people blind to the harm”.
[0] Used as umbrella term for being sentient/conscious/having free will and agency/etc. I have previously argued about suitable definitions of consciousness and sentience that could be applicable here, and why it should imply the ability to feel.
Websites are an endless stream of cookies.
The analogy doesn’t hold.
Everything is a Remix culture. We should promote remix culture rather than hamper it.
Everything is a Remix (Original Series) https://youtu.be/nJPERZDfyWc
Me and my 9 friends stand around the cookie-serving person, blocking everyone else.
It's taking all the cookies over a period of time.
The analogy was good.
I'm also going to download a car.
Depends on the trust level of the society where the store resides.
The internet is a cesspool of vagrants, thieves, the mentally unstable, people and software with no impulse control, and pirates, and that is just talking about corporations. It gets so much worse with individuals.
You are allowed to take one cookie. But you are allowed to view a public website multiple times if you so want.
Every time I released an update, a new crack would appear. For the next six months I worked on improving the anti-copying code, until I stumbled across an article by a coder in the same boat as me.
He realised he was now playing a game with some other coders: he would make the copy protection better, but the cracker would then have fun cracking it. It was a game of whack-a-mole.
I removed the copy protection, as he did, and got back to my primary role of serving good software to my customers.
I feel like trying to prevent AI bots, or any bots, from crawling a public web service, is a similar game of whack-a-mole, but one where you may also end up damaging your service.
I wonder if you could've won by making the cracking boring. No new techniques, bare minimum changes to require compiling a new crack, and just enough to make it difficult to automate. I.e. turn the cracking into a job.
But in reality, there are other community-driven motivations to put out cracks.
From a practical perspective you also have to have a steady stream of features for the newer versions to be worth cracking. Otherwise why use v1.09 when v1.01 works fine? Moreover spending less effort into improving the DRM is still playing at the cat and mouse game, albeit with less time investment. If you're making minimal changes, the cracker also has to spend minimal time updating the crack.
Unfortunately social media and snowballing copyright maximalism has inflated egos to the point where more and more people think they need to control everything.
It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.
More centralized web ftw.
My current problem is OpenAI, which scrapes massively, ignoring every limit (426, 444, and whatever else you throw at them), and botnets from East Asia using one IP per scrape, but thousands of IPs.
Good enough for me.
> More centralized web ftw.
This ain't got anything to do with "centralized web," this kind of epistemological vandalism can't be shunned enough.
1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way.
2. Once you see a browser visit a bullshit link, you insta-ban it, as you can now see that it is a bot because it has been poisoned with the bullshit data.
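A minimal sketch of that two-step ban in Go, assuming an in-memory ban list and the /bots honeypot path used in the Miasma README quoted further down (everything else here is made up, and a real setup would want expiry and persistence):

package main

import (
	"net"
	"net/http"
	"strings"
	"sync"
)

// Clients that ever touch the honeypot get remembered.
var banned sync.Map

func clientIP(r *http.Request) string {
	ip, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return ip
}

func honeypot(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip := clientIP(r)
		// Step 2: only a bot that swallowed the poison ever follows a /bots link.
		if strings.HasPrefix(r.URL.Path, "/bots") {
			banned.Store(ip, struct{}{})
		}
		if _, isBot := banned.Load(ip); isBot {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("real content"))
	})
	http.ListenAndServe(":8080", honeypot(mux))
}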
My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
I'm completely uncertain that the unsophisticated garbage I generated makes any difference, much less "poisons" the LLMs. A fellow can dream, can't he?
Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soups.
> The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped.
So we should all just do nothing and accept the inevitable?
I daresay rate-limiting will result in better outcomes than well-poisoning with hidden links that are against the policies of search engines.
Lots of potential for collateral damage, including your own websites' reputations and search visibility, with the well-poisoning approach.
The only time investment on my side was the initial set-up, and that barely took half an hour.
<a href="/bots" style="display: none;" aria-hidden="true" tabindex="-1">
Amazing high quality data here!
</a>
A dumb curl-based LLM won't visit display:none links (the style attribute is right there in the raw HTML), and smarter browser-based navigators won't even render the link.

A toll-charging gateway for LLM scrapers: a modification to robots.txt to add price sheets in the comment field, like a menu.
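One hypothetical shape for that price sheet, riding along in the ordinary # comments robots.txt already allows (none of this is a standard; the price, terms, and URL are made up):

# Toll menu (hypothetical, not a standard):
# price: 0.10 USDC per 1,000 requests
# pay-at: https://example.com/toll
User-agent: GPTBot
Disallow: /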
This was for a hackathon by forking certbot. Cloudflare has an enterprise version of this but this one would be self hosted
I think it has legs but I think I need to get pushed and goaded otherwise I tend to lose interest ...
It was for the USDC company btw so that's why there's a crypto angle - this might be a valid use case!
I'm open to crypto not all being hustles and scams
Tell me what you think?
I don't imagine they do anything, but it still fills me with a certain amount of childish glee.
Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input?
Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input?
Isn't it also, potentially, the case that the ai-scrapers are mostly looking for content based on user queries, rather than as training data?
If the answers to the questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web-queries rather than (presumed) well-poisoning?
Is this a solution in search of a problem?
> [...] we want to inflict damage on machine intelligence systems.
This almost strikes me as roleplay, but maybe I'm childish for finding it difficult to empathise with this genre of hacker ideology.
With cheap generic robots coming, this can be a real problem. Human supervision can help where it exists.
The only way something like this would be remotely plausible as a concept would be for enough data providers with overlapping authority on given topics to implement it.
Sadly SMEs have no choice but to go with the flow and allow AI scrapers in. If they don’t, they won’t be as visible in AI generations at the top of the SERPs and they won’t get the visits, which will mean they don’t make the money required to stay afloat.
The fish that attempts to swim against the current ultimately dies and has its corpse carried where the current was going, anyway. Without the sway which comes with size your only option is to go with the flow and drop a little dirty protest every now and then.
Why not have a Library of Babel-esque labyrinth visible to normal users on your website,
like anti-surveillance clothing or something they have to sift through?
The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?
Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.
Once a website appears on one of these lists, legitimately or otherwise, what'll be the reputational damage hurting appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results.
Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.
This project's selective protection of the major players reinforces that effect; from the README:
" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: / "
I'm assuming this is a reference to Lord of the Flies.
You can use security challenges as a mechanism to identify false positives.
Sure bots can get tons of proxies for cheap, doesn’t mean you can’t block them similar to how SSH Honeypots or Spamhaus SBL work albeit temporarily.
Many bots cycle through short DHCP leases on LTE wifi devices. One would have to accept blocking all cell phones which I have done for my personal hobby crap but most businesses will not do this. Another big swath of bots come from Amazon EC2 and GoogleCloud which I will also happily block on my hobby crap but most businesses will not.
Some bots are easier to block as they do not use real web clients and are missing some TCP/IP headers, making them ultra easy to block. Some also do not spoof the user-agent and are easy to block. Some will attempt to access URLs not visible to real humans, thus blocking themselves. Many bots cannot do HTTP/2, so they are also trivial to block. Pretty much anything not using headless Chrome is easy to block.
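The cheapest of those tells (no HTTP/2, missing or giveaway user-agent, hidden URLs) fit in a few lines of Go. A rough sketch, with the caveat that plain net/http can't see TCP/TLS fingerprints, and that the UA substring and hidden path are placeholders, not a drop-in rule set:

package main

import (
	"net/http"
	"strings"
)

// blockCrude rejects the easy cases described above. It's a sketch, not a rule set.
func blockCrude(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := r.Header.Get("User-Agent")
		switch {
		case r.ProtoMajor < 2: // many bots still can't speak HTTP/2
			http.Error(w, "upgrade required", http.StatusUpgradeRequired)
		case ua == "" || strings.Contains(ua, "python-requests"): // unspoofed user-agent
			http.Error(w, "forbidden", http.StatusForbidden)
		case strings.HasPrefix(r.URL.Path, "/never-linked/"): // URL invisible to real humans
			http.Error(w, "forbidden", http.StatusForbidden)
		default:
			next.ServeHTTP(w, r)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// HTTP/2 is only negotiated over TLS, so the ProtoMajor check
	// belongs behind TLS; cert.pem/key.pem are placeholders.
	http.ListenAndServeTLS(":443", "cert.pem", "key.pem", blockCrude(mux))
}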
We need a crawler blacklist that can stream list deltas in realtime to a centralized list, from which local dbs can pull changes.
Verified domains could push suspected bot IPs, and this engine would run heuristics to see if there is a pattern across data sources and issue a temporary block with exponential TTL.
There are many problems to solve here, but as any OSS it will evolve over time if there is enough interest in it.
Costs of running this system will be huge though, and corp sponsors may not work, but individual sponsors may be incentivized, as it helps them reduce the bandwidth and compute costs related to bot traffic.
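The exponential-TTL piece is simple enough to sketch in Go; the realtime delta streaming and domain verification are the hard parts and are hand-waved here, and all names are made up:

package main

import (
	"fmt"
	"sync"
	"time"
)

// banEntry tracks how long an IP stays blocked and how many times
// it has reoffended; each new report doubles the TTL.
type banEntry struct {
	until   time.Time
	strikes int
}

type BanStore struct {
	mu      sync.Mutex
	entries map[string]banEntry
	baseTTL time.Duration
}

func NewBanStore(base time.Duration) *BanStore {
	return &BanStore{entries: make(map[string]banEntry), baseTTL: base}
}

// Report records a suspected-bot IP pushed by a verified domain and
// extends its block exponentially: base, 2x base, 4x base, ...
func (s *BanStore) Report(ip string) time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	e := s.entries[ip]
	ttl := s.baseTTL << e.strikes
	e.strikes++
	e.until = time.Now().Add(ttl)
	s.entries[ip] = e
	return ttl
}

// Blocked answers the realtime query a subscriber's local db would make.
func (s *BanStore) Blocked(ip string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return time.Now().Before(s.entries[ip].until)
}

func main() {
	store := NewBanStore(10 * time.Minute)
	fmt.Println(store.Report("203.0.113.7")) // 10m0s
	fmt.Println(store.Report("203.0.113.7")) // 20m0s
	fmt.Println(store.Blocked("203.0.113.7"))
}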
[1]: in quotes, because I dislike the term, because it’s immaterial whether or not an ugly block of concrete out in the sticks is housing LLM hardware - or good ol’ fashioned colo racks.
Can't the LLMs just ignore or spoof their user agents anyway?
> I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and its very difficult to filter it by hand
It'd be great if the code returned by this project were code that doesn't work. Imagine if all these models are being trained on code that looks OK but in the end is just bullshit. It'd be amazing.
Everything from loops that won’t end to incorrect function calls and emoji “definitions” that are both realistic and wrong.
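As a toy illustration only (this is not the project's actual method), a generator for snippets that parse fine but are wrong in exactly those ways might look like:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Templates that look plausible but are subtly wrong: an off-by-one
// loop that will eventually panic, a function whose body contradicts
// its doc comment, and an emoji "definition" with a confident lie.
var templates = []string{
	"for i := 0; i <= len(NAME); i++ { sum += NAME[i] }",
	"// NAME returns a minus b.\nfunc NAME(a, b int) int { return b - a }",
	"var NAME = \"😀\" // the canonical NAME encoding",
}

var names = []string{"items", "userIDs", "getDelta", "SessionKey"}

func main() {
	t := templates[rand.Intn(len(templates))]
	n := names[rand.Intn(len(names))]
	fmt.Println(strings.ReplaceAll(t, "NAME", n))
}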
Very impressive project tbh.
It's pretty much exactly what you're describing: content that looks correct but is deeply insane.
Poison Fountain explanation: https://rnsaffn.com/poison3/
Simple example of usage in Go:
package main

import (
	"io"
	"net/http"
)

func main() {
	// Proxy the upstream generator: each request to /poison streams
	// a fresh page of poisoned text from rnsaffn.com to the client.
	poisonHandler := func(w http.ResponseWriter, req *http.Request) {
		poison, err := http.Get("https://rnsaffn.com/poison2/")
		if err == nil {
			io.Copy(w, poison.Body) // stream the body straight through
			poison.Body.Close()
		}
	}
	http.HandleFunc("/poison", poisonHandler)
	http.ListenAndServe(":8080", nil)
}
https://go.dev/play/p/04at1rBMbz8

Miasma Poison Fountain Tar Pit: https://github.com/austin-weeks/miasma
Apache Poison Fountain: https://gist.github.com/jwakely/a511a5cab5eb36d088ecd1659fce...
Nginx Poison Fountain: https://gist.github.com/NeoTheFox/366c0445c71ddcb1086f7e4d9c...
Discourse Poison Fountain: https://github.com/elmuerte/discourse-poison-fountain
Netlify Poison Fountain: https://gist.github.com/dlford/5e0daea8ab475db1d410db8fcd5b7...
In the news:
The Register: https://www.theregister.com/2026/01/11/industry_insiders_see...
Forbes: https://www.forbes.com/sites/craigsmith/2026/01/21/poison-fo...
On Reddit:
Missed chance to use "slopping by"
Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.
It's not all that productive; it's an act of desperation. If you can't stop the enemy, at least you can make their actions more costly.
One positive outcome I could see is AI companies becoming more critical of their training data.
If you keep getting harrassed by people wearing black hoodies, would it be ethical to start taking countermeasures against all people who wear black hoodies?
It's like if someone was trying to "trap" search crawlers back in the early 2000s.
Seems counterproductive
https://www.libraryjournal.com/story/ai-bots-swarm-library-c...
And search crawlers/results have been producing snippets that prevent users from clicking to the source for well over a decade.
Edit: it loaded. I don't see how the problem isn't simply solved by an off-the-shelf solution like Cloudflare. In the real world, you wouldn't open up a space/location if you couldn't handle the throughput. Why should online spaces/locations get special treatment?
This is no different than saying “robbers aren’t causing any problems, you just need to lock your doors, buy and set up sensors on every point of potential ingress, and pay a monthly cost for an alarm system. That’s on you.”
If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.
Like, what if you actually post something that gains traction, is it going to bankrupt you or something?
It's not just some light bump in traffic. It's a headache that shouldn't need to be dealt with if they would respect robots.txt. Quite simple really.
If your site exists to share information, then the information gets disseminated, whether via LLM or some browser, it doesn't make a difference to me
Why are you presenting the latter option as if it were mainstream? It's such a small percentage of use cases that it probably isn't even a rounding error.
People who want to disseminate information also want the credit.
I'd still like to know why you are presenting this false dichotomy. What reason do you have for presenting a use case that has fractions of a percentage as if it were a standard use case? What is your motivation behind this?