I believe that the main reason for SO's decline starting around 2018 was that most of the core technical questions had been answered. There was an enormous existing corpus of accepted answers around fundamental topics, and technology just doesn't change fast enough to sustain the site. Then the LLMs digested the site's (beautifully machine-readable) corpus along with the rest of the internet and now the AIs can give users that info directly, resulting in a downward spiral of traffic to SO, fewer new questions, etc.
Vale, Stack Overflow. You helped me solve many tricky problems.
Even current AI is a 100x better experience than SO ever was.
We can all see how post-knowledge-scarcity economics and automated contextual niche adaptation reduce the exploitation potential of knowledge production (often itself mere regurgitation), but the 'cures' proposed in the article feel very much worse than the disease.
I believe that the rise of SO was mostly a miracle. A once-in-an-era thing. The evidence is all the other Stack Exchange sites. They all have the same UI and the same moderation model. If SO has some secret sauce they have it too. But most of them were pretty dead and never became an enormous corpus.
Stack Overflow peaked in 2014 before beginning its decline. How is that at all related to GenAI? GPT-4 is when we really started seeing these things get used to replace SO, etc., and that would be early 2023 - and indeed the drop gets worse there - but after the COVID-era spike, SO was already crashing hard.
Tailwind's business model was providing a component library built on top of their framework. It's a business model that relies on the framework being good enough for people to want to use it to begin with, but bad enough that they'd rather pay for the component library than build it themselves. The more comfortable it is to use, the more productive it is, the worse the value proposition for the premium upsell. Even other "open core" business models don't have this inherent dichotomy, much less open source as a whole, so it's really weird to try to extrapolate this out.
The thing is, people turn to LLMs to solve problems and answer questions. If they can't turn to the LLM to solve that problem or answer that question, they'll either turn elsewhere, in which case there is still a market for that book or blog post, or they'll drop the problem and question and move on. And if they were willing to drop the problem or question and move on without investigating post-LLM, were they ever invested enough to buy your book, or check more than the first couple of results on google?
I always found it very frustrating that for a person at the start of the learning curve it was "read only"
Actually asking a naive question there was a way to get horribly flamed. The site, and the people using it, were very keen to explain how stupid you were being.
LLMs on the other hand are sweet and welcoming (to a fault) of the naive newbie
I have been learning shell scripting with the help of LLMs; I could not have achieved that using SO.
Good riddance
I should note that the first time I asked a question, I also spent hours reviewing other unanswered questions - and, if memory serves, while I didn't have solutions for any of them, I gave concrete examples of how to debug the problems to narrow down the variables.
I'd estimate that this was around 2012. Never went back after that.
AI's biggest feature is being able to ask it a question without getting humiliated and judged in the process.
Before the LLM era there was actually the problem that Google often surfaced SEO spam sites that harvested content from Stack Overflow.
If the web gets flooded with LLM output and you train on it naively, you’re effectively training on your own prior. That pushes models toward the mean: less surprise, less specificity, more template-y phrasing. It’s like photocopying a photocopy: the sharp edges disappear.
The fix isn’t “never use synthetic data.” It’s to treat it like a controlled ingredient: tag provenance, keep a high-quality human/grounded core, filter aggressively, and anchor training to things that don’t self-contaminate (code that compiles/tests, math with verifiable proofs, retrieval with citations, real user feedback). Otherwise the easiest path is content volume, and volume is exactly what kills signal.
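A minimal sketch of what "treat synthetic data as a controlled ingredient" could look like. The corpus schema, the `provenance` field, and the `runs_cleanly` check are all illustrative assumptions, not any lab's actual pipeline; the point is just that human samples form a trusted core while synthetic ones must pass a self-contained verifiable check:

```python
# Provenance-aware filtering for a (hypothetical) training corpus:
# human/grounded samples pass through; synthetic code samples are kept
# only if they survive an objective anchor - here, "the snippet executes".

def runs_cleanly(code: str) -> bool:
    """Anchor for synthetic code: keep it only if it actually runs."""
    try:
        exec(compile(code, "<sample>", "exec"), {})
        return True
    except Exception:
        return False

def filter_corpus(samples):
    kept = []
    for s in samples:
        if s["provenance"] == "human":
            kept.append(s)  # trusted human core, always kept
        elif s["provenance"] == "synthetic" and runs_cleanly(s["text"]):
            kept.append(s)  # grounded by an executable check
        # untagged or failing samples are dropped entirely
    return kept

corpus = [
    {"provenance": "human", "text": "print('hello')"},
    {"provenance": "synthetic", "text": "x = 1 + 1"},
    {"provenance": "synthetic", "text": "x = 1 +"},  # broken: filtered out
    {"provenance": "unknown", "text": "y = 2"},      # untagged: filtered out
]
print(len(filter_corpus(corpus)))  # → 2
```

The same shape works for the other anchors mentioned above: swap `runs_cleanly` for "tests pass", "proof checks", or "citation resolves".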
Humans are amazing machines that reduce insane amounts of complexity in bespoke combinations of neural processors to synthesize ideas and emotions. Even Ilya Sutskever has said that he wasn't, and still isn't, clear at a formal level why GPT works at all (i.e. the interpretability problem), but GPT was not a random discovery; it was based on work that was an amalgamation of Ilya's and others' careers and biases.
In fact, this might be an overall good thing, because original content will finally be in high demand, since those companies now use it to train their models. But we are probably just in a transition phase.
The other thing is that new sources of input will come, probably from LLM usage itself: the middle layer gets cut out, since what users type into the LLM is also a form of input. A hybrid co-creation between users and AI would generate content at a much faster rate, which again would be used to train the models and improve their quality.
Even today people don't even trust themselves to write an email. You see people on HN and Reddit openly admitting they used AI to help write their post because they believe they cannot write. The march to illiteracy and ignorance is already underway.
Oh boy I can already see what kind of articles those would be.
This is ridiculous - AI doesn't need to be fed a PDF of a Terraform book to know how to use Terraform. Blowing out the context with hundreds of OCR'd pages of generic text about Terraform isn't going to help anything.
The model that is broken is really, ultimately, "content for hire". That's the industry that is going to be destroyed here, because it's simply redundant now. Actual artwork, actual literature, actual music... these things are all safe as long as people actually want to experience the creations of others. Corporate artwork, simple documentation, elevator music... these things are done; I'm sorry if you made a living making them, but you were ultimately performing an artisanal task in a mostly soulless way.
I'm not talking about video game artists, mind you, I'm talking about the people who produced Corporate Memphis and Flat Design here. We'll all be better off if these people find a new calling in life.
You are talking about some video game artists, and while not their fault directly, EA pushing out SportsBall 2026 for the 40th year in a row is just a soulless corporate money printing machine.
Rather, we became the product.
I do not know what will replace it, but I will not miss websites trying to monetise my attention
People today may have a better sense of the downsides of ad-based services than we did when the internet was becoming mainstream. Back then, the minor inconvenience of seeing a few ads seemed worth all the benefits of access all the internet had to offer. And it probably was. But today the public has more experience with the downsides of relentless advertising optimization and audience capture, so there might be more business models based on something other than advertising. Either way, GenAI advertising is certainly coming.
Yes and no. I complained about ads to a partner back in 1999. He seemed surprised, and said something to the effect of "well that's how the content you are consuming is paid for".
My argument (25 years ago now) was that it wasn't the ads as much as it was the ads blocking content, slowing the page load, being intrusive, etc.
So, even "Back Then" it was an issue. Now, it's on steroids with all the aforementioned behaviours being even worse now (e.g., reload an entire page on mobile in order to load a new ad, pages jumping around as new ads are loaded, more pop-ups, etc.) but exacerbated by the privacy nightmare of weaponized data collection.
I'm not disagreeing with you, I just think the underlying issues were evident very early in the "www" environment.
Of course ChatGPT might deny it, but it is just the tip of the iceberg.
My actual worry is that we might use these models and think we are private when in actuality we are not. We are probably going to see an open source model which is really good for such purposes while being able to run on normal hardware (Macs etc.) without much hassle. I tried liquidfm on my Mac: it's a 1B model, it has some flaws, and it isn't completely uncensored, but to me it does feel like a more compact and even uncensored model could be built for very simple purposes.
The ONLY reason we are here today is that OpenAI, and Anthropic by extension, took it upon themselves to launch chatbots trained on whatever data sources they could get in a short amount of time, to quickly productize their investments. Their first versions didn't include any references to the source material, and just acted as if they knew everything.
When CoPilot was built as a better auto-complete engine, trained on open source projects, it was an interesting idea, because it was doing what people already did: they searched GitHub for examples of the solution, or were nudged in that direction. The biggest difference, however, was that using another project's code was stable, because it came with a LICENSE.md that you then agreed to and paid forward (i.e. "I used code from this project").
CoPilot initially would just inject snippets for you, without you knowing the source. Only later did they walk that back: now, if you use CoPilot, it shows you the most likely source of the code it used. This is exactly the direction all of the platforms seem headed.
It's not easy to walk back the free-for-all system (i.e. Napster), but I'm optimistic over time it'll become a more fair, pay to access system.
The irony is pretty deep with this one.
I am not sure if they could've gotten a trademark license from Inscryption, or if they even needed one, but if they'd really wanted to: I find Inscryption's Ouroboros card looks the best, and it was honestly how I discovered the Ouroboros in the first place! (It became my favourite card; I love Inscryption.)
https://static1.thegamerimages.com/wordpress/wp-content/uplo...
Even just searching for the Ouroboros on the internet gave me some genuinely beautiful Ouroboros illustrations (some stock photos, some not), but even using a stock photo might have been a better idea than using an AI-generated Ouroboros image.
Oh man, y'know, I once beat a Kaycee's Mod run in Inscryption without doing any blood sacrifices the whole run, after I got bored once I had beaten all of Kaycee's Mod's challenges.
The gecko deck was amazing. I even played some mods of Inscryption. It was so lovely.
The people who play Inscryption and the people who don't aren't the same, lol.
Played it during covid/end of it too & I still sing its
Let's discuss some Inscryption, tell me your favourite card combo.
Mine was the Ouroboros plus the sigil that returns a card to your hand whenever it dies. So say you have the Ouroboros on the board, plus a squirrel and a wolf in hand:
you place the squirrel, then sacrifice the Ouroboros and the squirrel to play the wolf - but now the Ouroboros comes back to your hand with better stats for the rest of the game. Eventually it becomes so strong that you just need a magpie card (or a similar sigil) to search your deck for the Ouroboros, and it's game over.
Combining the Ouroboros with the cockroach sigil plus the three-blood sigil is essentially an unlimited-win game if you put the cockroach sigil on the blood goat too.
The Gecko/Stoat are my second favourites. Before doing the puzzles you could stack as many sigils as possible, and I had the best stoat in the whole world, lol - it had like 4-5 sigils, iirc :)
What about you? There were also some really fascinating mechanics in the mods that I played; definitely worth a play!
Copyright was predicated on the notion that ideas and styles cannot be protected, but that explicit expressive works can. For example, a recipe can't be protected, but the story you wrap around it about how your grandma used to make it would be.
LLMs are particularly challenging to wrangle with because they perform language alchemy. They can (and do) re-express the core ideas, styles, themes, etc. without violating copyright.
People deem this 'theft' and 'stealing' because they are trying to reconcile the myth of intellectual property with reality, and are also simultaneously sensing the economic ladder being pulled up by elites who are watching and gaming the geopolitical world disorder.
There will be a new system of value capture that content creators need to position for, which is to be seen as a more valuable source of high quality materials than an LLM, serving a specific market, and effectively acquiring attention to owned properties and products.
It will not be pay-per-crawl. Or pay-per-use. It will be an attention game, just like everything in the modern economy.
Attention is the only way you can monetize information.
The ONLY things that matter when determining whether copyright was infringed are "access" and "substantial similarity". The first refers to whether the alleged infringer did, or had a reasonable opportunity to, view the copyrighted work. The second is more vague and open-ended. But if these two, alone, can be established in court, then absent a fair use or other defense (for example, all of the ways in which your work is "substantially similar" to the infringed work are public domain), you are infringing. Period. End of story.
The Tetris Company, for example, owns the idea of falling-tetromino puzzle video games. If you develop and release such a game, they will sue you and they will win. They have won in the past and they can retain Boies-tier lawyers to litigate a small crater where you once stood if need be. In fact, the ruling in the Tetris vs. Xio case means that look-and-feel copyrights, thought dead after Apple v. Microsoft and Lotus v. Borland, are now back on the table.
It's not like this is even terribly new. Atari, license holders to Pac-Man on game consoles at the time, sued Philips over the release of K.C. Munchkin! on their rival console, the Magnavox Odyssey 2. Munchkin didn't look like Pac-Man. The monsters didn't look like the ghosts from Pac-Man. The mazes and some of the game mechanics were significantly different. Yet, the judge ruled that because it featured an "eater" who ate dots and avoided enemies in a maze, and sometimes had the opportunity to eat the enemies, K.C. Munchkin! infringed on the copyrights to Pac-Man. The ideas used in Pac-Man were novel enough to be eligible for copyright protection.
It's a foundational principle of copyright law, codified in 17 U.S.C. § 102(b): "In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery"
Now, we can quibble over what qualifies there, but the dichotomy itself is pretty clear.
This goes back to Baker v. Selden (1879) and remains bedrock copyright doctrine.
The Tetris case is overstated. Tetris v. Xio did not establish that The Tetris Company "owns the idea of falling-tetromino puzzle video games." The court explicitly applied the idea-expression dichotomy and found Xio copied specific expressive choices (exact dimensions, specific visual style, particular piece colors). Many Tetris-like games exist legally, and it is the specific expressive elements that were considered in the Xio case.
K.C. Munchkin is old and criticized. That 1982 ruling predates major developments like Computer Associates v. Altai, which established more rigorous methods for filtering out unprotectable elements. The Munchkin decision continues to be debated.
"Substantial similarity" analysis itself incorporates idea-expression filtering. Courts use tests specifically designed to separate protectable expression from unprotectable ideas, especially when considering the four factors of fair use (when applied as a defense.)
For example, copyright duration is far longer than most people think (life of the author plus seventy years, or ninety-five years for corporate works). Corporations treat copyright as a way to create moats and freeze out competitors rather than as a creative endeavor. Most creative works earn little to nothing anyway, while a tiny minority generate most of the revenue. And it's not easy to enforce a copyright, or at least it isn't perceived to be easy, so again it incentivises those who can afford lawyers to navigate the legal environment. Also, enforcement of copyright law requires surveillance and censorship.
Truthfully, I think there will come a time when people look at current copyright law the same way we now look at guilds in the Middle Ages.
Also just like SEO to game search engines, "democratized RLHF" has big trust issues.
But, I have to admit that a few years ago LLMs became part of my daily workflow for low-risk repetitive tasks, and sometimes even "sound boarding".
Further, the "model collapse" hypothesis of 2020/2021 seems to have failed to materialize. Maybe we're still too early, and we're not yet seeing negative effects of OpenAI training on OpenAI output. But maybe "slop" is not being rewarded as much as human content, and having humans in the loop (even as readers) is preventing a slide into incoherence.
Will LLMs eventually disincentivize people from producing and publishing new original content? If that content is easily replicated by an LLM query, maybe. And maybe it's not the worst thing in the world. 5 years ago I would have bought an "FFmpeg Cookbook" from O'Reilly, but now I would just tell Claude exactly what I'm trying to achieve. As a consumer, I'm better off, and arguably we've saved the author of a hypothetical FFmpeg Cookbook weeks out of their precious life. Weeks they could spend doing something—anything—more valuable than rewording FFmpeg documentation.
1. I pay OpenAI
2. OpenAI revenue-shares with StackOverflow
3. StackOverflow mostly keeps that money, but shares some with me for posting
4. I get some money back to help pay OpenAI?
This is nonsense. And if the frontier labs are right about simulated data, as Tesla seems to have been right with its FSD simulated visualization stack, does this really matter anyway? The value I get from an LLM far exceeds anything I have ever received from SO or an O'Reilly book (as much as I genuinely enjoy them collecting dust on a shelf).
If the argument is "fairness," I can sympathize but then shrug. If the argument is sustainability of training, I'm skeptical we need these payment models. And if the argument is about total value creation, I just don't buy it at all.
That seems to be the argument: LLM adoption leads to a drop in organic training data, leading LLMs to eventually plateau, and we'll be left without the user-generated content we relied on for a while (like SO) and with subpar LLMs. That's what I'm getting from the article, anyway.
Still, as for the claim that organic data (or "pre-war steel") is drying up: it's not a threat to model development at all. People repeating this point don't realize that we already have way more data than we need. We got to where we are by brute-forcing the problem - throwing more data at a simple training process. If new "pristine" data were to stop flowing now, we would still a) have decent pre-trained base models, and a dataset that's more than sufficient to train more of them, and b) have lots of low-hanging fruit to pick in training approaches, architectures and data curation that will let us get more performance out of the same base data.
That, and the fact that synthetic data turned out to be quite effective after all, especially in the latter phases of training. No surprise there, for many classes of problems this is how we learn as well. Anyone who has experience studying math for maturity exam / university entry exams knows this: the best way to learn is to solve lots of variations of the same set of problems. These variations are all synthetic data, until recently generated by hand, but even their trivial nature doesn't make them less effective at teaching.
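The exam-drill analogy above can be made concrete. A minimal sketch of generating "variations of the same problem" as synthetic training data; the template and format here are made up for illustration, not any lab's actual pipeline:

```python
import random

# Generate variations of one problem template. Because the generator
# knows the ground truth by construction, every sample is trivially
# verifiable - the synthetic set anchors itself.

def make_variation(rng):
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = f"What is {a} * {b} + {a}?"
    answer = a * b + a  # ground truth comes free with the generator
    return {"question": question, "answer": answer}

rng = random.Random(0)          # seeded for reproducibility
batch = [make_variation(rng) for _ in range(100)]

print(len(batch))  # → 100
```

Just as with hand-made exam drills, the individual samples are trivial; the value is in cheap, endless, correct-by-construction variation.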
That said, what it misses is that the AI prompts themselves become a giant source of data. None of these companies are promising not to use your data, and even if you don't opt-in the person you sent the document/email/whatever to will because they want it paraphrased or need help understanding it.
that is the argument, yes.
Claude clearly got an enormous amount of its content from Stackoverflow, which has mostly ceased to be a source of new content. However, unlike the author, I don't see any way to fix this; Stackoverflow was only there because people had technical questions that needed answers.
Maybe if the LLMs do indeed start going stale because there's not enough training data for new technologies, Q&A sites like Stackoverflow would still have a place, since people would still resort to asking each other questions rather than asking LLMs that don't have training data for a newer technology.
The new world is one where someone can have an LLM assisted insight, post it on their blog for free, have it indexed by every agentic search engine, and it becomes part of the zeitgeist. That’s the new data that’ll feed the new models: a better information diet over time. And guess what else: models are getting better at identifying - at scale - the high quality info that’s worth using as training data.
There is a huge difference between giving eggs to your neighbours and sending them to a breakfast restaurant in another state. A connection is made between you and the neighbors. A community forms around it.
What would be the point of having a machine write your thoughts so they can shift a model's weights by infinitesimal amounts? How would that compare to building a small following, getting reader mail, and maybe even meeting a few of them?
My website is used as training data. I get nothing from it. The AI twists words I have carefully selected, misleads people, and strips me of the fruit of my labour. If I lost my entire audience, I would stop doing that work.
Moreover, there is a cost to producing high quality information. Research alone takes a long time when you're putting information online for the first time. I wouldn't do it for a vague chance of affecting the Zeitgeist.
If they did actually stumble on AGI (assuming it didn’t eat them too) it would be used by a select few to enslave or remove the rest of us.
No one in power is going to help unless there's money in it.
Also who's this Dario?
Example 1 is bad: StackOverflow had clearly plateaued and was well into freefall by the time ChatGPT was released.
Example 2 is apparently "open source" but it's actually just Tailwind which unfortunately had a very susceptible business model.
And I don't really think the framing here that it's eating its own tail makes sense.
It's also confusing to me why they're trying to solve the problem of it eating its own tail - there's a LOT of money being poured into the AI companies. They can try to solve that problem.
What I mean is - a snake eating its own tail is bad for the snake. It will kill it. But in this case the tail is something we humans valued and don't want eaten, regardless of the health of the snake. And the snake will probably find a way to become independent of the tail after it ate it, rather than die, which sucks for us if we valued the stuff the tail was made of, and of course makes the analogy totally nonsensical.
The actual solutions suggested here are not related to it eating its own tail anyway. They're related to the sentiment that the greed of AI companies needs to be reeled in, they need to give back, and we need solutions to the fact that we're getting spammed with slop.
I guess the last part is the part that ties into it "eating its own tail", but really, why frame it that way? Framing it that way means it's a problem for AI companies. Let's be honest and say it's a problem for us and we want it solved for our own reasons.
> For each response, the GenAI tool lists the sources from which it extracted that content, perhaps formatted as a list of links back to the content creators, sorted by relevance, similar to a search engine
This literally isn't possible given the architecture of transformer models, and there's no indication it will ever be. Also, Anthropic is doing interesting work in interpretability; who knows what could come out of that.
And could be snake oil, but this startup claims to be able to attribute AI outputs to ingested content: https://prorata.ai/
Well, they could always try actually paying content creators. Unlike - for instance - StackOverflow.
There isn't any clean way to do "contributor gets paid" without adding in an entire mess of "ok, where is the money coming from? Paywalls? Advertising? Subscriptions?" and then also get into the mess of international money transfers (how do you pay someone in Iran from the US?)
And then add in the "ok, now the company is holding payment information of everyone(?) ..." and data breaches and account hacking is now so much more of an issue.
Once you add money to it, the financial incentives and gamification collide to make it simply awful.
Essentially, Reddit is also eating its own tail to survive, as the flood of low-quality irrelevant content is making the platform worse for speakers of all languages, but nobody cares because "line go up."
Actually we can. And we will.
There's also huge financial momentum shoving AI down the world's throat. Even if AI were proven to be a failure today, it would still be pushed for many years because of that momentum.
I just don't see how that can be reversed.