Would we use the same word if two different humans wrote code that solved two different problems, but one part of each problem was somewhat analogous to a different aspect of a third human's problem, and the third human took inspiration from those parts of both solutions to create code that solved a third problem?
What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?
I think "plagiarism", "inspiration", and just "learning from" fall on some continuous spectrum. There are clear differences when you zoom out, but they are in degree, and it's hard to set a hard boundary. The key is just to make sure we have laws and norms that provide sufficient incentive for new ideas to continue to be created.
Programmers are not expected to add an addendum to every file listing all the books, articles, and conversations they've had that have influenced the particular code solution. LLMs are trained on far more sources that influence their code suggestions, but it seems like we actually want a higher standard of attribution because they (arguably) are incapable of original thought.
This isn't just giving credit; it's valuable documentation.
If you're later looking at this function and find a bug or want to modify it, the original source might not have the bug, might have already fixed it, or might have additional functionality that is useful when you copy it to a third location that wasn't necessary in the first copy.
It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.
1) Verbatin copy is first-order plagiarism.
2a) Second-order plagiarism of written text would be replacing words with synonyms. Or taking a book paragraph by paragraph and for each one of them, rephrasing it in your own words. Yes, it might fool automated checkers but the structure would still be a copy of the original book. And most importantly, it would not contain any new information. No new positive-sum work was done. It would have no additional value.
Before LLMs almost nobody did this because the chance that it would help in a lawsuit vs the amount of work was not a good tradeoff. Now it is. But LLMs can do "better":
2b) A different kind of second-order plagiarism is using multiple sources and plagiarizing each of them only in part. Find multiple books on the same topic, take 1 chapter from each and order them in a coherent manner. Make it more granular. Find paragraphs or phrases which fit into the structure of your new book but are verbatim from other books. See how granular you can make it.
The trick here is that doing this by hand is more work than just writing your own book. So nobody did it and copyright law does not really address this well. But with LLMs, it can be automated. You can literally instruct an LLM to do this and it will do it cheaper than any human could. However, how LLMs work internally is yet different:
n) Higher-order plagiarism is taking multiple source books, identifying patterns, and then reproducing them in your "new" book.
If the patterns are sufficiently complex, nobody will ever be able to prove what specifically you did. What previously took creative human work now became a mechanical transformation of input data.
The point is this ability to detect and reproduce patterns is an impressive innovation but it's built on top of the work of hundreds of millions[0] of humans whose work was used without consent. The work done by those employed by the LLM companies is minuscule compared to that. Yet all of the reward goes to them.
Not to mention LLMs completely defear the purpose of (A)GPL. If you can take AGPL code and pass it through a sufficiently complex mechanical transformation that the output does the same thing but copyright no longer applies, then free software is dead. No more freedom to inspect and modify.
[0]: Github alone has 100 million users ( https://expandedramblings.com/index.php/github-statistics/ ) and we have reason to believe all of their data was used in training.
As for your free software is dead argument: I think it is worse than that: it takes away the one payment that free software authors get: recognition. If a commercial entity can take the code, obfuscate it and pass it off as their own copyrighted work to then embrace and extend it then that is the worst possible outcome.
> What previously took creative human work now became a mechanical transformation of input data.
At which point I find myself wondering if there's actually a problem. If it was previously permitted due to the presence of creative input, why should automating that process change the legal status? What justifies treating human output differently?
> then free software is dead. No more freedom to inspect and modify.
It seems to me that depends on the ideological framing. Consider a (still entirely hypothetical) world where anyone can receive approximately any software they wish with little more than a Q&A session with an expert AI agent. Rather than free software being dead, such a scenario would appear to obviate the vast majority of needs that free software sets out to serve in the first place.
It seems a bit like worrying that free access to a comprehensive public transportation service would kill off a ride sharing service. It probably would, and the end result would also probably be a net benefit to humanity.
Perfectly embodies the AI "startup" mentality. Nice.. /s
Maybe we could resolve the bit of a conundrum by the op in requiring 'agents' to give credit for things if they did rag them or pull them off the web?
It still doesn't resolve the 'inherent learning' problem.
It's reasonable to suggest that if 'one person did it, we should give credit' - at least in some cases, and also reasonable that if 1K people have done similar things ad the AI learns from that, well, I don't think credit is something that should apply.
But a couple of considerations:
- It may not be that common for an LLM to 'see one thing one time' and then have such an accurate assessment of the solution. It helps, but LLMs tend not to 'learn' things that way.
- Some people might consider this the OSS dream - any code that's public is public and it's in the public domain. We don't need to 'give credit' to someone because they solved something relatively arbitrary - or - if they are concerned with that, then we can have a separate mechanism for that, aka they can put it on Github or Wikipedia even, and then we can worry about 'who thought of it first' as a separate consideration. But in terms of Engineering application, that would be a bit of a detractor.
I think it should.
Sure, if you make a small amount of money and divide it among the 1000 people who deserve credit due to their work being used to create ("train") the model, it might be too small to bother.
But if actual AGI is achieved, then it has nearly infinite value. If said AGI is built on top of the work of the 1000 people, then almost infinity divided by 1000 is still a lot of money.
Of course, the real numbers are way larger, LLMs were trained on the work of at least 100M but perhaps over a billion of people. But the value they provide over a long enough timespan is also claimed to be astronomical (evidenced by the valuations of those companies). It's not just their employees who deserve a cut but everyone whose work was used to train them.
> Some people might consider this the OSS dream
I see the opposite. Code that was public but protected by copyleft can now be reused in private/proprietary software. All you need to do it push it through enough matmuls and some nonlinearities.
What if it was just a single person? I take it you didn't read any of the code in the ocaml vibe pr that was posted a bit ago? The one where Claude copied non just implementation specifics, but even the copyright headers from a named, specific person.
It's clear that you can have no idea if the magic black box is copying from a single source, or from many.
So your comment boils down to; plagiarism is fine as long as I don't have to think about it. Are you really arguing that's ok?
It is actually worse: plagiarism is fine if I'm shielded from such claims by using a digital mixer. When criminals use crypto tumblers to hide their involvement we tend to see that as proof of intent, not as absolution.
LLMs are copyright tumblers.
Indeed, and up until the advent of 'AI' we did. But that incentive is being killed right now and I don't see any viable replacement on the horizon.