I'd be all for forcing these companies to open source their models. I'm game to hear other proposals. But "just stop contributing to the commons" strikes me as a very negative result here.
We desperately need better legal abstractions for data-about-me and data-I-created so that we can stop using my-data as a one-size-fits-all square peg. Property is just out of place here.
If I put something on GitHub under a GPLv3 license, anyone with access to the binary is supposed to also get access to the source code. The concern is that someone can train an LLM on your GPL code, and then a for-profit corporation can use the output (or any clever algorithms you've come up with) and effectively "launder" your GPL code, making money in the process. It would basically convert your code from copyleft to public domain, which I think a lot of people would have an issue with.
Copyright and copyleft only deal with distribution of the code. Your last sentence is simply not factually accurate.
If you really believe in the open source and free software mentality, that code should be available to help everyone and that improvements should also be available rather than locked up behind a corporate wall (e.g., a company shipping modified GPL code without redistributing the source), then LLMs should be the least of your worries, since they don't do that. On a literal level they don't violate GPLv2/v3.
Perhaps copyright law needs new concepts to respond to this change in capability, but so far companies and individuals litigating AI companies for copyright violations have had very little legal success. Direct violations have been rare and only get rarer over time as training methods evolve.
That said, haven’t some of the complaints about Copilot and the like been specifically that they reproduce large chunks of code verbatim?
Wait, are you kidding? This is literally a problem we have today with tools like Copilot.
Another point: there is a lot of freely and permissively licensed content to train AI on where the GPL, and copyright generally, can be respected. In many cases, the violating AI companies knew what they were doing was wrong.
There are no "commons" in this scenario. A few frontier labs own everything (having taken it without attribution), and they have the ability to take it away, or to raise prices to the point where it becomes a tool for the rich.
Nobody is doing this for the good of anything, it’s a money grab.
I don't wanna look a gift horse in the mouth here. I'm happy to have benefited from whatever contributions were originally forthcoming and I wouldn't begrudge anybody for no longer going above and beyond and instead reverting to normal behavior.
I just don't get it. It's like you're opposed to people building walls, but you see a particularly large wall that makes you mad, so your response is to go build a wall yourself.
This is why I think permissive licenses are a mistake for most projects. Unlike copyleft licenses, they allow downstream users to strip the very freedoms they enjoyed from users of derivative works. It's no surprise that dishonest actors take advantage of this for their own gain. This is the paradox of tolerance.
"AI" companies take this a step further, and completely disregard the original license. Whereas copyleft would somewhat be a deterrent for potential abusers, it's not for this new wave of companies. They can hide behind the already loosely defined legal frameworks, and claim that the data is derivative enough, or impossible to trace back, or what have you. It's dishonest at best, and corrupts the last remnants of public good will we still enjoy on the internet.
We need new legal frameworks for this technology, but since that is a glacial process, companies can get rich in the meantime. Especially shovel salespeople.