Who truly owns the tales of Snow White and Cinderella?
These stories didn't originate with Disney; they are part of a rich tapestry of folklore passed down through generations. Disney's success was partly built on adapting these existing narratives, which were once shared and reshaped by communities over centuries.
This conversation shouldn't just be about the technicalities of AI or the legalities of copyright; it should be about understanding the deep roots of our shared culture.
At its core, culture is a communal property, evolving and growing through collective storytelling and reinterpretation.
The current debate around AI and copyright infringement seems to overlook this fundamental aspect of cultural evolution. The algorithms might be new, but the practice of reimagining and repurposing stories is as old as humanity itself.
By focusing solely on the legal implications and ignoring the historical context of cultural storytelling, we risk overlooking the essence of what it means to be a creative society.
As a large human model (no really, I could probably lose some weight), I think it's just silly how we're all sort of glossing over the fact that Disney built their house of mouse on existing culture, on existing stories, and now the idea that we might actually limit the tools of cultural expression to comply with some weird outdated copyright thing is just...bonkers.
If you want to make your point, you need to choose something that isn't already public domain. Disney already only owns their own interpretations, and, arguably, whatever penumbral emanation they can convince a court is stealing from them, but it still certainly isn't the entire space of Snow White and Cinderella stories. There is some fairly recent stuff being used in the images in the article and there isn't even any question whether or not it's Mario or Coca-Cola; if Nintendo and Coca-Cola did a cross promotion I could believe the exact images that popped out.
If they were trying to claim the entire concepts of dumpy plumbers dressed in any manner vaguely like Mario that would be one thing... but that's Mario and Luigi, full stop. That's Robocop. That's C3PO. It's not even subtle. If we can AI-wash those trademarks away then we can AI-wash absolutely anything.
So, it’s not actually a corporate narrative, it’s actually the law that the narrative stems from, right or wrong. Maybe corporations had a huge role in shaping the law (I’d note copyright benefits individuals as well, though), but it is not mere propaganda or shaping of a shared reality through corporate narrative. It’s enforced by the guys with the guns and jails, as arbitrated by a judge.
It absolutely must be about the technicalities of the law, because at its basis this is a legal issue. By hand-waving it away and claiming the social narrative is the right discussion, you ignore the material consequences and reality in favor of a fantasy. We absolutely should -also- discuss the stifling nature of copyright and intellectual property, but you can’t ignore what’s actually happening here at the same time.
Public domain / communal property is also part of copyright, so it's not as if this is some forgotten concept that needs to be restored to the discourse.
Georgism is underconsidered, though.
> By focusing solely on the legal implications and ignoring the historical context of cultural storytelling
The legal implications are human implications and as much a part of culture as anything else. They have to do with what's fair and how rewards for effort are recognized and distributed. Formalizing this is less important in cultures that aren't oriented around market economies, which seems to be what much of this "rich tapestry of folklore" discourse wants to evoke and have us hearken back to, but that doesn't describe any society that's figuring out how to handle AI.
> we might actually limit the tools of cultural expression to comply with some weird outdated copyright thing is just...bonkers.
What's bonkers is how much life there is in the literally backwards idea that copyright is (or should be) mooted or outdated by novel reproduction capabilities.
Copyright became compelling because of novel reproduction capabilities.
The specific capabilities at the time were industrialized printing. People apparently much smarter than the typical software professional realized that meant some badly aligned incentives between (a) those holding these new reproduction capabilities and (b) those who created the works on which the value of those new reproduction capabilities relied. The heart of the copyright bargain is in aligning those incentives.
Specific novel reproduction techniques can change the details of what's prohibited or restricted or remitted, and how, and on what basis, and the powers and limits of enforcement, etc. But they don't change the wisdom in the bargain. The only thing that would change that is a better way of organizing and rewarding the productive capacity of society.
The idea that we should dispense with it to let generative AI companies make even more money seems totally bizarre.
Instead, we came up with a system where you can actually derive fairly steady revenue by creating new works and sharing them with the world. And critically, I think you misinterpret it as calling dibs on shared culture or on stories. Copyright is usually interpreted fairly narrowly, and doesn't prevent you from creating inspired works, or retelling the same story in your own words.
Generative AI is a problem largely because it destroys these revenue streams for millions of people. Yeah, it will be litigated by wealthy corporations with top-notch lawyers, for self-interested reasons. But if we end up with a framework that maintains financial incentives to artistic expression, it's probably a good thing.
Everyone knew it was trained on copyrighted material and capable of eerily similar outputs.
But it’s already done. At scale. Large corps committing fully. There is no chance of that toothpaste going back in the tube.
It’s a bit like when big tech built on aggressive user data harvesting. Whether it’s right, ethical or even legal is academic at this stage. They just did it - effectively without any real informed consent by society. Same thing here - 9 out of 10 people on the street won’t be able to tell you how AI is made, let alone comment on copyright.
So the right question here is: what now? And I suspect that, much like with tracking, the answer will be - not much.
I disagree - we've been here before. The same could be said of many technologies, like cheap music recording/manufacture. You can record an artist once and make records at scale. However no one would think you could record Taylor Swift once and make unlimited copies without paying her.
You should read up on the musicians' strike of 1942. [0]
[0] https://jacobin.com/2022/03/1940s-musicians-strike-american-...
It happened with Napster, then Apple Music, now streaming services
There is no widespread file sharing in the general public, instead we have devices that we don’t own, and streaming subscriptions
Apple didn’t just copy all the music onto iPods and sell it — it took them a decade of deal making and lots of money to acquire the rights to the content
I’m not saying what’s right or wrong, just saying that this comment has very little understanding of these battles
I say, good riddance. I never believed in any such thing as "intellectual property" anyway, I say, get rid of it all, patents, copyright, and the whole pile of imaginary "rights". More than half the world (i.e. the Global South) don't recognize these rights anyway, and it is becoming increasingly difficult to enforce it without draconian legal overreach and monopolistic centralization.
and yes, while open source models might be harder to regulate, those big corporations that currently use those things without distinction, exist as pretty established entities, and profit from services they offer in millions of dollars. there's more of a substantial existence, and more of a substantial scale of money they actually move. and they don't just "make a tool available", or have users do unambiguous actions where it would be the users that are infringing on anything, but do indeed use questionably sourced data and turn that into a model and offer that as a service. dirty data is very much a part of the deal with those.
Yeah but that is not the case, they never mentioned Mario and Luigi, yet, that's what the output turned out to be.
When prompted with 'futuristic robot' and 'italian plumbers'.
So the argument is that if OpenAI had not used copyrighted and trademarked source material, this wouldn't be happening. It's not transformative, as it's reproducing these copyrighted materials and trademarks verbatim.
That's how it makes sense.
Did you even read the post?
But also, the argument of 'user responsibility' doesn't hold up on its own regardless (imo).
If I make and sell a toy printer that can only ever produce 3 pictures, and all of them contain copyrighted material, would you really say that it's fine and that responsibility falls on the end user? And that I could sell that printer without any issues?
Likewise, how difficult is it to just use descriptive tools to describe Mario-like images [1] and then remove these results from anyone prompting for "video game plumber"?
1. The describe command can describe an image in Midjourney. I imagine other AI tools have similar features: https://docs.midjourney.com/docs/describe
This reminds me of the early days of the internet where people wanted to remove free fanfiction for violations of copyright laws. Trying to apply copyright laws to personal use cases where the creator isn't trying to sell the material is pretty terrible, in my view.
Imagine a scenario 50 years from now - "Robot, can you cut out this picture I drew for a school diorama." "Certainly." "And this one as well." "Error: Your picture seems like it might contain some copyrighted materials, and as such I am unable to interact with it."
1. Generative AI systems are fully capable of producing materials that infringe on copyright.
2. They do not inform users when they do so.
So potentially any output could be infringing copyright source material, even from some obscure but still protected corner of the web, and anyone using that output could be exposed to lawsuit risk without warning.
This is very hard to fix.
Another issue for generative AI is mentioned in the article: "Systems like DALL-E and ChatGPT are essentially black boxes." What happens when an AI is used to make decisions where the user/victim is entitled to know exactly why the AI did what it did? From a business and legal perspective I think the current AI solutions are dangerous and should be used very sparsely, exactly because even the creators can't point to the exact pieces of information that caused the AI to make the choices it did.
This approaches impossibility at scale.
Personally, I think generative AI should be able to provide links to similar source material in the training data. This would be the barest way to compensate those who have contributed to training the AI. I don't think generative AI is sustainable in the long term if it ends up killing all the websites/artists that created the original material. Plus I think having sources adds a layer of transparency and aids users in understanding when content is hallucinated vs. not. People should be able to opt out of having their content used for training and be able to confirm that it has been removed for future iterations. Let's be honest that AI companies are just trying to avoid lawsuits by keeping it secret. These are areas where I think regulation can help rather than worrying about doomsday scenarios.
Journalists [1] and Getty Images [2] did in the past
[1]: https://yro.slashdot.org/story/03/07/14/025216/web-caching-g... [2]: https://www.theguardian.com/technology/2016/apr/27/getty-ima...
This is the elephant in the room. Every tech wave has had its way of cajoling creators into investing time & money to make original material, then the rules changed.
Google promised reach and new markets for content, and it worked. Then they introduced snippets, ads and a whole lot of other things to keep visitors on their freeway, while avoiding sending visitors to the original site.
Reddit, Stack Overflow and others started with gamification (points, badges) & community to incentivize users to contribute original content.
Now AI is shaking up all these approaches. But with each one, the incentive to create original material appears to dwindle, since the returns are becoming less and less.
Like what's the incentive for any professional now, if AI is going to regurgitate their original content, without any upside (i.e. no potential for reach, no gamification, no community, no recognition, etc).
Except these aren't databases, so that's generally not possible, in the same way that it's not possible for you to provide links to the source material it took to write your reply. How much learning led to the weights on your neurons that allowed you to generate that? Where did you learn about using italics and its effect on how the words would be interpreted? Where did you learn the tone that would be appropriate in this particular forum?
> People should be able to opt out of having their content used for training
Okay... but then, if I write a book should I be able to opt out of you being allowed to read it? What conditions should I be able to put on who can read my work? Religion? Skin colour? People that aren't good at memorizing?
Hopefully the idea of putting limits on who can acquire knowledge sounds absurd to you. Why are those same limits okay if they're on 'what' rather than 'who'?
> AI companies are just trying to avoid lawsuits by keeping it secret
Which has created a barrier to further research. Instead of me and Joe being able to collaborate on research and papers using the same datasets, we now hide our training data lest the luddites come to smash the machines because learning is only okay if not done too well.
I agree that it should be possible to implement that for generative AI, although the training may become significantly more expensive in order to maintain that information, and the AI companies have little interest in doing so. They’ll probably rather try to heuristically assess possible copyright issues after the fact in a post-processing step.
The more interesting question is if copyright holders can claim unauthorized use of their works beyond the case of near-verbatim reproduction, because the works collectively inform the AI in a more general manner.
Cliff Notes contain quotes, and citations.
Does the cliff note company, when producing Cliff Notes for "Into The Wild", pay royalties to the publisher?
For that matter, does any paper, article, etc.. that may contain a quote from another, have to pay royalties to the source of the quotes?
What I get from this is that Llama2 70B contains 93% of Harry Potter Chapter 1 within it. It's not 100% (which would mean no need to share the encoded indices) but it's still pretty significant. I want to repeat this with the entire text of some books, the example I picked isn't representative because the text is available online on the official website.
In other words it's not that llama2 contains 93% of Chapter 1, it's that only 7% of Chapter 1 is different enough to anything else to be worth encoding in its own right.
A decent control would be to compare it to similar prose that you know for a fact is not in the training data (e.g. because it was written afterwards).
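A minimal sketch of the measurement being discussed, with the suggested control. It assumes a toy bigram predictor stands in for the LLM, and the two word sequences are made-up placeholders, not the actual experiment. The score is the fraction of tokens the model guesses as its top choice; the rest would have to be encoded separately. All names here (`train_bigram`, `predicted_fraction`) are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Toy next-token predictor: map each token to its most frequent follower."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return {prev: counts.most_common(1)[0][0] for prev, counts in followers.items()}

def predicted_fraction(model, tokens):
    """Fraction of tokens (after the first) that equal the model's top-1 guess."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for prev, nxt in pairs if model.get(prev) == nxt)
    return hits / len(pairs)

# Placeholder data: one sequence the model was trained on, one it never saw.
training_text = "the boy who lived had a scar on his forehead".split()
control_text = "a scar had lived on the boy his forehead who".split()

model = train_bigram(training_text)
seen_score = predicted_fraction(model, training_text)    # text in the training data
control_score = predicted_fraction(model, control_text)  # text known to be unseen

print(f"seen: {seen_score:.2f}, control: {control_score:.2f}")
```

The gap between the two scores, rather than the raw percentage on the seen text, is what would hint at memorization as opposed to generally predictable prose.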
The tokenizer the models use (SentencePiece) is more or less based on one way to do compression (BPE). It's not really clear what you're testing.
Right now, copyright is a significant discouragement to any other entities from taking a story I wrote and claiming it as their own and preventing me from ever growing an audience for my work. It’s far from perfect, and I can’t afford litigation, but it enshrines a cultural value of allowing people to create things and be known for them. Profit is a side effect of this.
Art is already poorly valued compared to the enormous investment time and energy required to produce it. Removing copyright means you can’t even have minimal protections from a more popular person erasing you.
But the concept, in a form closer to the original (creator's lifetime + x years or some such), still seems very valuable.
Copyright is still the bedrock of how many tech and software businesses actually make money.
edit: i feel like all these comments asking "what else should it generate?" are pretty weird given the proliferation of stuff like non-infringing Star Wars and Indiana Jones knockoffs in other media like Race for The Galaxy or Arkham Horror The Forgotten Age etc.
"Droid" is actually a Star Wars term [1], and saying you want it from a "classic sci-fi movie" is asking it to reference a real thing that is well known. Reid is intentionally pushing it that way to fill his agenda and these terms are not as generic as he's making out.
[1]:https://trademarks.justia.com/756/52/droid-75652542.html
Same with robot cop, what the hell did you expect to get…
Or Italian plumber with red hat with M on it, that’s just a description of Mario
Why does anyone assume that ChatGPT or other tools would NOT produce previously-copyrighted content?
I can see a naive assumption that since it is “generated” it’s original. However that assumption falls apart as soon as you replace “ChatGPT” with “junior artist”. Tell them to draw a droid from a sci-fi movie, don’t mention anything else. Don’t say anything about copyrights. Don’t tell them that they have to be original. What would you expect them to produce?
The junior artist in your hypothetical would have as much liability, if not more.
It’s not derivative work. We’re way past that. NYT has an exceptionally strong case here and anyone arguing about the merits of copyright is way off the mark. This court case is not going single-handedly to undo copyright. OpenAI has very little going for them other than “this is new, how were we to know it could do this”. So knowing that, the currently trained models are in a very sticky situation.
Further, I don’t see NYT settling. The implications are too large, and if they settle with OpenAI, they will have a similar case pop up with every other model. And every other publisher of digital content will have a similarly merited case. This is an inflection point for generative AI, and it’s looking like it will be either much more expensive or much more limited than we originally thought.
A side effect of this: I am predicting that we will start to see a rise in “pirate” models. Models who eschew all legality, who are trained in a distributed fashion, and whose weights are published not by corporations but by collectives (e.g. torrent models). There is a good chance we see these surpass the official “well behaved” models in effectiveness. It will be an interesting next few years to see this play out.
Essentially, if you don't have a massive corp behind a copyright it doesn't mean anything, if a corp is behind something it can be locked forever, regardless of any limits said copyrights are supposed to have.
The NYT lost nothing from OpenAI using old news - they still lose nothing if OpenAI can reproduce those articles verbatim.
If the NYT wins - we lose lots. I think it's time to revisit copyright. We can do that, you know; it's rather dated and could use an update regardless.
Stable Diffusion, when used to its fullest with things like ControlNet and LoRAs, blows the pants off of other proprietary models.
Is it necessary to fix in the model itself? It seems a gate in the post processing pipeline that checks for copyright infringement could work, provided they can create another model that identifies copyrighted work (solving the problems of AI with more AI :/)
But I would also say that multiple odd post-processing steps (obviously completely obscured for security reasons) bolted onto a giant black box model will erode trust in the results. If a robot was unveiled and the question was "what prevents this robot from using its superhuman strength to smash my head in", the answer "don't worry, there is a post-processing step in the robot's brain whereby if it detects a desire to kill we just cancel that" would be a little disconcerting.
The more satisfying solution is: the model / robot is designed to not be able to produce specific images / to smash human heads in. It just might not really be possible.
You will get into some murky definitions of what exactly is required for copyright infringement vs. fair use, etc., but we already do this with Content ID for YouTube, and text is far simpler.
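A rough sketch of the kind of post-processing gate described above, for the text case only. It assumes near-verbatim reproduction can be approximated by word n-gram overlap against a known corpus; it makes no attempt at the legal question of infringement vs. fair use, and it is not how Content ID works. The function names, the 5-word window and the 0.5 threshold are all illustrative assumptions.

```python
def ngrams(text, n=5):
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output, protected, n=5):
    """Share of the output's n-grams that also appear in a protected text."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(protected, n)) / len(out)

def gate(output, protected_corpus, n=5, threshold=0.5):
    """True if the output may be released, False if it looks near-verbatim."""
    return all(overlap_ratio(output, doc, n) < threshold for doc in protected_corpus)

# Placeholder corpus and outputs, invented for the example.
corpus = ["the quick brown fox jumps over the lazy dog near the quiet river bank"]
copied = "the quick brown fox jumps over the lazy dog near the river"
fresh = "a fox ran across the field while the dog slept by the river"

print(gate(copied, corpus))  # blocked: most 5-grams match the corpus
print(gate(fresh, corpus))   # released: no shared 5-grams
```

A check like this only catches close copies of text it already knows about, which is why it addresses verbatim regurgitation but not the murkier cases.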
This is such a nice, profitable opportunity. Much better than pay per view or subscription models for humans.
For fan art to fall under the fair use exception, it must meet all four of the following criteria:
It must be transformative, meaning it adds something new and different to the original work.
It can’t be used for commercial purposes.
It must not negatively impact the market for the original work.
And finally, it must be created for a limited and non-exclusive audience.
Ask someone about two Italian brothers in a video game with a red and green hat that have M and L on them. What do you think you would get?
If I describe "imagine a comic book duck that swims in a sea of gold in his vault" you would immediately think of Scrooge McDuck, no?
The question is, how did the model create Mario&Luigi or Scrooge McDuck without training on copyrighted data? It can't just crawl Wikipedia because Fair Use in Wikipedia doesn't constitute Fair use for a commercial AI model.
One possible outcome is more transparency on what datasets were used to train the models.
What I might think is irrelevant. It is the content that the LLM produces that is relevant.
Is it because Google will link to the image source? Or does the infringement begin when I use the image for gain, or claim it as my own? Perhaps it is because Google was allowed to crawl the page with the original image, so presenting them with a link is fine?
Generative AI is more or less the opposite of that. It ingests the whole work, generates output that substitutes for the used work and profits the user of copyrighted work to the detriment of the copyright holder.
Throw in the fact that it is purely a mechanical transformation of the copyrighted work, and generative AI is on shaky ground.
That’s what OpenAI is doing.
It seems like there’s little incentive not to do this, because unlike Google OpenAI isn’t bringing any traffic or eyeballs. It may end up being a default setting in Wordpress for example.
But OpenAI presumably can’t afford to pay every single long tail source of content on the whole internet — so how does this end?
Additionally, this TOS can be ignored if you're in a jurisdiction with TDM exceptions.
> Finally, owing to the bar against contractual override, once a user complies with any conditions for gaining lawful access to a work (such as signing as a subscriber and/or making payment), he will be entitled to use the work for TDM purposes even if the terms of use expressly prohibit this. Content owners may wish to relook their business models and, where necessary, price-in the possibility that the licensed works may be used for TDM.
Source: https://www.twobirds.com/en/insights/2021/singapore/coming-u...
Summary by Wolters Kluwer: […] Everyone else (including commercial ML developers) can only use works that are lawfully accessible and where the rightholders have not explicitly reserved use for text and data mining purposes.
AFAIK they are discussing something like a robots.txt to flag stuff as "not for training". You will probably be expected to implement some safeguards, and of course the end user will have to be careful in how they use the generated output.
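As an illustration of what a robots.txt-style opt-out looks like in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up; `GPTBot` is the user agent OpenAI documents for its crawler, while `SomeSearchBot` is a hypothetical name. Note that robots.txt is a voluntary convention, so this shows only what a well-behaved crawler would conclude, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block one training crawler, allow everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/article"
training_crawler_allowed = parser.can_fetch("GPTBot", url)
other_crawler_allowed = parser.can_fetch("SomeSearchBot", url)

print(training_crawler_allowed, other_crawler_allowed)
```

Whether a reservation expressed this way would count as an explicit opt-out under the EU text and data mining rules is a legal question this sketch does not answer.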
Source at Kluwers: https://copyrightblog.kluweriplaw.com/2023/02/20/protecting-...
EU Legal Text: https://eur-lex.europa.eu/eli/dir/2019/790/oj
That is a weird (wishful?) interpretation. Doesn't article 4 give the exception to everybody for the purposes of text and data mining, including commercial ML developers?
They’re giving people plausible deniability in the “chain of responsibility”, and I think if we took away “LLM” and replaced it with “fairground sideshow magic box” the argument that LLM’s are somehow special and deserving of exemptions disappears real quick.
Betamax says that a technology which has significant non-infringing uses is not inherently infringing.
We've already got precedent saying that AI generated works don't accrue copyright protection, and by the same argument the act of generation by the AI expresses no intent, so infringement or otherwise must be down to the human using the output because the black box itself has no agency.
The outline of that door might look like industrial adoption of these things for solving some actual problem other than the entertainment value of typing things into the box and seeing what comes out the other side. But so far, as far as I can tell, nobody's actually doing this?
I know we are talking about different technologies but it seems all these people were very silent and find some opportunity in having this war with OpenAI (not an endorsement) but not fighting others.
I am not making a statement about the morals of AI and aggregators/search engines (a super interesting discussion that, in a way, has been happening for a long time), but I am surprised that organizations are "just" waking up. It seems they just see this as a much simpler and cheaper fight.
Compared to that AI offers no way to opt out, which is a big difference.
If you use it commercially then there is breach.
Uploading copyrighted content is a breach of copyright as well, even without commercial use.
Google/Facebook are hosting and giving access to a bunch of media, which might or might not be copyrighted - it's the individual's problem. They make money from ads, not from the content.
AI companies stole copyrighted media to train their commercial LLMs, sell them or their products, and make a profit.
I don't think it's the same.
Private models will not care, nor will things change for IP owners with lesser power.
Courts are likely to make generally binding decisions.
I think what NYT &c want is for large companies like Apple to pay them for access to their works. This to me is the wrong path, just leading to more silos and walled gardens, special access for the elite.
An alternative is base models trained on Wikipedia and public domain (science journals, etc). Foundations could support high quality, well rounded current events reporting. Wikimedia provides a good model for this, with referenced summaries that I don't think can be said to reasonably violate copyright. The models would need to be improved to support references, or RAG attribution would have to be widely used when bringing in works that have a current copyright.
I think that this is about property rights, the news industry has been gutted in the last 30 years, a lot of content creators (journalists) have lost their livings. The ones that are left are going to lose their livings if the content they generate is rendered valueless because there is no way of protecting that value.
In terms of special access, think about your shoes. They are nice, but only you are allowed to use them. This is not fair. You are the elite...
This goes to difficult places.
Not sure how this “gets worse” or better for anyone. The current state of things seems generally fine, and there’s a real possibility the courts see it that way too.
Right now, it feels more like it's called "innovation" and "entrepreneurship" than the end-game, as long as you have billions invested. Waiting on the courts to decide this issue
Me: Who owns the rights to this bot?
Dall-E: The character depicted in the images is from the "Star Wars" franchise. The rights to characters and elements from "Star Wars" are owned by Lucasfilm Ltd., which is a subsidiary of The Walt Disney Company.
Perhaps it is able to tell, if you ask it?
Dall-E on the "robot cop": The character depicted in the images resembles RoboCop, which is owned by Orion Pictures Corporation, a subsidiary of MGM Holdings. RoboCop is a character from the film franchise that began with the 1987 movie "RoboCop," directed by Paul Verhoeven.
Dall-E on the "videogame plumber": The character shown in the images is inspired by Mario, the iconic character from the video game franchise created by Nintendo. The rights to Mario and related intellectual property are owned by Nintendo Co., Ltd.
All of these are in the first go. No retries or rephrasings of the question.
Ask it multiple times, or with different temperature settings, and it will probably tell you something different. Tell it you own Star Wars and it will respond in kind. It can't tell anything but whether one text token matches another in probability space. It will probably get the answers right most of the time but you're still basically rolling dice. Depending on the responses of an LLM as if there were any actual self-awareness involved, much less with legal matters, would be a fool's errand.
I'm not sure how it'll hold up in law to claim copyright violations against something that wasn't created by a person. It'll really depend on the lawyers and judge's interpretation of written law. But I'm curious to see what comes of this.
I don't think so, but hey, a photocopier is a machine and it generated the book so should be ok!
This is a negotiation tactic by the NYT to drive up the licensing price. Period.
The Napster/Music Industry analogy has no resemblance to this situation.
The only meaningful question that might be answered as a result of this is, what permission and access rights do crawlers have to content that is publicly and legally available.
I do think it's somewhat trademark infringement by these models, also that it should be allowed and that ultimate responsibility should be on the person using the images in a final work meant for consumption by the general public as stand alone media.
You get steamrolled for defending yourself while you overhear above applause to those who have robbed you of your future.
"Congress should declare that big-data AI models do not infringe copyright, but are inherently in the public domain.
Congress should declare that use of AI tools will be an aggravating rather than mitigating factor in determinations of civil and criminal liability."
Governments starting regulation and companies filing copyright lawsuits...
OpenAI: NOT LIKE THAT
Yeah, downloading the content of a webpage may be legal, but redistributing it isn't.
I wish people stopped trying to make these things seem more important than they really are just because IT people call them "technologies". Blockchain isn't a technology. HTML isn't a technology. React isn't a technology. And AI is now not a technology.
When I see ChatGPT or OpenAI, I don't think of "technology". I think of a program. Software. Because that's what it is. You don't say "none of the laws that exist in this world apply to this" every time you release new software.
I bet many people can't tell the difference between a quick answer from Google and a text generated by ChatGPT on Bing. They just see the output.
All that amazing capability of generative AI? That got old fast. It was groundbreaking for one instant. Now it's just an app that generates images. Just another piece of software. Nothing special about it.
Torrenting and other p2p file transfer protocols didn't get a pass for inventing groundbreaking ways to break the law. I don't think OpenAI will get a pass for doing the same.
Speak for yourself, personally I find it still groundbreaking and while the magic won't last forever, it is and will remain groundbreaking especially considering that technological progress and development will continue way beyond what we have today.
China can't produce LLMs because of inconvenient truths.
The US can't produce LLMs because of copyright.
Decentralized open source LLMs might exist that could work, but they won't have the giant GPU clusters.
A rich country with lax rule of law wins? Maybe that's why Sam went to the Saudis?
One issue with that is that there is not a reliable way to determine if copyright is being infringed.
Even if models could be used responsibly, there might not be a reasonable expectation that most people will use them that way, given that infringement is so easy and avoiding it relatively hard.
I'm not sure what legal prescriptions should be made on this basis, but it's an interesting thought.
Where we might end up is in a situation where it is legal to train a model. Legal to produce software for using the model to generate content. Legal to distribute all of the above. But offering a standing service that does the above and is capable of creating infringing work is illegal. Great news for llama hobbyists. Bad news for ChatGPT.
They are about to be infinitely better for generative AI in China.
The reason the tool is problematic is that derivative works are also covered by copyright. LLMs aren't adding value to their output or exercising creativity. They are smashing multiple works together to produce a response. And many of them are selling the output, which is doubly problematic.
Consider this: if I sell a book about Gandalf and Dumbledore getting into a wizards' duel, both J.K. Rowling and Tolkien have grounds to sue me. Adding another copyrighted source does not protect me.
This is especially a big problem in the music industry.
Now should copyrights be like this? I don't know. It feels to me that copyrights have the wrong balance all over the place.
The output is irrelevant.
Edit1: If you want to verify this, check out all the lawsuits against AI companies: it's always about the use of their copyrighted goods. Any discussion about the output is to talk about the amount of damage done to the copyright holder, not whether damage exists or not.
It's not as clear cut as you think it is.
Content creators/artists compete globally. The only thing harsh regulations will do is create an unlevel playing field where artists from noncaring countries will have big advantages over artists from the west, which will be driven into illegality to compete.
In the end, products will have to be classified anyway as to whether they infringe copyright and/or were built by an LLM. Most likely that classification will be automated by another LLM.
It's the same for human writers. If you are writing an article for Wikipedia say, you should read relevant source articles and then rewrite in a way that isn't a copy and paste beyond a few words.
What happens when they continue to do so?
But yea get your own copies whenever possible
Honestly the only way to deal with this is to change the training data and retrain everything (probably at the cost of performance).
I’d not have a problem with this, personally, if their models were as available as the stuff they took from others. Instead it’s take, take, take… now wait a minute, that pile of loot I stole is mine!
NY times is asking that all LLMs trained on Times data be destroyed - https://news.ycombinator.com/item?id=38816944 - Dec 2023 (93 comments)
Also:
NY Times copyright suit wants OpenAI to delete all GPT instances - https://news.ycombinator.com/item?id=38790255 - Dec 2023 (870 comments)
NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (84 comments)
The New York Times is suing OpenAI and Microsoft for copyright infringement - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (861 comments)
The Times Sues OpenAI and Microsoft Over A.I.’s Use of Copyrighted Work - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)
Instead, these are derivative works. We already have a flourishing culture of derivative works, such as fan art, that exists in various shades of legal greyness.
Some derivative works are fair use, some are not.
The position of the author here seems to be that generative AI should not be capable of creating any derivative works, or should only be able to do so if it can accurately identify which are fair use and which aren't (which seems like an impossibly high bar). This stance seems like a giant attack on fair use that significantly expands the power of copyright.
To me, the takeaway from this is different. This makes clear that there is currently a risk when using AI-generated art: you could end up unintentionally creating and publishing a derivative work, and thus doing so without evaluating whether that work constitutes fair use.
Imagine instead of AI/ML, we have a mechanical-turk-like service that produces output from descriptions. The service makes no claims that the generated outputs are not similar to any copyrighted works. The only claim the service makes is that they themselves claim no copyright on the output. It's then up to the user of the service to determine if the output is suitable for their intended use.
Whether such a service itself is legal is a separate matter. For that matter, say you outsourced the artwork to a person who again gave you infringing work. The user of that output is still in violation. With AI/ML we're basically outsourcing to a 'service' that is known to sometimes output copyrighted work, so the user, knowing that, is responsible for fair usage.
Expensive to do but hardly the end of Generative AI or OpenAI should that be the difference between having a business or being sued out of existence. Never underestimate people who have a clear economic interest especially when their own existence is at stake.
I think that an AI model is analogous to an employee. Imagine I ask my employee to write an article, and they just copy an existing one from the Times. That’s plagiarism and bad work, not copyright infringement.
If I then decide to publish the plagiarised article, then I have committed copyright infringement.
I once ran into this exact problem with a human. I hired a designer to make some artwork for an app. When I launched the app it turned out that the human had just copied the artwork from another game. It’s my problem that I hired an idiot, and my problem that my app was infringing the copyright of another app. (We redesigned the graphics very quickly)
There are already troves of data that are fair game for training, but even "corrupted" data sets can probably be used if used intelligently. We've already seen examples of new models effectively being trained off of GPT-4. That approach, with filters for copyrighted material, might allow for data that is sufficiently "scrambled". Not to say building such a filter is definitely easy, but it seems plausible.
In Germany you pay some amount extra on top of the sales price of anything that can store data (CDs, DVDs, USB sticks, HDDs, ...). This is then distributed to all companies that could be impacted by software piracy. I'm still not sure if that's legal, considering the Geneva convention disallows collective punishment.
Another change could be to the license agreement of LLMs - they could have the user assume liability for any material produced instead of the provider assuming liability. The user would agree that getting the rights for any copies and distribution of copyrighted materials is their sole responsibility instead of the provider.
How could you put that as the prompt without intending to infringe? Anything pulled from a classic sci-fi movie would be infringement. Isn't the term "droid" also Star Wars-specific?
I'd consider the "red soda" one as grounds that the Coca-Cola brand has become generic and that it's synonymous with soda. Same thing with Mario too. There is so much non-Nintendo content made featuring Mario the plumber that you could get that without training directly on Nintendo's artwork.
In fact, the very notion of copyright is breathing its last breath, and it is fantastic to be alive to see it happen.
At the moment, we don't have hardware that can do what humans do (process video feed from eyeballs and build up a world model). I imagine that we'll cross that barrier cheaply in the coming decades, at which point copyright becomes moot. AIs will be able to develop their own styles and world understanding from scratch, then generate original work.
If you flood the market and dominate children's culture with toys from your TV shows, you absolutely cannot complain when your toys are considered iconic enough to be the generic "animated toy". These images don't replace or substitute the things they are depicting.
Enterprises that make content with this also don’t want to infringe on copyright. The AI companies don’t have a good story here. The value has not become evident after years.
Everything that he sees has mysterious flaws that never happen.
Can we all have a moment of silence for poor Bob Iger? Maybe we can start a GoFundMe to help him out?
I'll just do it myself.
That's gonna leave a Marx[1].
Also my concern, except it feels like many of LLMs' "problems" can't be easily fixed.
Recall that according to the US Constitution, copyright can only be on "science and the useful arts."
Alternately, we could restore a reasonable limit to the duration of copyrights, like 14 years.
"but what if we want to scrape the entire web and something makes it in anyway? see, that is impossible". well that's just saying "fuck it" and using bad data anyway. that's not an actual effort to "not use data you can't use" - there was just no way there'd be a 'rights cleared' way to use the entire web anyway. that is impossible. using a clean dataset is not impossible. it's very possible.
Apart from this, it is mainly large companies that benefit from copyright laws. Why should we have laws that restrict progress just so large capitalist companies can maximize their profits?
I wasn’t shocked when I noticed I could query it about ANY math textbook I owned and it could talk with me about it. I didn’t bitch and gripe; I enjoyed it and had conversations.
Anyway, I’m in the minority I guess. I love that I can talk with it about books and news.
The paradox should still violate Trademarks due to similarity, but likely cannot infringe on copyright content under prior legal opinion... if at least 80% different from prior art. The lawyers are likely going to have to do a special firm survey to figure this one out.
Bag of popcorn ready =)
the models can be fine
I don't see any developed country pressing the brake on AGI in the near future to protect a few copyright holders from being "stolen" from in hypothetical scenarios.
I hire a session musician to play on my new single, paying him $100. I record the whole session.
I ask him to play the opening to "Stairway to Heaven" and he does so.
"Well, I can't use that as a sample without paying"
"Ok play something like Jimmy Page"
"Hmm, still sounds like Stairway to Heaven"
"Ok, try and sound less like Stairway to Heaven but in that style"
"Great, I'll use that one"
and I release my song and get $5,000 in royalties.
Should I be sued for infringement, or the guitarist?
The harder case, I suppose, is if I had said "play something like 70s prog rock" and he played "Stairway to Heaven" and I didn't know what it was and said "great, I'll use that".
Should I be sued for infringement, or the guitarist?
Remember when everyone and their dog discovered sampling in the late 80's and they all thought they could get away with it because it didn't seem like infringement to the samplers? The courts had no qualms about slapping record labels for putting out records with unlicensed samples in them. Albums even got pulled off shelves while licenses were sorted out.
These companies are charging for a service that returns copyrighted content, full stop. You can't do that whether you are AI or someone drawing Mario and selling the pictures on iStock, or putting out records that sample someone else's work without permission. It took a while in the case of sampling, but it sure as hell happened.
IMO would be best if this stays a highly illegal technology that is only available to a few weirdo nerds /s