You would think having a massive scale just means it has infringed even more copyrights, and therefore should be in even more hot water.
If you're only training on a handful of works then you're taking more from each of them, meaning it's not de minimis.
For the record, I got this legal theory from Cory Doctorow[0], but I'm skeptical. It's very plausible, but at the same time, we also thought sampling in music was de minimis until the Second Circuit said otherwise. Copyright law is extremely malleable in the presence of moneyed interests, sometimes even without Congressional intervention!
[0] who is NOT pro-AI; he just thinks labor law is a better bulwark against it than copyright
The word probabilities are a transformative use, which is a form of fair use, and aren't an issue.
The specific output at each point in time is what would be judged to be fair use or copyright infringing.
I'd argue the user would be responsible for ensuring they're not using the output in a copyright-infringing manner, i.e. for profit, since they fed the inputs into the model that led to the output. In the same way, you can't sue Microsoft because someone typed up copyrighted works in Microsoft Word and then distributed them for profit.
De minimis is still helpful here: not all infringements are noteworthy.
It's up to you if that counts as "a handful" or not.
If you were a director at a game company and needed art in that style, it would be cheaper to have the AI do it instead of buying from the artist.
I think this is currently an open question.
An AI-enhanced Photoshop, however, could do wonders, as the base capabilities seem to be mostly there. I haven't used any of the newer AI stuff myself, but https://www.shruggingface.com/blog/how-i-used-stable-diffusi... makes it pretty clear the building blocks are largely there. So my guess is the main disconnect is in making the machines understand natural language instructions for how to change the art.
I would think that if I can recognize exactly what song it comes from, it's not de minimis.
But most people don't want to live in permanent mental distress due to shame of past action or fear of rebellion, I guess.
More generally, we tend to view the number of casualties in a war as one large number, and not as the sum of all the individual tragedies it represents, which we do perceive when fewer people die.
My point isn't to argue merits of that case, it's just to point out that OP's joke is like a stereotypical output of an LLM: seems to make sense, but really doesn't.
1) the purpose and character of the use,
2) the nature of the copyrighted material,
3) the *amount* and *substantiality* of the portion taken, and
4) the effect of the use upon the *potential market*.
So in that regard, if you're training a personal assistant GPT and use some software code to teach your model logic, that is easy to defend as fair use.
But the extent of use matters: if you're training an AI for the sole purpose of regurgitating specific copyrighted material, that is infringement, provided the material is copyrighted. In this case, though, it is not a copyright issue; it is a matter of contracts and NDAs.