LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.
The second piece of evidence is the paper explained here: https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... Instead of an LLM, the researchers used gzip compression as the model for a text classification task, and it even beat trained language models on some benchmarks.
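For anyone wondering how a compressor can act as a "model" at all, here is a rough sketch of the idea as I understand it: measure how much better two texts compress together than apart (normalized compression distance), then let the nearest labeled examples vote. The function names and the toy training set are my own for illustration, not the paper's code.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more shared structure."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Majority vote among the k training texts closest to the query."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled data, invented purely for illustration.
TRAIN = [
    ("the quick brown fox jumps over the lazy dog near the river bank", "animals"),
    ("a small grey cat chased the mouse across the kitchen floor today", "animals"),
    ("quarterly revenue exceeded analyst expectations by twelve percent", "finance"),
    ("the central bank raised interest rates to slow rising inflation", "finance"),
]

if __name__ == "__main__":
    print(classify("the dog chased the fox along the river", TRAIN, k=1))
```

Note there are no trained weights anywhere: the "knowledge" is just the labeled texts plus a general-purpose compressor, which is kind of the point.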
AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights violation suits. You can’t make a database of scraped copyrighted data and pretend that querying that data is fair use.
Law needs to be made here, and that law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.
Licensing will have to be worked out, and actual laws, not just case law, need to be written. I have a lot of sympathy for giving lots of leeway to the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft-sponsored OpenAI.