I think my main point here is that “legal” does not imply moral or acceptable to society, and our understanding of the technical legal status is not a prerequisite for exploring those factors, which may be the thing that changes the legal status in response to the major shift in landscape.
You risk nothing by assuming things are legal until explicitly illegal.
They are not similar, and I suspect that if they were (i.e. humans could absorb that much information), the information landscape and the market models for exchanging value would look nothing like they do today, and AI wouldn’t be rocking the boat, it’d just be another adherent to the resulting laws.
You can't take a regime that works decently with human-rate copying and convert it to computer-rate copying, because fundamentally the give-and-take of rights to each side is balanced against feasible limits of reproduction.
Or, to put it another way, if you can copy/synthesize at most 1 book a day, I can extend you a lot more implicit rights... than I can afford to someone who can copy/synthesize every book ever in a day.
IMO google and their massive google books DB would have a better leg to stand on here if they trained on that dataset as they owned physical copies of all the books.
The problem with current AI is that they memorize stuff, there is the case with the AI memorizing an algorithm perfectly, or reciting quotes from Dune and then getting censored.
Now you as a paying user of this AI tools are not making reviews but probably using them for commercial purposes and it would not be fiar if your proprietary code would use code copy pasted from GPL code.
If this AI would be so clever then IMO you could have them laarn say Python exactly like a human, a few books and some exercises on python, some books on algorithms, some books on html or whatever tech. But today they train with the full github and you get a mix of stuff. My suggestion would also improve the sorry state of JS in ChatGPT where it uses super old syntax and still uses outdated pattern like it is coding for IE6. My guess this is because it is train with old or bad code and this mean a=most of the code from now one will be old syntax and bad
Citation needed.