They pay for it.
They buy the books. They buy tickets to the theatre. They buy entrance to the gallery.
The trick being played now is: "Hey, we don't have to pay, since it's not a person" (to the creator), but "Hey, it's just like a person when it learns!" (to the legal system).
If AI models require human training data, then they should pay for it. Easy.
Books3 has separate legal concerns, but Google has a legally acquired corpus of tons of books, which they've mostly cleaned up from scans (probably far better than IA has), and have probably used to train Bard on. Their lawyers must be biting their nails waiting to see how these lawsuits turn out, though.
Until AGI arrives, or some other method of training LLMs from the ground up on sparse examples by incrementally building on structural knowledge of language.... training on ridiculous amounts of copyrighted content is required. Not because anyone wants to copy those works, but because training that way fills in for a lack of real-world experience that every child gets, which includes consuming and interacting with a bunch of copyrighted content that isn't tracked because it's not practical to do so.
You could train an LLM only on Project Gutenberg, and the LLM would churn out stilted English and the occasional iambic pentameter. That's great if you want works that seem like they were written over a century ago, but nearly useless otherwise.
>Until AGI arrives, or some other method of training LLMs from the ground up on sparse examples by incrementally building on structural knowledge of language.... training on ridiculous amounts of copyrighted content is required.
That's not my problem. The AI model folk should just ask permission from, and compensate, the people whose work they use as training data.
>You could train an LLM only on Project Gutenberg, and the LLM would churn out stilted English and the occasional iambic pentameter. That's great if you want works that seem like they were written over a century ago, but nearly useless otherwise.
Not my problem. Why are the problems of the wonderful AI developers suddenly human, global problems that we all have to find a way to fix?
If they want access to training data, they should pay for the privilege.
First, there's this thing called a deposit library: https://en.wikipedia.org/wiki/Legal_deposit
Second: I, as a user of the service, who learns things, still pay nothing.
Should I be required to directly pay for the things I learned from, or is it sufficient that someone is? Because if the latter, then picking up a book from a normal (non-deposit) library, showing it to an OCR system, and having an AI learn from that, would involve just as much payment as I ever made to read a library book (with the possible exception of late return fines, I can't remember if I ever had any of those).
Physical library books are governed by the doctrine of first sale. That's why Google has one of the largest (maybe excluding l-bg-n and IA) corpora of books on the internet. They might have the cleanest corpus of OCR'd book content of anyone, since IA uses commercial or open source OCR and that's it, while Google for a long time used reCAPTCHA to check OCR results.
For physical books, the cost per read of a library book is an order of magnitude smaller than the cost per read of privately purchased books. How can you tolerate the economic model of libraries when the net effect is a theft of maybe 80%-95% from the author and publisher? Libraries subsidize books that nobody wanted to read, but steal from authors and publishers whose books are read multiple times per physical copy.
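For what it's worth, the 80%-95% range above falls out of simple per-read arithmetic. A minimal sketch, assuming (my assumption, not a sourced figure) that each library copy is bought once at retail and then read somewhere between 5 and 20 times, and that every one of those readers would otherwise have bought their own copy. The function name `revenue_shortfall` is made up for illustration.

```python
def revenue_shortfall(reads_per_copy: int) -> float:
    """Fraction of 'one purchase per reader' revenue that is not collected
    when a single purchased copy is read `reads_per_copy` times.

    Assumes every reader would otherwise have bought a copy, which is the
    most pessimistic case from the publisher's point of view.
    """
    if reads_per_copy < 1:
        raise ValueError("a copy must be read at least once")
    # One sale is spread over n reads, so 1/n of the hypothetical revenue
    # is collected and the remaining 1 - 1/n is the shortfall.
    return 1 - 1 / reads_per_copy


# Hypothetical loan counts per physical copy.
for reads in (5, 10, 20):
    print(f"{reads} reads per copy -> {revenue_shortfall(reads):.0%} shortfall")
```

Under those assumptions, 5 reads per copy gives the 80% end of the range and 20 reads gives the 95% end. If some borrowers would never have bought the book, the real shortfall is smaller.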
Even libraries' onerous ebook licenses are not commercial retail ebook pricing. They're just closer to retail pricing than the publishers could ever manage with physical books, because there's no pesky right of first sale which turns physical book libraries into piracy havens.
I would prefer to set aside OpenAI and Facebook and everyone else using potentially tainted sources like Books3. The obvious legal question for them isn't whether training was legal, but whether the acquisition of the training data was legal. That's a straightforward copyright issue, or at least as straightforward as fair use determinations can ever be. Whether or not we agree with copyright law as it stands, it's certain that copyright applies when Books3 is transferred around the internet. The only question to be considered is whether the fair use factors (how transformative the use is, how much the transfer of Books3 affects the market, and the other two) make those actions fair use.
The training aspect is where all the difference of opinion lies:
What is your position on Google using its corpus of books (legally acquired and possessed, as the content behind Google Books) to train an LLM? Do they need to acquire additional rights from copyright holders? Why, and under what legal theory?
How would they get permission ahead of time? How would they agree to a pricing model? Would they spend tens or hundreds of millions of dollars training a model, and only then negotiate with rights holders to find out whether the license fees they want will be economically viable? We all know that most major rights holders would never grant a one-time license fee. It would be perpetual rent-seeking from AI output. I don't see how any of these LLM or image generation models would be economical if rights holders had their way. The rights holders wouldn't mind that. They're notoriously slow to adopt tech, but if they did anything, they'd hire AI experts, build their own models, and license the models back to Google and Microsoft.
This doesn't cover everything: I did, indeed, also buy books on HTML and JS, and my first C compiler, and a licence to REALbasic[1]. But that doesn't refute the fact that I did learn a lot for free.
> If AI models require human training data, then they should pay for it. Easy.
You can do that if you like, but that won't stop any of the economic issues that arise. The cost of running Stable Diffusion is so low that even if you had literal slaves, and you were spending only the UN extreme poverty threshold on keeping them alive and housed, the pro-rata cost of keeping them alive for long enough to type in the prompt dominates the total cost of making images.
Right now these models are still, despite their impressiveness, flawed: while an artist can use them to great effect, most of us will have our generations easily spotted by some flaw we have never trained ourselves to notice. If the models become good enough to fully replace all artists, the only way the profession called "artist" isn't going to go the same way as the profession called "computer" is if the arts are to humans as fancy tails are to peacocks: the effort being the point, extravagantly wasting effort to show you're fit enough to manage fine despite the penalty.
[0] UK rules at the time, thanks to my dad's early retirement and therefore "low income" status
[1] as it was so named at the time, Xojo now
This will happen in every industry and occupation, one by one.
Is this what you think is desirable?
The alternative is perversely simple: PAY for the right to use training data.
And still will even if they (for any value of "they") do pay.
Unless… do you want them to pay the entire future economic value that, say, all programmers including myself might have added if we weren't about to be made redundant by the next coding LLM?
Because the historical analogy there is getting Raspberry Pi to pay out the entire global GDP for each Pi Zero, on the grounds of that model being able to do arithmetic as fast as the entire world, even if the entire world had been paid to work in the obsolete job role of "computer", after having been trained to operate reliably at the speed of the current world record holder.
> Is this what you think is desirable?
Post-scarcity economics, AKA "fully automated luxury communism" (not the book of the same name): https://en.wikipedia.org/wiki/Post-scarcity
They're saying, "You can't consume my book in that way."