0 points · kranke155 · 2y ago · 0 comments

Yes, and you know how humans acquire works to learn from?

They pay for it.

They buy the books. They buy tickets to theatre. They buy entrance to the gallery.

The trick being pulled now is: "Hey, we don't have to pay, since it's not a person" (to the creator), but "hey, it is just like a person when it learns!" (to the legal system).

If AI models require human training data, then they should pay for it. Easy.

8 comments · 3 top-level

harshreality · 2y ago · 3 in thread

False. Libraries exist. Borrowing books from neighborhood libraries or friends exists. Watching movies and TV with friends exists. Listening to music on the radio (yes, those free electromagnetic thingies) still exists. There are many, many, many free performances or accessible copies of all kinds of copyrighted content, plenty to train either a neural net or a human brain on.

Books3 has separate legal concerns, but Google has a legally acquired corpus of tons of books, which they've mostly cleaned up from scans (probably far better than IA has), and which they have probably used to train Bard. Their lawyers must be biting their nails waiting to see how these lawsuits turn out, though.

Until AGI arrives, or some other method of training LLMs from the ground up on sparse examples by incrementally building on structural knowledge of language, training on ridiculous amounts of copyrighted content is required. Not because anyone wants to copy those works, but because training that way fills in for a lack of real-world experience that every child gets, which includes consuming and interacting with a bunch of copyrighted content that isn't tracked because it's not practical to do so.

You could train an LLM only on Project Gutenberg, and the LLM would churn out stilted English and the occasional iambic pentameter. That's great if you want works that seem like they were written over a century ago, but nearly useless otherwise.

kranke155 (OP) · 2y ago

Libraries exist? Do you think books fly onto library shelves for free? As far as I know, someone bought them. Your neighbour or friend also bought the stuff. I suspect you're not being straight here; I just have to ignore this whole line of reasoning since it seems so absurd.

>Until AGI arrives, or some other method of training LLMs from the ground up on sparse examples by incrementally building on structural knowledge of language.... training on ridiculous amounts of copyrighted content is required.

That's not my problem. Those AI model folk should just compensate the people whose work they're using as training data, and they should ask for permission.

>You could train a LLM only on project gutenberg, and the LLM would churn out stilted English and the occasional iambic pentameter. That's great if you want works that seem like they were written over a century ago, but nearly useless otherwise.

Not my problem. Why are the problems of the wonderful AI developers suddenly human, global problems that we all have to find a way to fix?

If they want access to training data, they should pay for the privilege.

ben_w · 2y ago

> Libraries exist? Do you think books fly onto library shelves for free? As far as I know, someone bought them. Your neighbour or friend also bought the stuff. I suspect you're not being straight here, I just have to ignore this whole line of reasoning since it seems so absurd.

First, there's this thing called a deposit library: https://en.wikipedia.org/wiki/Legal_deposit

Second: I, as a user of the service, who learns things, still pay nothing.

Should I be required to directly pay for the things I learned from, or is it sufficient that someone is? Because if the latter, then picking up a book from a normal (non-deposit) library, showing it to an OCR system, and having an AI learn from that, would involve just as much payment as I ever made to read a library book (with the possible exception of late return fines, I can't remember if I ever had any of those).

harshreality · 2y ago

I think humans should pay for permission to learn. Heaven forbid copyright holders don't get paid for all the material they've put out that humans are using (often stealing) to learn from in order to become useful members of society!

Physical library books are governed by the first-sale doctrine. That's why Google has one of the largest corpora of books on the internet (maybe excluding l-bg-n and IA). They might have the cleanest corpus of OCR'd book content of anyone, since IA uses commercial or open-source OCR and that's it, while Google for a long time used reCAPTCHA to check OCR results.

For physical books, the cost per read of a library book is an order of magnitude smaller than the cost per read of privately purchased books. How can you tolerate the economic model of libraries when the net effect is a theft of maybe 80%-95% from the author and publisher? Libraries subsidize books that nobody wanted to read, but steal from authors and publishers whose books are read multiple times per physical copy.

Even libraries' onerous ebook licenses are not commercial retail ebook pricing. They're just closer to retail pricing than the publishers could ever manage with physical books, because there's no pesky right of first sale which turns physical book libraries into piracy havens.

I would prefer to get away from OpenAI and Facebook and all the other people using potentially tainted sources like books3. The obvious legal question for them isn't whether training was legal, but whether the acquisition of the training data was legal. That's a straightforward copyright issue, or at least as straightforward as fair use determinations can ever be. Whether or not we agree with copyright law as it stands, it's certain that copyright applies when books3 is transferred around the internet. The only question to be considered is whether those actions are fair use: how transformative the use is, how much the transfer of books3 affects the market, and the other two factors.

The training aspect is where all the difference of opinion lies:

What is your position on Google using its corpus of books (legally acquired and possessed, as the content behind Google Books) to train an LLM? Do they need to acquire additional rights from copyright holders? Why, and under what legal theory?

How would they get permission ahead of time? How would they agree to a pricing model? Would they spend tens or hundreds of millions of dollars training a model, and only then negotiate with rights holders to find out whether the license fees they want will be economically viable? We all know that most major rights holders would never grant a one-time license fee. It would be perpetual rent-seeking from AI output. I don't see how any of these LLM or image generation models would be economical if rights holders had their way. They wouldn't mind. They're notoriously slow to adopt tech, but if they did anything, they'd hire AI experts, build their own models, and license the models back to Google and Microsoft.

ben_w · 2y ago · 2 in thread

I had a free school education (including Shakespeare and Ethan Frome, both of which are out of copyright now though only the former when I studied it); several free libraries; and with the exception of my final year even my university tuition was free[0]; after graduation the museums I went to were also free; I watched free educational videos from Apple Developer and YouTube, and listened to free podcasts; I have learned things from reading Wikipedia; and I have done free online courses in both natural languages and programming languages.

This doesn't cover everything: I did, indeed, also buy books on HTML and JS, and my first C compiler, and a licence to REALbasic[1]. But that doesn't refute the fact that I did learn a lot for free.

> If AI models require human training data, then they should pay for it. Easy.

You can do that if you like, but that won't stop any of the economic issues that arise. The cost of running Stable Diffusion is so low that even if you had literal slaves, and you were spending only the UN extreme poverty threshold on keeping them alive and housed, the pro-rata cost of keeping them alive for long enough to type in the prompt dominates the total cost of making images.

Right now these models are still, despite their impressiveness, flawed: while an artist can use them to great effect, most of us will have our generations easily spotted by some flaw we have never trained ourselves to notice. If the models become good enough to fully replace all artists, the only way the profession called "artist" isn't going to go the same way as the profession called "computer" is if the arts are to humans as fancy tails are to peacocks: the effort being the point, extravagantly wasting effort to show you're fit enough to manage fine despite the penalty.

[0] UK rules at the time, thanks to my dad's early retirement and therefore "low income" status

[1] as it was so named at the time, Xojo now

kranke155 (OP) · 2y ago

If we let the idea that "AI training data usage has no compensation for rights owners" become ensconced in the legal system, then all human endeavour will become fair game to be acquired by someone to make a Machine Intelligence out of, removing you completely from the profit loop of your own work.

This will happen in every industry and occupation, one by one.

Is this what you think is desirable?

The alternative is perversely simple: PAY for the right to use training data.

ben_w · 2y ago

> This will happen in every industry and occupation, one by one.

And still will even if they (for any value of "they") do pay.

Unless… do you want them to pay the entire future economic value that, say, all programmers including myself might have added if we weren't about to be made redundant by the next coding LLM?

Because the historical analogy there is getting Raspberry Pi to pay out the entire global GDP for each Pi Zero, on the grounds of that model being able to do arithmetic as fast as the entire world, even if the entire world had been paid to work in the obsolete job role of "computer", after having been trained to operate reliably at the speed of the current world record holder.

> Is this what you think is desirable?

Post-scarcity economics, AKA "fully automated luxury communism" (not the book of the same name): https://en.wikipedia.org/wiki/Post-scarcity

beej71 · 2y ago

I don't think this is the argument that's being made, though. They're not saying, "This is a clear cut case of piracy--pay me for that book."

They're saying, "You can't consume my book in that way."
