muyuu · 3y ago · 0 points · 0 comments

Some are, some aren't. See Koala, for instance. The problem with Koala is that it fine-tunes on open-source data but makes no claims about the data behind the base LLaMA models. https://bair.berkeley.edu/blog/2023/04/03/koala/

The irony is that OpenAI and Meta themselves might be on shaky ground for having trained models on other people's data, in many instances with dubious rights to do so, and then using those models to produce output commercially.

But this is a new frontier, and enforcement might be effectively impossible unless new legislation requires reproducibility and audits of the data sets, or something like that.

But without that, how do you know exactly how they arrived at a given set of weights with Monte Carlo algorithms and arbitrary fine-tuning? You basically don't know what was there and you cannot prove they didn't achieve those results with perfectly clean data.

PS: https://medium.com/geekculture/list-of-open-sourced-fine-tun...

3 comments · 1 top-level

Arelius · 3y ago · 2 in thread

> You basically don't know what was there and you cannot prove they didn't achieve those results with perfectly clean data.

I mean, you totally do though, right? You just need one instance of the LLM reproducing information that it would only have been able to reproduce by violating copyright.

I mean, it's theoretically possible that it could have reproduced it from scratch, infinite monkeys on typewriters sort of thing, but statistically we can rule that out pretty quickly.

Adding on to this, I don't think OpenAI, Google, and others will ultimately argue that they don't violate copyright. Instead, they will argue that their violation is sufficiently transformative that it constitutes fair use.

muyuu (OP) · 3y ago

Not only is it theoretically possible, it happens, and it can already be observed in clean lab experiments.

With commonly used parameters, the fact that an LLM produces copyrighted information is no proof that it was trained on exactly that information, especially when parameters are set so that it doesn't repeat outputs.

Arelius · 3y ago

I think you'll find that it is in fact proof by all practical standards we use outside of formal mathematics.
