moomin 1y ago

A good way of thinking about this is: consider the case where the data in question is illegal. Could you get into trouble for not only having access to it but also making copies of it?

There’s plenty of case law there…

lsaferite 1y ago

I would argue that an individual, real person obtaining content without a license and personally consuming it is significantly different from a corporation doing the same. My rationale is that distribution of that content is (or should be) the primary offense. If I work for a company and they direct me to collect a bunch of content without a license, and I then pass that to other members of my team to train a model, I've now distributed that content at the direction of my employer. That should be the offense the company is tried for.

Using content to train an LLM is not copying the content. I'm ignoring the silly "but actually" arguments about the content being in RAM so it's "copying". It's using the content to generate a statistical model of token (word-ish) relationships and probabilities. If you write content that is sufficiently original in its wording and I train an LLM against it, then there is certainly the possibility that the LLM could be provoked to recall the exact words you used. You'd have to set the parameters just right to make it happen, and I think that proper training would drastically reduce, if not eliminate, that possibility. But even if it doesn't, the LLM doesn't have a copy of that original content. All it has is weights representing those relationship probabilities. Yes, the minutiae are more complex, but that is the essence. If my LLM were to generate enough of this unique content essentially verbatim and I tried to publish or copyright it, then I as the user should be on the hook. But then you get into a discussion about how many words in a unique sequence it takes to be infringement.
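To make the "relationship probabilities, not a copy" point concrete, here's a toy sketch. Big caveat: it's a bigram counter, which is nowhere near a real LLM (no neural network, no learned weights in the usual sense), and the function names and sample text are made up purely for illustration. It only shows the shape of the argument: the model stores next-token counts, yet a sufficiently unique sequence can still come back out word for word.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def greedy_generate(counts, start, max_tokens=20):
    """Always pick the most frequent follower (loosely like temperature 0)."""
    out = [start]
    for _ in range(max_tokens - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


# Hypothetical training text: ordinary phrasing plus one very unusual sentence.
common = "the cat sat on the mat and the dog sat on the rug"
unique = "zyxx quorble fnarp wibbit glorp"
model = train_bigram(common + " " + unique)

# The model holds only per-token follower counts, never the sentence itself...
print(dict(model["zyxx"]))
# ...but the unusual sentence has exactly one continuation at every step.
print(greedy_generate(model, "zyxx"))
```

In this toy, the common words have several possible followers, so there is no single path back to the source text, while the unusual sentence is reproduced exactly because nothing else in the training data competes with it.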

Obviously, I am not a lawyer.

My takeaway from all of this is that new laws are needed to handle this stuff. The existing ones are non-definitive and/or ill-suited enough that every party is forming strong opinions about how old laws apply to new situations, and that is causing massive friction.
