In the general case, yes, but they can verifiably reproduce at least some copyrighted works verbatim, which implies, at a minimum, that their content is stored in the model weights in some fashion.
Everyone knows the training data is stored in some way in the LLM. The point is that the use of the copyrighted material is transformative. Remember Google Books: it literally shows photocopies of book pages, yet the court ruled it fair use. A simplified way to put it: book vs. search engine and book vs. AI chatbot are very different comparisons.
It implies that the token-succession probabilities were distinctive enough that, given a low-entropy token stream and the proper starting tokens, you could recreate portions of the original content *strictly based on probabilities*.
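A toy sketch of that idea (purely illustrative, not any real model's architecture): train a bigram frequency table on a single passage, then greedily decode from a starting token. When the successor distribution is low-entropy, the most-probable path walks the memorized text back out verbatim. All names and the sample passage here are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical toy "model": bigram counts from one memorized passage.
passage = "call me ishmael some years ago never mind how long precisely"
tokens = passage.split()

# Count successor frequencies -- the stand-in for next-token probabilities.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def greedy_decode(start, max_len):
    # Always take the most probable successor. With a deterministic
    # (zero-entropy) chain, this reproduces the training passage exactly.
    out = [start]
    while len(out) < max_len:
        dist = counts[out[-1]]
        if not dist:
            break
        out.append(max(dist, key=dist.get))
    return " ".join(out)

# Given the proper starting token, decoding recreates the passage
# strictly from the stored probabilities.
print(greedy_decode("call", len(tokens)))
```

A real LLM is vastly more complex, but the mechanism of the argument is the same: if a training passage left a sufficiently peaked probability trail through the weights, sampling can retrace it.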