gspr 17d ago

> I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.

It doesn't need to for my argument to make sense. It's a problem if it reproduces even a single copyrighted work (near-)verbatim, and we have plenty of examples of that.

NewsaHackO 17d ago

Do we? Even when people attempt to jailbreak most models with thousands of prompts, they are only able to get a paragraph or two of well-known copyrighted works and some blocks of paraphrased text, and that's with a substantially leading prompt.

gspr (OP) 17d ago

It surely doesn't matter how leading or contorted the prompt has to be if it shows that the model is encoding the copyrighted work verbatim, or nearly so.

NewsaHackO 16d ago

It definitely does, which is why I specified a substantial amount of verbatim material. If someone can recite the first paragraph of Harry Potter and the Sorcerer's Stone from memory, it surely doesn't mean they have memorized the entire book.
