
sebzim4500 · 2y ago

Couldn't you use the same argument to reach the absurd conclusion that the 7zip source code contains the vast majority of Harry Potter?

A decent control would be to compare it to similar prose that you know for a fact is not in the training data (e.g. because it was written afterwards).


aimor · 2y ago

I think the same argument would have to compare 7zip's compression to some other compression algorithm. Then we can say things like "7zip is a better/worse model of human writing". And that's probably a better way to talk about this as well.

You're right that a better baseline could be made using books not in the training set, to understand how much the model is learning prose in general and how much it is learning a specific book.
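
To make the comparison concrete, here is a rough sketch of the kind of measurement I mean: compress the same text with a few general-purpose compressors and report bits per character, so each one can be read as a crude "model of human writing". Python's `lzma` module stands in for 7zip here (7zip's default method is LZMA-based, but this is not the 7zip tool itself), and the sample strings are hypothetical placeholders, not real book text. For the control, you would swap in a book from the training set and similar prose written afterwards.

```python
import bz2
import lzma
import zlib

# General-purpose compressors from the standard library.
# lzma is only a stand-in for 7zip, not the 7zip program itself.
COMPRESSORS = {
    "lzma": lzma.compress,
    "zlib": zlib.compress,
    "bz2": bz2.compress,
}


def bits_per_char(text: str, compress) -> float:
    """Compressed size in bits divided by the number of characters."""
    if not text:
        raise ValueError("text must be non-empty")
    data = text.encode("utf-8")
    return 8 * len(compress(data)) / len(text)


def compare(texts: dict[str, str]) -> dict[str, dict[str, float]]:
    """Bits per character for every (text, compressor) pair."""
    return {
        name: {c: bits_per_char(t, fn) for c, fn in COMPRESSORS.items()}
        for name, t in texts.items()
    }


# Hypothetical placeholder samples; replace with real passages.
samples = {
    "repetitive": "the quick brown fox jumps over the lazy dog. " * 200,
    "short": "A single short sentence compresses poorly on its own.",
}

results = compare(samples)
for name, scores in results.items():
    for compressor, bpc in scores.items():
        print(f"{name:>10} {compressor:>5} {bpc:6.3f} bits/char")
```

Lower bits per character means the compressor predicts the text better. Short inputs are dominated by container overhead, so the comparison only means something on passages of similar, reasonably large length.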
