I think the same argument would have to compare 7zip's compression against some other compression algorithm's. Then we could say things like "7zip is a better/worse model of human writing" than that algorithm. That's probably a better way to frame this as well.
You're right that a better baseline could be built from books that aren't in the training set, to separate how much the model is learning prose in general from how much it is learning a specific book.
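For what it's worth, here is a rough sketch of the kind of comparison I mean: run the same text through a few general-purpose compressors and report bits per character, so each one can be read as a (crude) model of the text. Everything here is my own assumption for illustration: the names `bits_per_char` and `compare`, the placeholder sample string, and the use of Python's `lzma` module as a stand-in for 7zip (as far as I know 7zip's default method is LZMA-based, but this is not the 7zip tool itself).

```python
import bz2
import lzma
import zlib

# Hypothetical set of compressors to compare; lzma is only a stand-in for 7zip.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data, preset=9),
}


def bits_per_char(text, compress):
    """Compressed size in bits divided by the number of characters in text."""
    if not text:
        raise ValueError("text must be non-empty")
    compressed = compress(text.encode("utf-8"))
    return 8 * len(compressed) / len(text)


def compare(text):
    """Return {compressor name: bits per character} for the given text."""
    return {name: bits_per_char(text, fn) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    # Placeholder sample, far more repetitive than real prose.
    # Swap in the full text of a book for a meaningful number.
    sample = "It was the best of times, it was the worst of times. " * 200
    for name, bpc in sorted(compare(sample).items(), key=lambda kv: kv[1]):
        print(f"{name}: {bpc:.3f} bits/char")
```

Lower bits per character means a better model of that text. To get at the baseline question, you would run this on books the language model has never seen and compare its per-character loss (converted to bits) against these numbers.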