
espadrine · 2y ago

There’s a theoretical, but impractical, way: for a given model, each possible set of weight/bias values yields a specific loss value when run against the full corpus. At least one set of weight values minimizes that loss, and from that minimum the idealized bits-per-byte entropy can be computed.
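To make that last step concrete, here is a minimal sketch of turning a loss value into bits per byte. It assumes the loss is a mean cross-entropy in nats per token, which is a common convention but not something stated above; the function name and the numbers are made up for illustration.

```python
import math


def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean cross-entropy loss (nats per token) into bits per byte.

    Assumption: the loss is averaged over n_tokens tokens, and those tokens
    encode exactly n_bytes bytes of text.
    """
    if n_tokens <= 0 or n_bytes <= 0:
        raise ValueError("n_tokens and n_bytes must be positive")
    # Total code length in nats, converted to bits by dividing by ln(2).
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes


if __name__ == "__main__":
    # Hypothetical values, not measurements from any real model.
    print(bits_per_byte(loss_nats_per_token=2.0, n_tokens=1_000, n_bytes=4_000))
```

The tokens-to-bytes ratio matters here: the same per-token loss gives a different bits-per-byte figure under a different tokenizer.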

That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power law for data vs. compute. In my mind, that highlights more the imprecision of the approach than the information-theoretic content of language semantics: an omniscient being would predict things better, so the closest thing to true entropy should be computed from the list of matching text prefixes among all texts ever.
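The "matching text prefixes" idea can be sketched on a toy scale. This is only an illustration under strong assumptions: the corpus is a small byte string rather than all texts ever, the prefix is truncated to the last `k` bytes, and the function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def prefix_entropy_bits_per_byte(corpus: bytes, k: int) -> float:
    """Empirical conditional entropy H(next byte | previous k bytes), in bits per byte.

    For every position, the k preceding bytes are looked up among all
    matching prefixes in the corpus, and the observed continuations
    define the predictive distribution.
    """
    total = len(corpus) - k
    if k < 0 or total <= 0:
        raise ValueError("k must be non-negative and smaller than the corpus length")

    # Map each k-byte context to a count of the bytes that follow it.
    continuations = defaultdict(Counter)
    for i in range(k, len(corpus)):
        continuations[corpus[i - k:i]][corpus[i]] += 1

    entropy = 0.0
    for counts in continuations.values():
        n = sum(counts.values())
        for c in counts.values():
            # p(context, next) * -log2 p(next | context)
            entropy -= (c / total) * math.log2(c / n)
    return entropy
```

On a finite corpus, a long enough prefix matches only itself, so the estimate collapses toward zero. That is an artifact of the data running out, not a sign that the text is perfectly predictable.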

[0]: https://arxiv.org/pdf/2001.08361


XCSme · 2y ago

Thanks for the explanation!

> should be computed from the list of matching text prefixes among all texts ever

I initially thought that value (the number of possible things you can say) was pretty low, but it's probably infinite, even though in practice we don't say that many different things and use only a very limited subset of the words in the dictionary.
