What is the Kullback-Leibler divergence?

(saru.science)

105 points · rgbimbochamp · 7y ago · 32 comments

beagle3 7y ago · 6 in thread

I wish information theory was part of math/cs/engineering curriculum in more places.

The basics are fundamental to many areas of science (especially if they touch probability in any way), intuitive, and mostly accessible with just a couple of handwaves.

saiya-jin 7y ago

We had it at our university, actually in quite some depth. It was taught by the head of the IT department at our faculty, a long-retired guy who was supposedly brilliant as a theoretical scientist and had a high reputation all over Europe in his field.

It was done in the most horrible and unmotivating way: an A4 page or two densely covered with every Greek letter and then some, and 98% of the content was just proofs of relatively simple statements. On all tests/exams, only the proofs were tested (so you either gave 1-2 pages of a single proof per question, or handed in a blank page and could effectively go home as failed).

Subjectively it was the worst set of classes during the whole 5 years (and we had some serious IT-unrelated crap because we were part of the electro-engineering faculty back then), completely mandatory, with no credit system back then to make up for it via something else. Out of 100 people in 3rd and 4th year, at that point completely focused on software engineering studies only, maybe 2-3 had a proper clue and could do the stuff off the top of their head.

Needless to say, most people thrown out of university failed exactly these courses, and quite a few were brilliant coders, very successful afterwards. They just couldn't be bothered with the bad approach this guy took.

It is a very important topic, but it should be taught in a sane way. This guy couldn't do it; it alienated every single student from the topic for years to come (even those few who got it all), and nobody at school dared to challenge him and his methods.

sn41 7y ago

I actually sympathize with the theoretician (disclaimer: I work in information-theoretic areas). Information theory is easy to motivate at a first cut, but if you want to really understand it, then there are some hairy issues. There is a lot of slip between the cup and the lip when it comes to information theory (Shannon himself made several serious errors in his original 1948 paper which took decades to fully work out).

Many seemingly "obvious" facts in information theory are tricky to show. Some examples:

(1) From the article: cross entropy is always greater than or equal to entropy, since we are coding for the wrong distribution. How do you show this? For any two probability vectors (p,q), can we say H(p,q) >= H(p)? Any proof I know involves some delicate usage of Jensen's inequality. (By the way, I feel that the notation used by the author is non-standard. H(p,q) usually stands for the joint entropy, which is quite different.)

(2) Another famous fact about entropy : conditioning always reduces entropy - for any two random variables X, Y, we have H(X|Y) <= H(X) and H(Y|X) <= H(Y). This is called Shannon's inequality, and the proof involves a subtle trick.

(3) You can easily show that if p=q, then KL(p||q)=0. But it is also true that if KL(p||q)=0, then p=q. The second fact is quite tricky, and used to appear as a question in Ph.D qualifying exams.
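A quick numerical sanity check of the three facts above, as a minimal sketch in plain Python. It is not a proof, it only samples random distributions, and the helper names (`entropy`, `cross_entropy`, `kl`, `conditional_entropy`, `random_dist`) are made up for illustration. Here `cross_entropy(p, q)` is the article's H(p,q), not the joint entropy.

```python
import math
import random

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    """Expected code length in bits when data follow p but the code is built for q.
    Assumes q > 0 wherever p > 0."""
    return -sum(x * math.log2(y) for x, y in zip(p, q) if x > 0)

def kl(p, q):
    """KL(p||q) in bits, written as cross entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

def conditional_entropy(joint):
    """H(X|Y) = H(X,Y) - H(Y), where joint[x][y] = P(X=x, Y=y)."""
    h_xy = entropy([v for row in joint for v in row])
    p_y = [sum(col) for col in zip(*joint)]
    return h_xy - entropy(p_y)

def random_dist(n, rng):
    """A random probability vector of length n (normalised exponential draws)."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    p, q = random_dist(4, rng), random_dist(4, rng)
    flat = random_dist(12, rng)
    joint = [flat[i * 4:(i + 1) * 4] for i in range(3)]  # 3 x 4 joint table
    p_x = [sum(row) for row in joint]
    if cross_entropy(p, q) < entropy(p) - 1e-12:          # fact (1)
        violations += 1
    if conditional_entropy(joint) > entropy(p_x) + 1e-12:  # fact (2)
        violations += 1
    if abs(kl(p, p)) > 1e-12 or (p != q and kl(p, q) <= 0):  # fact (3)
        violations += 1

print("violations found:", violations)
```

The small tolerances only absorb floating-point rounding; the inequalities themselves are exact.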

beagle3 7y ago

If you are math inclined, Cover & Thomas' "Elements of Information Theory" is a very clear and readable introduction (more for coding, less for channels, but great overall).

If you consider giving it another chance.

duality 7y ago

I agree. Information theory is _the_ ur-science. It's the science of matching mathematical models to data, and so it's where inference meets deduction. Such an important and fascinating subject should be widely taught and appreciated.

JoeAltmaier 7y ago

Well, next to math. And logic. And ethics.

SilasX 7y ago

I saw a signature on slashdot: "Information theory is life. The rest is just the KL divergence."

ssivark 7y ago · 4 in thread

To summarize succinctly, KL(q||p) quantifies how badly you screw up if the true distribution is “q” and you instead think it is “p”.

Note that KL divergence is not symmetric! E.g., if the true distribution of coin tosses is 100% heads and your model has 50/50, you won't mess up big, compared with when the true coin is 50/50 and your model is 100% heads (and you would have been willing to bet a LOT of money that there will be no tails in the outcome).

In this technical sense, it is preferable to be conservative than overly confident.
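The coin example can be worked through directly. A minimal sketch, assuming distributions are given as [P(heads), P(tails)] and that a term with true probability 0 contributes nothing; `kl_bits` is a hypothetical helper name.

```python
import math

def kl_bits(true, model):
    """KL(true || model) in bits for two discrete distributions."""
    total = 0.0
    for t, m in zip(true, model):
        if t == 0:
            continue          # outcomes that never happen cost nothing
        if m == 0:
            return math.inf   # the model ruled out something that does happen
        total += t * math.log2(t / m)
    return total

always_heads = [1.0, 0.0]
fair = [0.5, 0.5]

print(kl_bits(always_heads, fair))  # 1.0  (truth is 100% heads, model says 50/50)
print(kl_bits(fair, always_heads))  # inf  (truth is 50/50, model says 100% heads)
```

The hedged model pays one extra bit per toss; the overconfident model pays an unbounded penalty the first time a tail shows up.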

cultus 7y ago

As a side note, KL divergence is actually symmetric to the second order: if you have a distribution "p" parameterized by x, the divergence of p|x_0 from p|x_1 at a nearby parameter value is approximately symmetric.

This is useful because the Hessian of the KL divergence (if it exists) with respect to the parameters of p defines a Riemannian metric called the Fisher information metric. This provides a good distance measure that takes into account how much the information content, or entropy, changes as you move around in parameter space.

This is really useful for fast online variational Bayesian methods. Gradient descent in the Euclidean space of the parameters can be pretty lousy, but the "natural gradient", which uses the Fisher information metric, gives a more natural definition of distance.

The Fisher information can be derived more generally if the Hessian doesn't exist: it is also the variance of the gradient of the log-likelihood.
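To make this concrete, here is a small sketch for the simplest case I could think of, a Bernoulli distribution with parameter theta (my choice of example, not something from the article). It computes the Fisher information as the variance of the score and compares both directions of the KL divergence with the quadratic approximation 0.5 * I(theta) * eps^2. The function names are made up for illustration.

```python
import math

def kl_bernoulli(a, b):
    """KL(Bernoulli(a) || Bernoulli(b)) in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information as the variance of the score d/dtheta log p(x | theta)."""
    score = {1: 1 / theta, 0: -1 / (1 - theta)}
    prob = {1: theta, 0: 1 - theta}
    mean = sum(prob[x] * score[x] for x in (0, 1))  # zero up to rounding
    return sum(prob[x] * (score[x] - mean) ** 2 for x in (0, 1))

theta, eps = 0.3, 1e-3
forward = kl_bernoulli(theta, theta + eps)
backward = kl_bernoulli(theta + eps, theta)
quadratic = 0.5 * fisher_bernoulli(theta) * eps ** 2

print(forward, backward, quadratic)
```

For a small step the three numbers should agree to within a fraction of a percent, and the two KL directions drift apart as eps grows.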

srean 7y ago

> As a side note, KL divergence is actually symmetric to the second order:

This is backwards. It's a tautology.

Any function no matter how egregiously asymmetric is locally symmetric if it is twice differentiable at that point. This is so by construction, you are approximating it locally by the best possible quadratic [hence locally symmetric] curve [surface].

Hence, the claim about symmetry is not false, but it is vacuous. Much like the claim that the equation of a French curve is such that, no matter how you turn it, at its highest point it is tangent to the horizontal.

That said, the Fisher information metric does have many uses.

shmageggy 7y ago

Very interesting and useful. Got any good references for further reading?

ASpring 7y ago

Good point on KL divergence not being symmetric. For anyone who wants to quantify KL divergence in a way that can be used as a metric, look into Jensen-Shannon distance which is based on KL-divergence.
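A minimal sketch of how that looks, in plain Python: the Jensen-Shannon divergence averages the KL divergence of each distribution from their midpoint, and its square root is the distance. The function names are hypothetical, and base-2 logs are my choice so that the result lies between 0 and 1.

```python
import math

def kl_bits(p, q):
    """KL(p||q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint; positive wherever p or q is
    jsd = 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)
    return math.sqrt(max(jsd, 0.0))          # clamp tiny negative rounding errors

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(js_distance(p, q), js_distance(q, p))   # same value in both directions
print(js_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0 for disjoint distributions
```

Because each KL term is taken against the midpoint, it stays finite even when one distribution assigns zero probability to an outcome the other allows.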

caiocaiocaio 7y ago · 1 in thread

Lovely article, but grey-on-white and a small, thin display font meant I had to go into developer tools to be able to read it without getting a headache.

h2onock 7y ago

It was nice and clear on my phone

doombolt 7y ago · 1 in thread

I have a hunch that space engineers have suddenly invented Huffman coding.

(Which leads to a general observation of "just throw in transparent compression instead of optimizing your data format")

EDIT: s/encryption/compression/

cryptonector 7y ago

I don't know why you got downvoted with no explanation. My observation is the same: this is just about measuring the efficiency of one's Huffman tables given actual probability distribution after the fact.

Patient0 7y ago

I've recently discovered this excellent lecture series by David MacKay available on YouTube: https://youtu.be/y5VdtQSqiAI

He also wrote the accompanying textbook, which is available for free download: http://www.inference.phy.cam.ac.uk/itprnn/book.pdf

I was really impressed by these lectures, and was dismayed to learn that he died from cancer a couple of years ago.

atrudeau 7y ago

Shannon's dissertation is a great introduction (:p) to entropy. https://dspace.mit.edu/handle/1721.1/11173

cryptonector 7y ago

This divergence feels a lot like making a Huffman encoding table given a prediction of probability distribution then measuring how efficient that turns out to be by comparison to a Huffman encoding table based on the probability distribution you get from the real data after the fact.
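That intuition can be checked on a toy case. The sketch below builds Huffman code lengths from a predicted distribution and from the real one, then compares the expected bits per symbol under the real distribution. The symbols and probabilities are invented for illustration, and I picked powers of 1/2 on purpose: Huffman code lengths are whole bits, so the overhead equals the KL divergence exactly only for such dyadic distributions and is an approximation otherwise.

```python
import heapq
import math
from itertools import count

def huffman_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code built from probs."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # every merge adds one bit to the symbols below it
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return lengths

def expected_length(true_probs, lengths):
    """Average bits per symbol when the data actually follow true_probs."""
    return sum(p * lengths[s] for s, p in true_probs.items())

real = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
predicted = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

best = expected_length(real, huffman_lengths(real))         # 1.75 bits
actual = expected_length(real, huffman_lengths(predicted))  # 2.0 bits
kl = sum(p * math.log2(p / predicted[s]) for s, p in real.items())

print(actual - best, kl)  # 0.25 0.25
```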

jules 7y ago

The KL divergence is also called relative entropy. Unlike the ordinary entropy, relative entropy is invariant under parameter transformations. The maximum relative entropy principle generalises Bayesian inference. The distribution relative to which you're computing the entropy plays the role of the prior.

By the way, I find the following way to rewrite the entropy easier to understand because all quantities are positive:

sum(-p_i log(p_i)) = sum(p_i log(1/p_i)) = E[log(1/p_i)]

log(1/p_i) tells you how many bits you need to encode an event with probability p_i. The more unlikely the event, the more bits you need. The entropy is the expected number of bits you need.
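A tiny sketch of that reading, with base-2 logs so the unit is bits; the example distribution and the function names are my own, chosen so the numbers come out round.

```python
import math

def surprisal_bits(p):
    """log2(1/p): bits needed to encode an event of probability p."""
    return math.log2(1 / p)

def entropy_bits(probs):
    """Expected surprisal, i.e. sum of p_i * log2(1/p_i)."""
    return sum(p * surprisal_bits(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]
print([surprisal_bits(p) for p in probs])  # [1.0, 2.0, 3.0, 3.0]
print(entropy_bits(probs))                 # 1.75
```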

derEitel 7y ago

Great, intuitive explanations with a nice mix of code and formulas. Only I found the GIFs to be very annoying while reading, especially as they do not add to the content.
