The basics are fundamental to many areas of science (especially if they touch probability in any way), intuitive, and mostly accessible with just a couple of handwaves.
It was done in the most horrible and unmotivating way: an A4 page or two densely covered with every Greek letter and then some, and 98% of the content was just proofs of relatively simple statements. On all tests/exams, only the proofs were tested (so for each question you either gave 1-2 pages of a single proof or a blank page, and could effectively go home as failed).
Subjectively, it was the worst set of classes during the whole 5 years (and we had some serious IT-unrelated crap because we were part of the electrical engineering faculty back then), completely mandatory, with no credit system back then to make up for it via something else. Out of 100 people in 3rd and 4th year, at that point completely focused on software engineering studies only, maybe 2-3 had a proper clue and could do the stuff out of their head.
Needless to say, most people thrown out of university failed exactly these courses, and quite a few were brilliant coders, very successful afterwards. They just couldn't be bothered with the bad approach this guy took.
It is a very important topic, but it should be taught in a sane way. This guy couldn't do it; he alienated every single student from the topic for years to come (even those few who got it all), and nobody at the school dared to challenge him and his methods.
Many seemingly "obvious" facts in information theory are tricky to show. Some examples:
(1) From the article: cross entropy is always greater than or equal to entropy, since we are coding with the wrong distribution. How do you show this? For any two probability vectors (p,q), can we say H(p,q) >= H(p)? Any proof I know involves some delicate usage of Jensen's inequality. (By the way, I feel that the notation used by the author is non-standard. H(p,q) usually stands for the joint entropy, which is quite different.)
(2) Another famous fact about entropy: conditioning never increases entropy (on average). For any two random variables X, Y, we have H(X|Y) <= H(X) and H(Y|X) <= H(Y). This is called Shannon's inequality, and the proof involves a subtle trick.
(3) You can easily show that if p=q, then KL(p||q)=0. But it is also true that if KL(p||q)=0, then p=q. The second fact is quite tricky, and used to appear as a question in Ph.D qualifying exams.
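For anyone who wants to poke at these three facts numerically before attempting the proofs, here is a minimal Python sketch. It is a sanity check on made-up distributions, not a proof; the function names and the example numbers are my own and do not come from the article. To sidestep the notation clash, the cross entropy is called cross_entropy(p, q) rather than H(p,q).

```python
import math

def entropy(p):
    """Shannon entropy in bits; outcomes with probability 0 contribute 0."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length in bits when data follows p but the code is built for q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # outcome never happens, costs nothing
        if qi == 0:
            return math.inf   # outcome happens, but the code has no word for it
        total += pi * math.log2(1.0 / qi)
    return total

def kl(p, q):
    """KL(p||q) in bits, written as cross entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

def conditional_entropy(joint):
    """H(X|Y) = H(X,Y) - H(Y), where joint[x][y] = P(X=x, Y=y)."""
    flat = [v for row in joint for v in row]
    p_y = [sum(col) for col in zip(*joint)]
    return entropy(flat) - entropy(p_y)

# Hypothetical example distributions, chosen so the numbers come out round.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# (1) cross entropy >= entropy
print(entropy(p), cross_entropy(p, q))   # 1.5 1.75

# (2) H(X|Y) <= H(X) for a correlated pair of fair bits
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
print(conditional_entropy(joint) <= entropy(p_x))   # True

# (3) KL(p||p) = 0, and KL(p||q) > 0 for this q != p
print(kl(p, p), kl(p, q))   # 0.0 0.25
```

Of course, a handful of examples only shows that the inequalities are plausible; the hard part is still the general argument.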
If you consider giving it another chance.
Note that KL divergence is not symmetric! E.g., if the true distribution of coin tosses is 100% heads and your model says 50/50, you won't mess up big, compared with when the true coin is 50/50 and your model says 100% heads (and you would have been willing to bet a LOT of money that there will be no tails in the outcome).
In this technical sense, it is preferable to be conservative rather than overly confident.
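To put numbers on the coin example, here is a small Python sketch. The helper name kl_bits and the convention of returning infinity when the model assigns zero probability to something that actually happens are my choices for illustration.

```python
import math

def kl_bits(p, q):
    """KL(p||q) in bits, where p is the true distribution and q is the model."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # outcomes that never occur cost nothing
        if qi == 0:
            return math.inf   # the model ruled out something that does occur
        total += pi * math.log2(pi / qi)
    return total

always_heads = [1.0, 0.0]   # [P(heads), P(tails)]
fair_coin = [0.5, 0.5]

# True coin is always heads, model is 50/50: you waste one bit per toss.
print(kl_bits(always_heads, fair_coin))   # 1.0

# True coin is 50/50, model is always heads: the first tail is unboundedly costly.
print(kl_bits(fair_coin, always_heads))   # inf
```

So the cautious model pays a fixed, modest price, while the overconfident one has no upper bound on its loss.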
This is useful because the Hessian of the KL divergence (if it exists) with respect to the parameters of P defines a Riemannian metric called the Fisher information metric. This provides a good distance measure that takes into account how much the information content, or entropy, changes as you move around in parameter space.
This is really useful for fast online variational Bayesian methods. Gradient descent in the Euclidean space of the parameters can be pretty lousy, but the "natural gradient", which uses the Fisher information metric, gives a more natural definition of distance.
The Fisher information can be derived more generally if the Hessian doesn't exist: it is also the variance (the covariance matrix, when there are several parameters) of the gradient of the log-likelihood.
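As a rough check of both characterizations, here is a Python sketch for the simplest case I could think of, a single Bernoulli parameter theta, where the Fisher information is known to be 1/(theta(1-theta)). The function names, the finite-difference step h, and the choice theta = 0.3 are assumptions for illustration; a real implementation would use exact or automatic derivatives.

```python
import math

def kl_bernoulli(a, b):
    """KL(Bernoulli(a) || Bernoulli(b)) in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def fisher_from_kl(theta):
    """Hessian of KL(theta || b) with respect to b, evaluated at b = theta."""
    return second_derivative(lambda b: kl_bernoulli(theta, b), theta)

def fisher_from_kl_first_argument(theta):
    """Same, but differentiating KL(a || theta) with respect to a at a = theta."""
    return second_derivative(lambda a: kl_bernoulli(a, theta), theta)

def fisher_from_score(theta):
    """Variance of d/dtheta log p(x | theta) under x ~ Bernoulli(theta)."""
    score_heads = 1.0 / theta             # derivative of log(theta)
    score_tails = -1.0 / (1.0 - theta)    # derivative of log(1 - theta)
    mean = theta * score_heads + (1 - theta) * score_tails   # zero in exact arithmetic
    return (theta * (score_heads - mean) ** 2
            + (1 - theta) * (score_tails - mean) ** 2)

theta = 0.3
print(1.0 / (theta * (1.0 - theta)))          # closed form
print(fisher_from_kl(theta))                  # approximately the same
print(fisher_from_kl_first_argument(theta))   # approximately the same
print(fisher_from_score(theta))               # the same up to rounding
```

Note that the second derivative comes out the same whichever argument of the KL divergence you vary, even though the divergence itself is not symmetric.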
This is backwards. It's a tautology.
Any function, no matter how egregiously asymmetric, is locally symmetric if it is twice differentiable at that point. This is so by construction: you are approximating it locally by the best possible quadratic [hence locally symmetric] curve [surface].
Hence, the claim about symmetry is not false, but it is vacuous. Much like the claim that the equation of a French curve is such that, no matter how you turn it, at its highest point it makes a tangent with the horizontal.
That said, the Fisher information metric does have many uses.
(Which leads to a general observation of "just throw in transparent compression instead of optimizing your data format")
EDIT: s/encryption/compression/
He also wrote the accompanying textbook, which is available for free download: http://www.inference.phy.cam.ac.uk/itprnn/book.pdf
I was really impressed by these lectures, and was dismayed to learn that he died from cancer a couple of years ago.
By the way, I find the following way to rewrite the entropy easier to understand because all quantities are positive:
sum(-p_i log(p_i)) = sum(p_i log(1/p_i)) = E[log(1/p_i)]
log(1/p_i) tells you how many bits you need to encode an event with probability p_i. The more unlikely the event, the more bits you need. The entropy is the expected number of bits you need.
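Here is a tiny Python sketch of that reading, with a made-up four-symbol distribution whose probabilities are powers of 1/2 so the bit counts come out as whole numbers. The names surprisal_bits and entropy_bits are mine.

```python
import math

def surprisal_bits(p_i):
    """Bits needed to encode an event of probability p_i: log2(1/p_i)."""
    return math.log2(1.0 / p_i)

def entropy_bits(p):
    """Expected number of bits: sum of p_i * log2(1/p_i), skipping p_i == 0."""
    return sum(p_i * surprisal_bits(p_i) for p_i in p if p_i > 0)

# Hypothetical distribution over four symbols.
p = [0.5, 0.25, 0.125, 0.125]

for p_i in p:
    print(p_i, surprisal_bits(p_i))   # 1.0, 2.0, 3.0, 3.0 bits respectively

print(entropy_bits(p))   # 1.75
```

The likely symbol gets a 1-bit code, the two rare ones get 3 bits each, and the average works out to 1.75 bits per symbol.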