Many seemingly "obvious" facts in information theory are tricky to show. Some examples:
(1) From the article: Cross entropy is always greater than or equal to Entropy since we are coding the wrong distribution. How do you show this? For any two probability vectors (p,q), can we say H(p) >= H(p,q)? Any proof I know involves some delicate usage of Jensen's inequality. (By the way, I feel that the notation used by the author is non-standard. H(p,q) usually stands for the joint entropy, which is quite different.)
(2) Another famous fact about entropy : conditioning always reduces entropy - for any two random variables X, Y, we have H(X|Y) <= H(X) and H(Y|X) <= H(Y). This is called Shannon's inequality, and the proof involves a subtle trick.
(3) You can easily show that if p=q, then KL(p||q)=0. But it is also true that if KL(p||q)=0, then p=q. The second fact is quite tricky, and used to appear as a question in Ph.D qualifying exams.