There's very little difference between a contextual predictive model like this and the guts of a compressor.
If your prediction is good enough that you can always come up with two possible predictions for each character, each of which has a 50% chance of being correct, then obviously you can compress your input down to one bit per character by storing just enough information to tell you which choice to pick. More generally, you can use arithmetic coding to do the same thing with an arbitrary set of letter probabilities, which is exactly what you get as the output of a neural network.
When the blog post says the model achieved a performance of "1.57 bits per character", that's just another way of saying "if we used the neural network as a compressor, this is how well it would perform."