> I get the regularization part, but don't you get essentially the same regularization from using a sparse autoencoder? If the encoder realizes it doesn't have much information, it will turn on few units.
Yes: putting a sparsity loss on z in a regular AE will encourage the code to have smaller magnitudes, and with ReLU units many of them will tend to end up at exactly zero.
But the original point was that even a single continuous unit can, in principle, be used to transmit an arbitrary amount of information. That doesn't really happen in practice, because the encoder and decoder would need access to something like modulo to do the most obvious kinds of cheating. The point is that, from an information theory point of view, you can't really talk about how much information a continuous variable transmits unless you are transmitting it over a noisy channel and can measure entropies of distributions. (And indeed you can formally derive how a given KL loss bounds the information transmitted by z: averaged over the data, the KL term is an upper bound on the mutual information between x and z.)
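To make the "cheating" concrete, here is a minimal sketch of packing several integers into one real number and reading them back with floor and remainder operations. The `pack` and `unpack` helpers, the base of 256, and the noise scale are all illustrative assumptions, not anything a trained autoencoder is known to learn.

```python
import math
import random

def pack(values, base=256):
    """Pack integers in [0, base) into a single float in [0, 1)."""
    code = 0.0
    for v in reversed(values):
        code = (code + v) / base
    return code

def unpack(code, n, base=256):
    """Recover n integers by repeatedly shifting and taking the floor."""
    digits = []
    for _ in range(n):
        code *= base
        d = math.floor(code)
        digits.append(d)
        code -= d
    return digits

message = [12, 200, 7, 91]
code = pack(message)

# Noiseless channel: the one scalar carries the whole message.
clean = unpack(code, len(message))

# Noisy channel (assumed Gaussian noise, std 1e-4): low-order digits are lost.
noise = random.Random(0).gauss(0.0, 1e-4)
noisy = unpack(code + noise, len(message))

print(clean == message, noisy == message)
```

Without noise, nothing limits how many digits you can stuff into the scalar other than floating-point precision. Once noise is added, only the high-order digits survive, which is what makes "how much information does z carry" a well-defined question.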
> What I don't really intuit is: is it just basically doing regularization, or is the interpretation in terms of learning to infer the posterior meaningful?
Both, which I think is really nice. You can look at it either way.
The Bayesian interpretation is powerful because you now have a principled way to estimate p(x) (through the variational lower bound on log p(x)), which you didn't have before. You can also introduce multiple latent variables in your network (as long as no layers take inputs from both ordinary and sampling layers), so you have some flexibility to do limited forms of graphical modelling that supports efficient forward inference and GPU acceleration. And the inference machinery can be trained via cheap backpropagation instead of expensive sampling.
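As a rough illustration of the p(x) point, here is a sketch on a toy model where the exact answer is known. The model is my own assumption, chosen only because it is solvable by hand: prior z ~ N(0, 1) and likelihood x | z ~ N(z, 1), so the marginal is x ~ N(0, 2) and the true posterior is N(x/2, 1/2). The function names are hypothetical. The estimate uses reparameterized samples z = mu + sigma * eps, which is what lets gradients flow through the sampling step in a real network.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (LOG_2PI + math.log(var) + (x - mean) ** 2 / var)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) )."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - 2.0 * math.log(sigma))

def elbo_estimate(x, mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo ELBO: E_q[log p(x|z)] - KL(q || prior)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps              # reparameterized sample from q
        total += log_normal(x, z, 1.0)    # log p(x | z) for the toy likelihood
    return total / n_samples - kl_to_standard_normal(mu, sigma)

def exact_log_px(x):
    """Exact log marginal for the toy model: x ~ N(0, 2)."""
    return log_normal(x, 0.0, 2.0)

x = 1.5
exact = exact_log_px(x)
tight = elbo_estimate(x, mu=x / 2.0, sigma=math.sqrt(0.5))  # q is the true posterior
loose = elbo_estimate(x, mu=0.0, sigma=1.0)                 # q is just the prior
print(exact, tight, loose)
```

When q matches the true posterior the bound is tight up to sampling error, and a worse q gives a strictly lower value. In a VAE the encoder is trained to push this bound up, which is the sense in which it learns to infer the posterior.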