They're interpretable in much the same way CNNs are interpretable - and that's no coincidence.
For CNNs, we know very well how the early layers work - edge detectors, curve detectors, etc. That understanding decays the deeper you go into the model. In the brain, V1/V2 are similarly well studied, but our understanding breaks down deeper into the visual cortex - and the sheer architectural complexity there sure doesn't help.
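To make the "early layers are edge detectors" point concrete, here's a minimal sketch (assuming PyTorch, torchvision, and matplotlib are available - none of these are named in the original) of pulling the first-layer filters out of a pretrained ResNet-18 and plotting them. The first-layer kernels are small enough to look at directly, and most of them are visibly oriented edge or color-contrast detectors; deeper layers don't yield to this kind of direct inspection.

```python
import torch
import torchvision.models as models
import matplotlib.pyplot as plt

# Any pretrained image model works here; ResNet-18 is just a convenient example.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# First conv layer: 64 filters of shape 3x7x7 (RGB channels x height x width).
filters = model.conv1.weight.detach().clone()

# Rescale to [0, 1] so each filter can be displayed as an RGB image.
f_min, f_max = filters.min(), filters.max()
filters = (filters - f_min) / (f_max - f_min)

# Show all 64 filters in an 8x8 grid.
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, kernel in zip(axes.flat, filters):
    ax.imshow(kernel.permute(1, 2, 0))  # CHW -> HWC for imshow
    ax.axis("off")
plt.show()
```

This only gets you the first layer, of course - for anything deeper you're back to indirect tools (activation maximization, probing, circuit analysis), which is exactly where the understanding starts to decay.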