What happens is that you get very spiky activations, so-called "outlier" activations. An easy-to-read paper on this is SmoothQuant [0]. Another source, from Anthropic and the mechanistic interpretability people, calls these a "privileged basis" [1].
Now, given the weight symmetries of a typical transformer, these don't actually need to exist. Weight symmetries are the ways you can change the weights without affecting the mathematical function the network computes; there is a broad class of these because the linear algebra has a lot of redundancy in it.
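A toy sketch of one such symmetry (my own illustration, not from the papers): for two stacked linear maps, you can insert any orthogonal rotation of the hidden basis and the overall function is unchanged, so no particular coordinate direction is special a priori:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d))  # first linear layer
W2 = rng.normal(size=(d, d))  # second linear layer
x = rng.normal(size=d)

# Random orthogonal matrix Q via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Rotate the hidden basis: W1 -> Q @ W1, W2 -> W2 @ Q.T
W1_rot = Q @ W1
W2_rot = W2 @ Q.T

# The composed function is identical, even though every
# individual hidden coordinate is now totally different
y = W2 @ (W1 @ x)
y_rot = W2_rot @ (W1_rot @ x)
assert np.allclose(y, y_rot)
```

In a real transformer the elementwise nonlinearities and LayerNorm constrain which rotations are allowed, but large families of symmetries survive, which is why outlier coordinates aren't forced by the architecture.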
But the behaviour of the Adam optimizer is such that you do end up with these things, because it optimizes more quickly in directions that produce them. This comes from the fact that it applies an elementwise dynamic learning rate (and probably partly from the epsilon term).
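To make the "elementwise dynamic learning rate" point concrete, here is a minimal sketch of a single Adam update (standard formulas, my own toy numbers): each coordinate's gradient is normalized by its own running RMS, so coordinates with tiny gradients take steps about as large as coordinates with huge gradients, which singles out individual basis directions in a way plain SGD would not:

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Elementwise first- and second-moment estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias correction
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Per-coordinate step: gradient divided by its own RMS,
    # i.e. an effective learning rate per parameter
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return step, m, v

# Two coordinates with gradients four orders of magnitude apart...
g = np.array([1e-4, 1.0])
m = np.zeros(2)
v = np.zeros(2)
step, m, v = adam_step(g, m, v, t=1)
# ...receive nearly equal-sized steps (both roughly lr)
```

Note the epsilon: once a coordinate's second moment shrinks toward eps**2, the update saturates rather than scaling up further, which is the part that plausibly interacts with how extreme coordinates evolve.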
[0] https://arxiv.org/pdf/2211.10438

[1] https://transformer-circuits.pub/2023/privileged-basis/index...