The size of the embedding space (the number of vector dimensions) is therefore larger than what is needed just to represent word meanings: it must also be large enough to hold the information added by these layer-wise transformations.
The way I think of these transformations (happy to be corrected) is that they add information rather than modify what is already there. Conceptually, the embedding starts as a word embedding, then perhaps gets augmented with part-of-speech information, then with additional syntactic/parsing information, and then with semantic information, being incrementally enriched as it is "transformed" by successive layers.
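This "adding rather than overwriting" intuition matches how residual connections work in a transformer: each layer computes a delta that is summed onto the running representation. Here is a minimal toy sketch of that idea (the layer function, dimensions, and random weights are all hypothetical stand-ins, not an actual transformer implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical embedding width

def residual_layer(x, W):
    # A stand-in for an attention/MLP sublayer: compute a small delta
    # and ADD it to the incoming vector, so earlier information
    # (e.g. the original word embedding) is preserved, not replaced.
    delta = np.tanh(x @ W)
    return x + delta

word_embedding = rng.normal(size=d_model)
x = word_embedding.copy()
for _ in range(3):  # three toy "layers" enriching the representation
    W = rng.normal(size=(d_model, d_model)) * 0.1
    x = residual_layer(x, W)

# The enriched vector still points in roughly the same direction as the
# original word embedding, because each layer only added information.
cos_sim = x @ word_embedding / (np.linalg.norm(x) * np.linalg.norm(word_embedding))
print(cos_sim > 0)
```

Because each layer's output is `x + delta` rather than a fresh vector, the original word-level signal survives all the way up, which is consistent with the picture of successive layers layering on part-of-speech, syntactic, and semantic information.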