Before we had proper GPUs, everyone said the same thing about neural networks.
Current model architectures are optimized to get the most out of GPUs, which is a big part of why transformers dominate: they're mostly large dense matrix multiplies.
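To put a number on "mostly matmuls", here's a rough back-of-the-envelope FLOP count for a single transformer block. The sizes are made up for illustration, and softmax, norms, and residual adds are elementwise and negligible next to the matmuls counted here:

    # Rough FLOPs per transformer block (hypothetical sizes).
    L, d = 4096, 4096            # sequence length, model width
    ffn = 4 * d                  # MLP hidden size

    flops = {
        "QKV projections":      3 * 2 * L * d * d,
        "attention (QK^T, @V)": 2 * 2 * L * L * d,
        "output projection":    2 * L * d * d,
        "MLP (two matmuls)":    2 * 2 * L * d * ffn,
    }

    total = sum(flops.values())
    for name, f in flops.items():
        print(f"{name:24s} {100 * f / total:5.1f}%")
    # Every line item above is a dense matrix multiply.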
There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.
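The low-rank trick in one picture: factor a big projection into two skinny matrices, so you store and multiply far fewer numbers and can cache the small latent instead of the full activations. This is a toy sketch of the general idea, not MLA itself, and the sizes are made up:

    import numpy as np

    d_model, d_rank = 4096, 512            # hypothetical sizes

    # Full-rank projection: d_model * d_model parameters.
    W = np.random.randn(d_model, d_model)

    # Low-rank factorization W ~= A @ B with inner dimension d_rank.
    A = np.random.randn(d_model, d_rank)
    B = np.random.randn(d_rank, d_model)

    x = np.random.randn(d_model)
    y_full = W @ x                         # one big dense multiply
    latent = B @ x                         # small latent you could cache instead of full K/V
    y_low = A @ latent

    print("full-rank params:", W.size)           # 16,777,216
    print("low-rank params: ", A.size + B.size)  # 4,194,304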
The lottery ticket hypothesis suggests that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but right now there's no advantage to doing so, because on GPUs you still end up doing dense multiplies anyway.
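Concretely: even if you prune 90% of a weight matrix lottery-ticket style, the kernel still does the full dense multiply unless the sparsity has hardware-friendly structure. A toy sketch with made-up sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 1024))
    x = rng.standard_normal(1024)

    # Keep only the 10% largest-magnitude weights (lottery-ticket-style mask).
    threshold = np.quantile(np.abs(W), 0.9)
    W_sparse = W * (np.abs(W) >= threshold)

    print("nonzero fraction:", (W_sparse != 0).mean())   # ~0.10

    # Still stored and multiplied as a dense matrix: every zero gets
    # multiplied and accumulated like any other entry, so no speedup.
    y = W_sparse @ x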
Plenty of mech interp work shows that models are forced to commingle different concepts to fit them into the "low"-dimensional vector space. (https://www.neelnanda.io/mechanistic-interpretability/glossa...)
https://arxiv.org/abs/2210.06313
https://arxiv.org/abs/2305.01610
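That commingling (superposition) in toy form: you can pack many more "concept" directions than dimensions if you accept some interference between them. Made-up sizes; see the links above for the real arguments:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_concepts = 64, 512               # 512 concept directions in a 64-dim space

    V = rng.standard_normal((n_concepts, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    overlaps = V @ V.T
    np.fill_diagonal(overlaps, 0.0)
    print("mean |cosine| between distinct concepts:", np.abs(overlaps).mean())
    print("max  |cosine| between distinct concepts:", np.abs(overlaps).max())
    # Mostly near-orthogonal, but never exactly: that residual interference
    # is the commingling mech interp keeps turning up.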