I am by no means dismissing the power of these models. They are created very chaotically, however: spaghetti thrown at a wall. They are brute-force approximations.
They are wasteful. If LLaMA 13B is as powerful as previous 65B models, that's a significant number of unnecessary parameters shed in this one iterative upgrade alone. How small can they go? The fewest parameters that get the job done 99% as well is the way to go.
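To put a rough number on that, here is a back-of-the-envelope sketch. It takes the 13B and 65B figures above at face value; the function name and the 2 bytes per parameter (fp16 weights) are my own assumptions for illustration, not measurements.

```python
def param_savings(small_b: float, large_b: float, bytes_per_param: int = 2) -> dict:
    """Compare two parameter counts given in billions.

    bytes_per_param=2 assumes fp16 weights (an assumption, not a measurement).
    """
    # Fraction of the larger model's parameters that the smaller one does without.
    fraction_saved = 1 - small_b / large_b
    # Billions of parameters times bytes per parameter gives decimal gigabytes.
    gb_small = small_b * bytes_per_param
    gb_large = large_b * bytes_per_param
    return {
        "fraction_saved": fraction_saved,
        "gb_small": gb_small,
        "gb_large": gb_large,
    }


result = param_savings(13, 65)
print(f"parameters saved: {result['fraction_saved']:.0%}")
print(f"weights: {result['gb_small']} GB vs {result['gb_large']} GB")
```

If the premise holds, that is 80% of the parameters gone in one step, with the weights alone dropping from about 130 GB to about 26 GB under the fp16 assumption.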
There is also the difference between compressing the rules and use of language directly into the model and compressing all the information known to humans into it. A smaller model that ingests relevant information on the fly (more like Bing, which supplements itself with search) may be less wasteful and perform better.
The current models being released are chosen because "they work", not because they are the leanest or the best optimized for performance.