You’re confusing several different ideas here.
The idea you’re talking about is called “the bitter lesson.” It (very basically) says that a model with more compute behind it will outperform a cleverer, hand-engineered method that uses less compute. It has nothing to do with being “generic.” It’s also worth noting that, afaik, it’s an accurate observation, not a law or a fact. It may not hold forever.
Either way, I’m not arguing against that. I’m saying that LLMs are too general to be useful in specific, specialized domains.
Sure, bigger generic models perform (increasingly marginally) better on the benchmarks we’ve cooked up, but they’re still too general to be that useful in any specific context. That’s the entire reason RAG exists in the first place.
I’m saying that a language model trained on a specific domain will perform better at tasks in that domain than a similarly sized model (in terms of compute) trained on a lot of different, unrelated text.
For instance, a model trained specifically on code will produce better code than a similarly sized model trained on all available text.
I really hope that example makes what I’m saying self-evident.