We might soon get to a point where every player is using pretty much all the low-cost data there is. Everyone will use all the public internet data there is, augmented by as much private datasets as they can afford.
The improvements we can expect to see in the next few years look like a Drake equation.
LLM performance delta = data quality x data quantity x transformer architecture tweaks x compute cost x talent x time.
The ceiling for the cost parameters in this equation are determined by expected market opportunity, at the margin - how much more of the market can you capture if you have the better tech.