> I just don't think enough attention has been paid to the data, and too much the model.
I wholly agree. Everyone is blinded by models - GPT4 this, LLaMA2 that - but the real source of the smarts is in the dataset. Why would any model, no matter how its architecture is tweaked, learn about the same ability from the same data? Why would humans be all able to learn the same skills when every brain is quite different. It was the data, not the model
And since we are exhausting all the available quality text online we need to start engineering new data with LLMs and validation systems. AIs need to introspect more into their training sets, not just train to reproduce them, but analyse, summarise and comment on them. We reflect on our information, AIs should do more reflection before learning.
More fundamentally, how are AIs going to evolve past human level unless they make their own data or they collect data from external systems?