However, it's still an interesting observation that many architectures can arrive at the same performance (even though their training requirements differ).
Naively, you wouldn't expect eg 'x -> a * x + b' to fit the same data about as well as 'x -> a * sin x + b'. But that's an observation from low dimensions. It seems that once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
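To make that intuition a bit more concrete, here's a toy sketch (the target function, the two bases, and the parameter counts are all arbitrary choices of mine, not anything from the discussion above): fit the same noisy 1-D data with a polynomial basis and with a sinusoidal basis. With 2 parameters each the two families behave quite differently; with ~20 parameters each, both end up near the noise floor.

```python
# Toy illustration: two very different model families fit the same data
# about equally well once they have enough parameters.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
# Arbitrary smooth target plus Gaussian noise (std 0.05).
y = np.exp(-3 * x) * np.sin(8 * x) + 0.05 * rng.standard_normal(x.size)

def fit_rmse(features):
    """Least-squares fit of y on the given feature matrix; returns RMS training error."""
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    return np.sqrt(np.mean((features @ coeffs - y) ** 2))

for k in (2, 20):
    poly = np.vander(x, k, increasing=True)                              # 1, x, x^2, ...
    sine = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(k)])  # sin(pi x), sin(2 pi x), ...
    print(f"{k:>2} params: poly RMSE {fit_rmse(poly):.3f}, sine RMSE {fit_rmse(sine):.3f}")
```

With k = 2 the fits differ noticeably; with k = 20 both report roughly the same training error, despite the very different functional forms.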
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' and 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right: these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be applied independently of the model, eg augmenting your training data with noise, but others are very model dependent.
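As a minimal sketch of the model-independent kind (the function name, noise scale, and toy data here are all illustrative assumptions): jitter the inputs with Gaussian noise and keep the labels, so whatever model you fit afterwards sees extra, slightly perturbed training examples.

```python
# Model-independent data augmentation: add noisy copies of the inputs.
# Nothing here depends on which model is trained afterwards.
import numpy as np

rng = np.random.default_rng(1)

def augment_with_noise(X, y, copies=4, noise_std=0.01):
    """Return the original data plus `copies` noisy duplicates of each input."""
    X_aug = [X] + [X + noise_std * rng.standard_normal(X.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)          # labels are unchanged for each noisy copy
    return np.concatenate(X_aug), np.concatenate(y_aug)

X = rng.standard_normal((100, 3))       # toy features
y = rng.integers(0, 2, size=100)        # toy labels
X_big, y_big = augment_with_noise(X, y)
print(X_big.shape, y_big.shape)         # (500, 3) (500,)
```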
I'm not sure how the 'agentic' approaches fit here.