My understanding is that machine learning today is a lot like interpolation between examples in the dataset. The breakthrough of LLMs comes from the idea that interpolation in a 1024-dimensional space works much better than in a 2D space, which is what we'd get if we naively interpolated English letters. All the modern transformer stuff is basically an advanced interpolation method that uses a larger local neighborhood than just a few nearest examples. In a 1D analogy, it's like the Lanczos interpolation kernel. Increasing the size of the kernel won't bring any gains, because the current kernel already approximates an ideal interpolation (a full-dataset DFT) nearly perfectly.
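To make the 1D analogy concrete, here is a rough sketch of Lanczos interpolation on a sampled sine wave. The test signal, the query grid, and the kernel sizes (a = 1, 3, 8) are my own picks for illustration, not anything canonical. The point is only to compare a two-sample kernel against wider ones.

```python
import math

def sinc(x):
    # Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1.
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def lanczos_kernel(x, a):
    # Windowed sinc with support |x| < a (a = number of lobes per side).
    return sinc(x) * sinc(x / a) if abs(x) < a else 0.0

def lanczos_interpolate(samples, x, a):
    # Weighted sum over the 2a samples nearest to x (unnormalized weights).
    base = math.floor(x)
    total = 0.0
    for i in range(base - a + 1, base + a + 1):
        if 0 <= i < len(samples):
            total += samples[i] * lanczos_kernel(x - i, a)
    return total

def signal(x):
    # Illustrative test signal: 0.1 cycles per sample, well below Nyquist.
    return math.sin(2 * math.pi * 0.1 * x)

samples = [signal(i) for i in range(64)]

# Query only the interior so even the widest kernel never runs off the ends.
query_points = [16 + k * 0.125 for k in range(257)]

def max_error(a):
    return max(abs(lanczos_interpolate(samples, x, a) - signal(x))
               for x in query_points)

errors = {a: max_error(a) for a in (1, 3, 8)}
for a, err in errors.items():
    print(f"kernel size a={a}: max abs error = {err:.6f}")
```

On a smooth signal like this, the two-sample kernel (a = 1) should be visibly off, while the kernels with a few lobes are already close to the true curve. How much a wider kernel still helps depends on the signal, so treat this as an illustration of the analogy, not a proof of it.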
However, interpolation isn't reasoning. If we want to understand the motion of planets, we would start with a dataset of (x, y, z, t) coordinates and try to derive the law of motion. Imagine if someone simply interpolated the dataset and presented the law of gravity as an array of a million coefficients (aka weights). Our minds have to work with a very small working memory that can hardly fit 10 coefficients. This constraint forces us to develop intelligence that compacts the entire dataset into one small differential equation. Btw, English grammar is in a lot of ways the differential equation of English: it gives the local rules for valid trajectories of words, which we call sentences.
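A toy sketch of what "compacting the dataset" could look like: generate a synthetic 2D orbit, then recover a single coefficient of a differential equation from the positions alone. Big caveat: the sketch assumes we already guessed the inverse-square form a = -mu * r / |r|^3, which is of course the hard part. The orbit, the units (GM = 1), and all the function names are made up for illustration.

```python
import math

MU_TRUE = 1.0  # assumed gravitational parameter GM, arbitrary units

def acceleration(x, y, mu):
    # Inverse-square attraction toward the origin.
    r3 = (x * x + y * y) ** 1.5
    return -mu * x / r3, -mu * y / r3

def simulate_orbit(steps, dt, record_every):
    # Velocity Verlet integration of an eccentric bound orbit.
    x, y, vx, vy = 1.0, 0.0, 0.0, 0.8
    ax, ay = acceleration(x, y, MU_TRUE)
    data = [(x, y)]
    for step in range(1, steps + 1):
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        ax, ay = acceleration(x, y, MU_TRUE)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        if step % record_every == 0:
            data.append((x, y))
    return data

def fit_mu(data, dt):
    # Estimate accelerations by central differences, then solve the
    # one-parameter least-squares problem a ~= mu * g, with g = -r / |r|^3.
    num = den = 0.0
    for (x0, y0), (x1, y1), (x2, y2) in zip(data, data[1:], data[2:]):
        ax = (x2 - 2 * x1 + x0) / dt ** 2
        ay = (y2 - 2 * y1 + y0) / dt ** 2
        gx, gy = acceleration(x1, y1, 1.0)  # model with mu = 1
        num += ax * gx + ay * gy
        den += gx * gx + gy * gy
    return num / den

SIM_DT = 0.001
RECORD_EVERY = 10
dataset = simulate_orbit(steps=5000, dt=SIM_DT, record_every=RECORD_EVERY)
mu_est = fit_mu(dataset, SIM_DT * RECORD_EVERY)

print(f"{len(dataset)} observations compressed into one coefficient: "
      f"mu = {mu_est:.4f}")
```

The whole trajectory collapses into one number plus a one-line equation, which is the contrast with a table of interpolation weights. Discovering the form of the equation is what this sketch doesn't do.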