However, I am making no such argument. I am explaining that statistical models of pixel frequencies cannot model the causes of those frequencies. I am illustrating this point with an example, not proving it.
If you want more detail about the reason it cannot: when the back of a coffee cup looks like the front, you can generate the back. But you cannot generate the bottom (assuming the bottom doesn't occur in the dataset). Why? Because the pixel distributions for the bottom of a cup have zero information about the rest of it, and the model has no information about the bottom.
If you want a "proof" you'd need at least some familiarity with applied mathematics and the like:
Say the RGB value of each pixel, X, in photos of coffee cups arises from a data-generating process parameterized on: distance from camera, lens focal length, angle to cup, lighting conditions, etc. Now produce a model of such causes, call it Environment(distance, angle, cup albedo, ...).
Then show that X ~ Environment | fixed-parameters induces a frequency distribution of pixels, f1(next|previous) = P(Xi...n|Xj...n); then show that any variation in a fixed parameter induces a completely different distribution, say f2, f3, f4, ... Now check that the covariance for most pairs of fs shows that any given f is almost zero-informative about any other f.
Having done this, compare with a non-statistical (e.g., video game) model of Environment where parameters are varied, and show that all frames, say v, generated by the video game do have high covariance over the time of their sampling. The video game model covaries with most of f1..fn; the associative statistical model covaries only with f1, or a very small number of others.
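To make the comparison above concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: `render` stands in for Environment with a single parameter (angle), the "cup" is one scanline with a sinusoidal stripe, and the second test angle is deliberately chosen a quarter-turn out of phase. It only shows what the comparison looks like in one rigged case; it is not evidence about real image models.

```python
import numpy as np

WIDTH = 64  # pixels in one toy scanline (arbitrary choice)

def render(angle, rng=None, noise=0.05):
    """Hypothetical Environment(angle): scanline of a striped cup seen at `angle` radians."""
    x = np.linspace(0.0, 2.0 * np.pi, WIDTH, endpoint=False)
    line = 0.5 + 0.5 * np.sin(x + angle)
    if rng is not None:
        # Sensor noise, only added when a random generator is supplied.
        line = line + rng.normal(0.0, noise, WIDTH)
    return line

def fit_frequency_model(angle, n_samples, rng):
    """Associative model: per-pixel average over noisy photos taken at one fixed angle."""
    samples = np.stack([render(angle, rng) for _ in range(n_samples)])
    return samples.mean(axis=0)

def correlation(a, b):
    """Pearson correlation between two scanlines."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
f1 = fit_frequency_model(0.0, 200, rng)  # statistics frozen at the training angle

results = {}
for label, angle in [("same_angle", 0.0), ("quarter_turn", np.pi / 2)]:
    truth = render(angle)  # noise-free view at the queried angle
    results[label] = {
        # Frozen pixel statistics cannot be re-queried at a new angle.
        "statistical": correlation(f1, truth),
        # A model that takes the cause (angle) as input can simply re-render.
        "parameterized": correlation(render(angle), truth),
    }

for label, scores in results.items():
    print(label, scores)
```

In this toy, the frozen statistics track the view they were collected from and are close to uncorrelated with the quarter-turn view, while the parameterized renderer matches both by construction. How far real pixel distributions resemble this case is exactly the point under dispute.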
There's something very obvious about this if you understand how these statistical AI systems work: in cases where variations in the environment induce radically different distributions, the AI will fail; in cases where the distributions are close enough, it will appear to succeed.
The marketability of generative AI comes from rigging the use cases to situations where we don't need to change the environment, i.e., you aren't exposed to the fact that when you generated a photo you could not have got the same one "at a different distance".
If a video game were built this way it would be unplayable: imagine that every time you move the camera, all the objects randomly change their apparent orientation, distance, style, etc.
Yes, LLMs are constrained in what output they can generate based on their training data. Just as we humans are constrained in the output we can generate. When we talk about things we don't understand we speak gibberish, just like LLMs.
That isn't the exact same constraint. We could speculate that the moon had a "dark side," because we understood what a moon was, and what a sphere was. LLMs cannot speculate about things outside of their existing data model, at all.
>When we talk about things we don't understand we speak gibberish, just like LLMs.
No we don't, wtf? We may create inaccurate models or theories, but we don't just chain together random strings of words the way LLMs do.