Suppose I take a billion images of all the coffee cups in the world, each at a fixed set of angles on the cup, and then build an associative (i.e., frequency) statistical model of their pixels (i.e., statistical AI). Consider generating one pixel at a time, in sequence, through the image. My associative model tells me P(color of next pixel | all previous pixels).
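To make "associative (frequency) model" concrete, here is a minimal sketch in Python. It is only an illustration under assumptions of my own: images are flat lists of color values, and the context is cut down to the last `order` pixels rather than all previous ones, so that the counts fit in a table. The names `fit`, `next_pixel_probs` and `generate` are made up for this example.

```python
import random
from collections import Counter, defaultdict


def fit(images, order=2):
    """Count how often each color follows each context of `order` previous pixels.

    Assumption: truncating the context to `order` pixels stands in for
    conditioning on "all previous" pixels.
    """
    counts = defaultdict(Counter)
    for img in images:
        padded = [None] * order + list(img)  # None marks "before the image starts"
        for i in range(order, len(padded)):
            context = tuple(padded[i - order:i])
            counts[context][padded[i]] += 1
    return counts


def next_pixel_probs(counts, context):
    """Return P(color of next pixel | context) as relative frequencies."""
    seen = counts.get(tuple(context))
    if not seen:
        return {}  # context never observed: the model has nothing to say
    total = sum(seen.values())
    return {color: n / total for color, n in seen.items()}


def generate(counts, length, order=2, rng=random):
    """Sample an image one pixel at a time from the frequency table."""
    out = [None] * order
    for _ in range(length):
        probs = next_pixel_probs(counts, out[-order:])
        if not probs:
            break  # stepped outside anything seen in the data
        colors, weights = zip(*probs.items())
        out.append(rng.choices(colors, weights=weights)[0])
    return out[order:]
```

Everything `generate` produces is stitched together from pixel contexts that occurred in the training images; a context that never occurred simply has no entry in the table.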
Now I can generate coffee cup images similar to any variation or combination of the images in the dataset. You might say, "well, you can only do that if you have a model of a coffee cup" (rather than of pixels). If so, just try to generate a coffee cup at one of the angles not in the dataset. This will not happen, because the model has not been provided with enough information to do so.
Namely, the model does not know the distance from the camera, the camera lens parameters, the angle to the coffee cup, etc. So there's literally a very, very large infinity of possible objects at unseen angles. Consider that, underneath a coffee cup, the bottom might be missing entirely, and so on.
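A rough worked example of why the pixels alone underdetermine the object, assuming an ideal pinhole camera (my simplification, not something stated above): an object's height on the sensor is focal length × object height / distance. The function name and all the numbers below are hypothetical.

```python
def projected_height_mm(object_height_mm, distance_mm, focal_length_mm):
    """Height of the object's image on the sensor under an ideal pinhole model."""
    return focal_length_mm * object_height_mm / distance_mm


# Three different physical situations (hypothetical numbers):
small_cup_near = projected_height_mm(100, 1000, 50)    # 10 cm cup, 1 m away, 50 mm lens
big_cup_far = projected_height_mm(200, 2000, 50)       # 20 cm cup, 2 m away, 50 mm lens
small_cup_zoomed = projected_height_mm(100, 2000, 100) # 10 cm cup, 2 m away, 100 mm lens

# All three land on the same number of millimetres on the sensor.
print(small_cup_near, big_cup_far, small_cup_zoomed)
```

The same image height is produced by different combinations of size, distance and lens, so a table of pixel frequencies cannot tell them apart.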
Now it will appear to know all of these things, because it's just generating images with these same parameters (camera, angle, distance, etc.). But as soon as you want "a coffee cup further away than has been seen before", or "a coffee cup shot with a macro lens", etc., the whole thing will fall over.
It is you, the viewer, who attributes 3D knowledge to the model, because under ordinary circumstances the cause of a photo is the features of a 3D environment.