Give the model new images that are not in the training set (e.g. photos not on the internet, or photos taken after the model was trained), ask the same questions, and see how well it does!
The paper says: “Table 16. [snip] The prompt requires image understanding.”
I think the explanations (in OpenAI's paper, for the images) are probably misinformation or misdirection. I would guess it is recognising the images from its training data and associating them with nearby text.
It's almost certainly a VQ-VAE-style encoding of the image itself into a sequence of tokens, as was done by DALL-E 1, CM3, Gato, and a whole bunch of more recent models. It's the obvious thing to do, and their context window is more than large enough now.
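For what it's worth, the quantization step at the heart of that approach is simple. Here is a toy numpy sketch of mapping image-patch embeddings to discrete token ids via a codebook — the codebook is random and the sizes (512 entries, 64 dims, 256 patches) are made-up for illustration; a real VQ-VAE learns the encoder and codebook end-to-end:

```python
import numpy as np

# Hypothetical codebook: a real VQ-VAE learns this jointly with its encoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # 512 possible tokens, 64-dim embeddings

def image_to_tokens(patches):
    """Quantize each patch embedding to its nearest codebook entry.

    patches: (n_patches, 64) array of encoder outputs.
    Returns an (n_patches,) array of integer token ids, which can then
    be interleaved with text tokens in a language model's context.
    """
    # Squared distance from every patch to every codebook vector
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# A fake 16x16-patch "image" already passed through an encoder
fake_patches = rng.normal(size=(256, 64))
tokens = image_to_tokens(fake_patches)
print(tokens.shape)  # the whole image is now just 256 integer tokens
```

The point is just that once the image is a short sequence of integers, the transformer treats it no differently from text, so image understanding needs no separate machinery.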