In this case it is actually relevant. Drawing a pelican on a bicycle correctly depends a great deal not only on understanding what both look like in general, but also on the spatial relationships between the various objects and their parts. Models that can draw this kind of thing better also tend to be better at tasks that require understanding of how things fit together and interact in 3D space.
How do we know it's not just a mashup of existing pictures? All the generated pelicans on bikes look somewhat cartoonish, with historical or artsy bicycles. This is training material from 2015:
There are other such images. Not an image model? How do we know they don't convert all images to SVG and train an LLM on them? How do we know they don't cheat on this benchmark by routing the query to an image model first?
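To make the "not an image model" point concrete: the benchmark works because the LLM emits SVG source as plain text, not pixels. A minimal sketch of what that output looks like (shapes and coordinates are illustrative, not from any real model):

```python
def minimal_pelican_on_bicycle() -> str:
    """Return a crude SVG: two wheels, a frame line, and a bird-ish blob.

    An LLM answering the benchmark prompt produces markup of exactly this
    kind, token by token, which is why spatial coherence (wheels on the
    ground, bird above the frame) is the hard part.
    """
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">',
        '<circle cx="50" cy="90" r="25" fill="none" stroke="black"/>',   # rear wheel
        '<circle cx="150" cy="90" r="25" fill="none" stroke="black"/>',  # front wheel
        '<line x1="50" y1="90" x2="150" y2="90" stroke="black"/>',       # frame
        '<ellipse cx="100" cy="55" rx="20" ry="12" fill="white" stroke="black"/>',  # body
        '<polygon points="120,50 140,48 120,58" fill="orange"/>',        # beak
        '</svg>',
    ]
    return "\n".join(parts)

print(minimal_pelican_on_bicycle())
```

Because the output is just text, the "route to an image model" theory would require an extra image-to-SVG vectorization step, which is part of why people treat the raw SVG output as evidence it came from the language model itself.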