Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:
https://simonwillison.net/tags/pelican-riding-a-bicycle/
But they're still not very good.
The models that accept image input do tend to do better though, which I assume is because training on images as well as text gives them better "spatial awareness".
I use the term vLLMs, or vision LLMs, to describe LLMs that are multimodal for image and text input. I still don't have a great name for the ones that can also accept audio.
The pelican test requires SVG output because asking a multimodal output model like Gemini Flash Image (aka Nano Banana) to create an image is a different test entirely.
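The test itself is just a prompt. Here's a minimal sketch of running it against a single model, using the OpenAI Python client as an example; the model name and the SVG extraction step are illustrative assumptions on my part, not the exact harness behind the results linked above:

```python
# Minimal sketch of the pelican test: prompt a model for SVG code,
# pull the <svg> element out of the response, and save it to disk.
# Assumes the OpenAI Python client; the model name is just an example.
import re
from openai import OpenAI

PROMPT = "Generate an SVG of a pelican riding a bicycle"

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whatever text-output model you're testing
    messages=[{"role": "user", "content": PROMPT}],
)
text = response.choices[0].message.content or ""

# Models often wrap the SVG in markdown fences or surrounding prose,
# so extract just the <svg>...</svg> element before saving it.
match = re.search(r"<svg[\s\S]*?</svg>", text)
if match:
    with open("pelican.svg", "w") as f:
        f.write(match.group(0))
    print("Saved pelican.svg, open it in a browser to judge the result.")
else:
    print("No SVG element found in the response:")
    print(text)
```

Since the model only ever sees and emits text, the quality of the rendered image is a test of how well it can reason about shapes and coordinates it will never see, which is what makes the benchmark interesting.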