https://hcker.news/pelican-low.svg
https://hcker.news/pelican-medium.svg
https://hcker.news/pelican-high.svg
https://hcker.news/pelican-xhigh.svg
Someone needs to make a pelican arena; I have no idea whether these are considered good or not.
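For what it's worth, a "pelican arena" would presumably work like existing chatbot arenas: show people two SVGs side by side, collect votes, and rank models from the pairwise results. A minimal sketch of the standard Elo update such a site could use (all names here are illustrative, not from any real arena):

```python
# Hypothetical sketch: ranking models in a "pelican arena" from
# pairwise human votes, using a standard Elo update.

def elo_update(rating_a, rating_b, a_won, k=32):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    # Expected score for A under the Elo logistic model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Start every model at 1000; each vote nudges the ratings.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
ratings["model-x"], ratings["model-y"] = elo_update(
    ratings["model-x"], ratings["model-y"], a_won=True
)
# Two equal-rated models: the winner gains exactly k/2 = 16 points.
```

Over many votes this converges to a leaderboard, which would answer the "are these considered good?" question empirically.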
I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
It should not be treated as a serious benchmark.
Anyone can look and decide whether it's a good picture or not. The numeric benchmarks, by contrast, don't tell you much unless you're already familiar with how they're constructed.
If it is indeed a good measure of model quality (hint: it's not), then, logically, it should be taken seriously.
This is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes, whether you like it or not, Simon, that is what you are now) are all too capable of.
Despite not being a serious benchmark (how could it be serious? It's a pelican riding a bicycle!) it still turned out to have some value. You can see that just by scrolling through the archives and watching it improve as the models improved.
If your definition of doublethink is "holding two conflicting ideas in your head at once" then I would say doublethink is a necessary skill for navigating the weird AI era we find ourselves inhabiting.
Nowadays I think it's pretty silly, because there's surely SVG-drawing training data out there, and labs have likely put some deliberate effort into this specific task. It's not a showcase of emergent capabilities.
It's meta-interesting that few, if any, models actually seem to be trained on it. Same with other stereotypical challenges like the car-wash question, which even high-end models still sometimes fail.
If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
(There are some models that generate 3D models specifically, though they're closer to the image-generation family than the chatbot family.)