https://hcker.news/pelican-low.svg
https://hcker.news/pelican-medium.svg
https://hcker.news/pelican-high.svg
https://hcker.news/pelican-xhigh.svg
Someone needs to make a pelican arena; I have no idea whether these are considered good or not.
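For what it's worth, a "pelican arena" would presumably work like existing chatbot arenas: show people two SVGs side by side, collect votes, and rank models from the pairwise results. A minimal sketch of the standard Elo update such a site could use (all names here are illustrative, not from any real arena):

```python
# Hypothetical sketch: ranking models in a "pelican arena" from
# pairwise human votes, using a standard Elo update.

def elo_update(rating_a, rating_b, a_won, k=32):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    # Expected score for A under the Elo logistic model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Start every model at 1000; each vote nudges the ratings.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
ratings["model-x"], ratings["model-y"] = elo_update(
    ratings["model-x"], ratings["model-y"], a_won=True
)
# Two equal-rated models: the winner gains exactly k/2 = 16 points.
```

Over many votes this converges to a leaderboard, which would answer the "are these considered good?" question empirically.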
I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
It should not be treated as a serious benchmark.
Anyone can look and decide whether it's a good picture or not. The numeric benchmarks, by contrast, don't tell you much unless you're already familiar with how they're constructed.
If it is indeed a good measure of model quality (hint: it's not), then, logically, it should be taken seriously.
This is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes, whether you like it or not, Simon, that is what you are now) are all too capable of.
Despite not being a serious benchmark (how could it be serious? It's a pelican riding a bicycle!) it still turned out to have some value. You can see that just by scrolling through the archives and watching it improve as the models improved.
If your definition of doublethink is "holding two conflicting ideas in your head at once" then I would say doublethink is a necessary skill for navigating the weird AI era we find ourselves inhabiting.
Nowadays I think it's pretty silly, because there's surely SVG-drawing training data out there, and labs have likely put some deliberate effort into this specific task. It's not a showcase of emergent capabilities.
It's meta-interesting that few, if any, models actually seem to be trained on it. Same with other stereotypical challenges like the car-wash question, which even high-end models still sometimes fail.
If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
(There are some models that generate 3D models specifically, though they're closer to the image-generation family than the chatbot family.)