Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:
https://simonwillison.net/tags/pelican-riding-a-bicycle/
But they're still not very good.
The models that accept image input do tend to do better though, which I assume is because training on images as well as text gives them better "spatial awareness".
I use the term vLLMs, or vision LLMs, to describe LLMs that are multimodal for image and text input. I still don't have a great name for the ones that can also accept audio.
The pelican test requires SVG output because asking a multimodal output model like Gemini Flash Image (aka Nano Banana) to create an image is a different test entirely.
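The test itself is just a prompt. Here's a minimal sketch of running it against a single model, using the OpenAI Python client as an example; the model name and the SVG extraction step are illustrative assumptions on my part, not the exact harness behind the results linked above:

```python
# Minimal sketch of the pelican test: prompt a model for SVG code,
# pull the <svg> element out of the response, and save it to disk.
# Assumes the OpenAI Python client; the model name is just an example.
import re
from openai import OpenAI

PROMPT = "Generate an SVG of a pelican riding a bicycle"

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whatever text-output model you're testing
    messages=[{"role": "user", "content": PROMPT}],
)
text = response.choices[0].message.content or ""

# Models often wrap the SVG in markdown fences or surrounding prose,
# so extract just the <svg>...</svg> element before saving it.
match = re.search(r"<svg[\s\S]*?</svg>", text)
if match:
    with open("pelican.svg", "w") as f:
        f.write(match.group(0))
    print("Saved pelican.svg, open it in a browser to judge the result.")
else:
    print("No SVG element found in the response:")
    print(text)
```

Since the model only ever sees and emits text, the quality of the rendered image is a test of how well it can reason about shapes and coordinates it will never see, which is what makes the benchmark interesting.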