
kkielhofner · 2y ago

I second this. CLIP, BLIP, etc. on their own are lightweight but pretty dumb for captioning in the grand scheme of things.

CLIP is reasonable for reverse image search via embeddings, but many of the models in this class don't work very well for captioning because they're trained on COCO, etc., and they're pretty generic.

bootsmann · 2y ago

But this specific use case extracts an embedding from the caption, which is where CLIP would skip a lot of overhead by going from the image to the embedding directly.

kkielhofner (OP) · 2y ago

If you were solely doing reverse image search (submit image, generate embeddings, vector search), yes.

This is LLaVA -> text output -> sentence embedding -> (RAG style-ish) search on sentence embedding output based on query input text (back through the sentence embedding).

You could skip the LLaVA step and use CLIP/BLIP-ish caption output -> sentence embedding, but pure caption/classification model text output is pretty terrible by comparison. Not only is it inaccurate, it carries little to no semantic context and is extremely short, so the sentence embedding models get poor-quality input and don't have much to go on even when the caption/classification is decently accurate.
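A rough sketch of the flow described above (image -> text -> sentence embedding -> search on query text), using only the standard library. Everything here is a stand-in: `embed_text` is a toy bag-of-words vector rather than a real sentence embedding model, and the `descriptions` dict is hypothetical text of the kind LLaVA might output, not real model output.

```python
import math
from collections import Counter

def embed_text(text):
    """Toy stand-in for a sentence embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical per-image descriptions, standing in for LLaVA text output.
descriptions = {
    "img_001.jpg": "a brown dog catching a frisbee in a sunny park",
    "img_002.jpg": "two people cooking pasta in a small kitchen",
    "img_003.jpg": "a red bicycle leaning against a brick wall",
}

# Index step: embed each description once and keep the vector.
index = {name: embed_text(text) for name, text in descriptions.items()}

def search(query, k=1):
    """Embed the query with the same function and rank images by similarity."""
    q = embed_text(query)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]

print(search("dog playing in the park"))
```

The point of the sketch is just the shape: the query goes back through the same text embedding as the descriptions, so richer descriptions give the search more to match on.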

GaggiX · 2y ago

CLIP does not generate captions; it's simply an encoder. The image and text encoders are aligned, so you don't need to generate a caption: you simply encode the image and later retrieve it using the vector created by the text encoder (the query).
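To illustrate the retrieval step, here is a minimal sketch assuming both encoders map into one shared vector space. The vectors are made up for illustration; they are not real CLIP outputs, and the file names and `retrieve` helper are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for image-encoder output, stored at index time.
image_vectors = {
    "beach.jpg": normalize([0.9, 0.1, 0.2]),
    "city.jpg": normalize([0.1, 0.8, 0.3]),
    "forest.jpg": normalize([0.2, 0.2, 0.9]),
}

def retrieve(query_vector, k=1):
    """Rank stored image vectors against a query vector from the text encoder."""
    q = normalize(query_vector)
    ranked = sorted(image_vectors, key=lambda name: dot(q, image_vectors[name]), reverse=True)
    return ranked[:k]

# Made-up vector standing in for text-encoder output for some query string.
query_vector = [0.2, 0.9, 0.1]
print(retrieve(query_vector))
```

No caption is generated anywhere in this flow; the only text involved is the query.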
