
0 points · GaggiX · 2y ago · 0 comments

CLIP does not generate captions; it's simply an encoder. The image and text encoders are aligned, so you don't need to generate a caption: you encode the image once and later retrieve it using the vector created by the text encoder (the query).
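A minimal sketch of that retrieval step, assuming the embeddings have already been computed. The three-dimensional vectors, the file names, and the `search` helper are made-up placeholders for illustration; real CLIP embeddings have hundreds of dimensions and would come from the model's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, image_index, top_k=1):
    """Return the names of the top_k images most similar to the query vector."""
    ranked = sorted(
        image_index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Hypothetical image embeddings, computed once at indexing time
# (stand-ins for the output of the image encoder).
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "invoice.png": [0.1, 0.9, 0.1],
}

# Hypothetical query embedding (stand-in for the text encoder's
# output for a query such as "a yellow cat").
query_vec = [0.8, 0.2, 0.1]

results = search(query_vec, image_index, top_k=2)
print(results)
```

No caption is stored or generated anywhere: the only thing kept per image is its vector, and the query is compared against those vectors directly.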


2 comments · 1 top-level

kkielhofner · 2y ago · 1 in thread

I'm using CLIP here generically to refer to the families of models that generate captions by leveraging CLIP as the encoder, of which there are plenty on "The Hub".

Have you actually done the approach I think you're suggesting for anything more complex than "this is a yellow cat"? Not trying to be snarky, genuinely curious. I've done a few of these projects and this approach never comes close to meeting user expectations in the real world.

GaggiX (OP) · 2y ago

Do you have an example of a query that should fail when using the CLIP embeddings directly but works with the method described in the article?
