
0 points · jabron · 7mo ago · 0 comments

What do you mean by "bounding boxes"? They were talking about captions and embeddings, so a vision-language model is required.

1 comment · 1 top-level

I suggested YOLO and other non-LLM vision models as a much faster alternative.

Otherwise, of course, CLIP would be the other option besides a big vision-language LLM.
