Better HN
0 points
jabron
7mo ago
What do you mean "bounding boxes"? They were talking about captions and embeddings, so a vision language model is required.
Glemkloksdjf
7mo ago
I suggested YOLO and other non-VLM models as a much faster alternative.
Of course, CLIP would be the other option if you don't want a big vision-language LLM.