0 points
jabron
4mo ago
0 comments
What do you mean "bounding boxes"? They were talking about captions and embeddings, so a vision language model is required.
Glemkloksdjf
4mo ago
I suggested YOLO and other non-LLM vision models as a much faster alternative.
Of course, CLIP would be the other option if you want embeddings without running a big vision-language LLM.
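
For anyone following along, here is a rough sketch of what "search by embeddings" amounts to once the vectors exist. It assumes some CLIP-style model has already produced one vector per image and one for the text query; the 3-d vectors, the file names, and the helper names `cosine` and `rank_images` are made up for illustration and do not come from any real model or library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_images(query_vec, image_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real image embeddings (hypothetical values).
image_vecs = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}

# Hypothetical embedding of a text query, in the same vector space.
query_vec = [1.0, 0.2, 0.0]

ranking = rank_images(query_vec, image_vecs)
best = ranking[0][0]
```

A detector like YOLO gives you labels and boxes instead of vectors, so it is faster but only helps when the things you want to search for are among its fixed classes.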