If you were solely doing reverse image search (submit an image, generate embeddings, run a vector search), then yes.
This is LLaVA -> text output -> sentence embedding -> (RAG-ish) search: the query text goes through the same sentence embedding model and gets matched against the embedded LLaVA output.
You could skip the LLaVA step and use CLIP/BLIP-ish caption output -> sentence embedding, but pure caption/classification model text output is pretty terrible by comparison. It's not only inaccurate; it also carries little to no semantic context and is extremely short, so the sentence embedding model gets poor-quality input and doesn't have much to go on even when the caption/classification is decently accurate.
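For anyone who wants the shape of the LLaVA -> text -> sentence embedding -> search flow in code, here is a rough sketch of just the retrieval half. Big assumptions: the descriptions are hard-coded stand-ins for whatever LLaVA would output, and `toy_embed` is a bag-of-words placeholder, not a real sentence embedding model (you would swap in your actual model there). All names (`toy_embed`, `build_index`, `search`, the file names) are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Placeholder for a real sentence embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(descriptions, embed=toy_embed):
    """Embed each image's text description once, up front."""
    return [(image_id, embed(text)) for image_id, text in descriptions.items()]


def search(query, index, embed=toy_embed, top_k=3):
    """Embed the query with the same model and rank images by similarity."""
    q = embed(query)
    scored = sorted(((cosine(q, vec), image_id) for image_id, vec in index), reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Hypothetical stand-ins for LLaVA text output, keyed by image file.
descriptions = {
    "img_001.jpg": "A golden retriever catching a frisbee on a sunny beach, waves in the background.",
    "img_002.jpg": "A crowded subway platform at night with commuters waiting for a train.",
    "img_003.jpg": "A plate of pasta with tomato sauce and basil on a wooden table.",
}

index = build_index(descriptions)
results = search("dog playing on the beach", index)
for image_id, score in results:
    print(f"{image_id}\t{score:.3f}")
```

Note the toy embedder only matches on shared words ("dog" won't match "golden retriever"), which is exactly the gap a real sentence embedding model is there to close. The structure stays the same either way: embed the descriptions once, embed the query with the same model, rank by similarity.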