I work on such a tool[0] to enable end-to-end indexing of users' personal photos, and recently added functionality to index Google Photos too!
But I resolved that problem up to a point by adding a linear layer trained to discard such frames, which was less costly than running a bigger variant for my use case.
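For anyone curious what a filter like that can look like, here is a minimal sketch, not the actual code from the tool. It assumes the linear layer sits on top of fixed embedding vectors and is trained as a binary keep/discard classifier; the 4-dimensional toy embeddings, the cluster centers, and the names `train_linear_filter` and `should_discard` are all made up for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_filter(embeddings, labels, epochs=200, lr=0.5):
    """Fit a single linear layer (weights + bias) with logistic loss via SGD.

    labels: 1 means "discard this frame", 0 means "keep it".
    """
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the logistic loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def should_discard(x, w, b, threshold=0.5):
    # One dot product per frame, which is why this is cheap at inference time.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold


# Toy stand-ins for real embeddings (assumption: two noisy clusters).
random.seed(0)
KEEP_CENTER = [1.0, 0.0, 0.0, 0.0]
DISCARD_CENTER = [0.0, 1.0, 0.0, 0.0]


def toy_embedding(center):
    return [c + random.gauss(0, 0.1) for c in center]


train_x = [toy_embedding(KEEP_CENTER) for _ in range(20)]
train_x += [toy_embedding(DISCARD_CENTER) for _ in range(20)]
train_y = [0] * 20 + [1] * 20

w, b = train_linear_filter(train_x, train_y)
```

The appeal is the cost: the big model runs once to produce the embedding either way, and the extra layer adds only a dot product per frame.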
I'm not super familiar with how the results for the "giant SaaS providers" are, but the demo instance of Immich doesn't seem to do it very well.
Example query for "airplane": https://demo.immich.app/search?q=airplane&clip=true
Even the fourth result seems to rank higher than photos of actual airplanes, and most of the results aren't actually airplanes at all.
Again, not sure how that compares with other providers, but on Google Photos (as one example I am familiar with), searching for "airplane" shows me photos taken of airplanes, or photos taken from the inside of an airplane. Even Lego airplanes seem to show up correctly, and none of the photos are incorrectly shown as far as I can tell.
Have a look at MUSIQ and NIMA.
https://github.com/google-research/google-research/tree/mast...
https://blog.research.google/2022/10/musiq-assessing-image-a...
CLIP is reasonable for reverse image search via embeddings, but many of the models in this class don't work very well for captioning because they're trained on COCO, etc., and they're pretty generic.
I don't even mind some training of "are these the same or not".
That's one of the conveniences that means I'm still using google photos...
The downside to that approach is the LLM can't tell whether the cat is standing in front of the child, or sitting on the child, or the child is holding the cat; the input just tells it there's a child, and a cat, and their bounding boxes overlap.
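To make that limitation concrete, here is a rough sketch of what such a detector-to-text pipeline could hand to the LLM. The `Detection` class, the coordinates, and the prompt wording are hypothetical, not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def boxes_overlap(a, b):
    # Axis-aligned boxes overlap when they intersect on both axes.
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def detections_to_prompt(dets):
    """Turn detector output into the text lines an LLM would be given."""
    lines = [f"{d.label} at ({d.x1:g}, {d.y1:g}, {d.x2:g}, {d.y2:g})" for d in dets]
    for i, a in enumerate(dets):
        for b in dets[i + 1:]:
            if boxes_overlap(a, b):
                lines.append(f"{a.label} overlaps {b.label}")
    return "\n".join(lines)


# Made-up coordinates for one photo.
child = Detection("child", 100, 50, 300, 400)
cat = Detection("cat", 150, 200, 280, 330)
prompt = detections_to_prompt([child, cat])
```

A cat standing in front of the child and a cat being held by the child could both produce exactly this prompt, so the LLM has nothing to tell them apart with.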
In contrast, LLaVA feeds the image into a visual encoder called CLIP, which doesn't output anything human-comprehensible; it just gives out a bunch of numbers that have something to do with the contents of the image. But those numbers can be fed into the LLM along with the text, and the image encoder and the LLM can be trained together.
If the training works right, and they have enough training data for the model to figure out the difference between a cat sitting on a lap and one being held, they end up with a model that can figure out that the child is holding the cat.
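Here is a toy sketch of the "numbers fed into the LLM along with text" part. As far as I understand it, LLaVA maps the visual features into the LLM's embedding space with a learned projection, but treat that as my reading rather than a spec; the sizes, the random values, and the `project` helper are placeholders for illustration only.

```python
import random


def project(features, weight):
    """Multiply an (n x d_vis) feature list by a (d_vis x d_llm) weight matrix."""
    d_vis = len(weight)
    d_llm = len(weight[0])
    return [
        [sum(f[k] * weight[k][j] for k in range(d_vis)) for j in range(d_llm)]
        for f in features
    ]


random.seed(0)
D_VIS, D_LLM = 6, 4  # toy sizes; real models use far larger dimensions

# Stand-in for the visual encoder's output: 3 image patches, D_VIS numbers each.
image_features = [[random.gauss(0, 1) for _ in range(D_VIS)] for _ in range(3)]

# Stand-in for a learned projection (random here, trained in a real model).
projection = [[random.gauss(0, 0.1) for _ in range(D_LLM)] for _ in range(D_VIS)]

# Stand-in for the embeddings of 5 text tokens from the prompt.
text_embeddings = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(5)]

# After projection the image patches look like extra "tokens", so they can be
# placed in the same input sequence as the text.
image_tokens = project(image_features, projection)
llm_input = image_tokens + text_embeddings
```

Nothing in `llm_input` is human-readable, which is the point: the model gets the raw visual signal instead of a lossy text summary, and training decides what to do with it.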
For pure text, that's kind of how e5-mistral works: https://huggingface.co/intfloat/e5-mistral-7b-instruct. Or yeah, just use CLIP like another commenter suggests...
https://neuml.hashnode.dev/similarity-search-with-images
This allows queries with both text and images.
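The reason both query types work is that text and images get embedded into the same vector space, so search is just nearest-neighbor lookup. A minimal sketch with hand-written 3-dimensional vectors standing in for real model embeddings (the file names, vectors, and `search` helper are invented for illustration, and this is not the API of the linked library):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the names of the top_k indexed images most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in index.items()),
        reverse=True,
    )
    return [name for _, name in scored[:top_k]]


# Pretend image embeddings (assumption: a real model would produce these).
index = {
    "airplane.jpg": [0.9, 0.1, 0.0],
    "lego_airplane.jpg": [0.8, 0.2, 0.1],
    "cat.jpg": [0.0, 0.1, 0.9],
}

# Pretend embedding of the text query "airplane".
text_query = [1.0, 0.0, 0.0]
# Pretend embedding of a query image showing a cat.
image_query = [0.05, 0.1, 0.95]

text_results = search(text_query, index)
image_results = search(image_query, index, top_k=1)
```

The search code never needs to know whether the query vector came from text or from an image; only the embedding step differs.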
I would not want to lose that functionality