I do not know if it works as well as Gemini, but Salesforce (of all places) has a model, BLIP, that does something similar.
What's "neat" about the Salesforce one is that you can run it locally and just iterate it over as many images as you feel like.
For instance, it should be possible to take a movie, pull a hundred frames out of the H.265 file, have the Salesforce model describe what is happening at each of those moments, and then use the descriptions to build an index.
That's just ONE use for it, and I can think of dozens.
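A rough sketch of the movie-indexing idea, not something I have run end to end. It assumes ffmpeg is on the PATH and that you already know the movie's duration in seconds. The function names (sample_timestamps, ffmpeg_frame_cmd, extract_frames, build_index) are made up for illustration, and caption_fn stands in for whatever captioning model you plug in.

```python
import subprocess
from pathlib import Path


def sample_timestamps(duration_s, n_frames=100):
    """Evenly spaced timestamps, one at the midpoint of each of n_frames slices."""
    step = duration_s / n_frames
    return [round((i + 0.5) * step, 3) for i in range(n_frames)]


def ffmpeg_frame_cmd(video, t, out_path):
    """Build an ffmpeg command that grabs a single frame at t seconds."""
    # -ss before -i seeks on the input; -frames:v 1 writes exactly one image.
    return ["ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video,
            "-frames:v", "1", out_path]


def extract_frames(video, duration_s, out_dir, n_frames=100):
    """Write n_frames JPEGs into out_dir and return (timestamp, path) pairs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    frames = []
    for i, t in enumerate(sample_timestamps(duration_s, n_frames)):
        path = out / f"frame_{i:04d}.jpg"
        subprocess.run(ffmpeg_frame_cmd(video, t, str(path)),
                       check=True, capture_output=True)
        frames.append((t, path))
    return frames


def build_index(frames, caption_fn):
    """Pair each timestamp with a caption; caption_fn is a placeholder (path -> str)."""
    return [{"time_s": t, "frame": str(p), "caption": caption_fn(p)}
            for t, p in frames]
```

The resulting list can be dumped to JSON and searched with plain text tools, which is about as simple as an index gets.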
On a 5090 it was able to generate text descriptions of a folder full of approximately 500 images in under a minute. (Anecdotal evidence, admittedly.)
https://huggingface.co/Salesforce/blip-image-captioning-base
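For the "folder full of images" case, something along these lines should work with the model linked above. This is a minimal sketch, assuming torch, transformers and Pillow are installed. The batch size, the 30-token cap and the list of file extensions are my own guesses rather than tuned values, and the helper names are made up.

```python
from pathlib import Path

MODEL_ID = "Salesforce/blip-image-captioning-base"
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set, extend as needed


def is_image(path):
    """True if the file extension looks like an image we want to caption."""
    return path.suffix.lower() in IMAGE_EXTS


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def caption_folder(folder, batch_size=16):
    """Return {filename: caption} for every image directly inside folder."""
    # Heavy imports live here so the helpers above work without a GPU stack.
    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device)
    model.eval()

    paths = sorted(p for p in Path(folder).iterdir() if is_image(p))
    captions = {}
    for batch in batched(paths, batch_size):
        images = [Image.open(p).convert("RGB") for p in batch]
        inputs = processor(images=images, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=30)
        texts = processor.batch_decode(out, skip_special_tokens=True)
        for p, text in zip(batch, texts):
            captions[p.name] = text.strip()
    return captions
```

Calling caption_folder("some_folder") downloads the weights on first use and then runs entirely locally. Throughput will depend heavily on the GPU and the batch size.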
I just looked up some articles on it here, and it looks like it's fairly old, so YMMV.