Indexing iCloud Photos with AI Using LLaVA and Pgvector

(medium.com)

208 points · CSDude · 2y ago · 35 comments

30 comments · 12 top-level

warangal · 2y ago · 5 in thread

I think the image encoder from CLIP (even the smallest variant, ViT-B/32) is good enough to capture a lot of semantic information and allow natural-language queries once images are indexed. A lot of the work actually goes into integrating existing metadata, such as local directory and date-time, to augment the natural-language query and re-rank the results.

I work on such a tool[0] to enable end-to-end indexing of a user's personal photos, and recently added functionality to index Google Photos too!

[0] https://github.com/eagledot/hachi
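
A rough sketch of the metadata step described above, assuming each match already carries a similarity score from the image encoder plus a path and capture date. The `Photo` record, its field names, and the boost weight are hypothetical and are not taken from hachi.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Photo:
    path: str          # hypothetical: local path of the indexed file
    taken: date        # hypothetical: capture date read from file metadata
    similarity: float  # hypothetical: query-vs-image embedding score


def rerank(photos, directory: Optional[str] = None,
           year: Optional[int] = None, boost: float = 0.2):
    """Re-rank embedding matches, boosting photos whose metadata fits the query."""
    def score(p: Photo) -> float:
        s = p.similarity
        if directory is not None and directory in p.path:
            s += boost
        if year is not None and p.taken.year == year:
            s += boost
        return s
    return sorted(photos, key=score, reverse=True)


photos = [
    Photo("/pics/misc/a.jpg", date(2019, 5, 1), 0.31),
    Photo("/pics/norway/b.jpg", date(2017, 7, 9), 0.28),
    Photo("/pics/norway/c.jpg", date(2019, 7, 9), 0.25),
]
ranked = rerank(photos, directory="norway", year=2017)
print([p.path for p in ranked])
```

With no metadata hints the order is purely by similarity; with hints, photos matching the directory and year move to the front.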

3abiton · 2y ago

I would love to see some benchmark on that

warangal · 2y ago

I keep forgetting to put up a benchmark for a standard dataset like Flickr30k! But a ballpark figure should be about 100 ms per image on a quad-core CPU. I also generate an ETA during indexing and provide some meta-information to make it easy to see what is being indexed.
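
For a sense of scale, a minimal sketch of how an indexing ETA could be derived from the per-image time so far. The function name and the 50,000-photo library size are assumptions for illustration; only the 100 ms figure comes from the comment.

```python
def eta_seconds(done: int, total: int, elapsed_s: float) -> float:
    """Estimate remaining time from the average per-image time so far."""
    if done <= 0:
        raise ValueError("need at least one processed image")
    per_image = elapsed_s / done
    return per_image * (total - done)


# 1,000 of a hypothetical 50,000 images indexed in 100 s, i.e. 100 ms per image.
remaining = eta_seconds(done=1_000, total=50_000, elapsed_s=100.0)
print(f"{remaining / 60:.1f} min remaining")
```

At that rate, a library of this size would take well over an hour on the CPU alone.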

Zetobal · 2y ago

ViT-H and ViT-g are fine; I wouldn't use B anymore.

warangal · 2y ago

It is quite possible that the B variant is not enough for some scenarios. An earlier version also included video search; the frames used for indexing were sometimes blurry (lacking fine detail), and these frames would generally score higher for naive natural-language queries. I only tested with the B variant.

But I resolved that problem up to a point by adding a linear layer trained to discard such frames, and it was less costly than running a bigger variant for my use case.
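
A minimal sketch of what such a filter could look like, assuming it is a single linear layer followed by a sigmoid and trained on frame embeddings labelled keep or discard. The toy 2-D vectors, learning rate, and function names are hypothetical; the actual layer in the tool may differ.

```python
import math


def train_linear_filter(features, labels, lr=0.5, epochs=500):
    """Fit one linear layer plus sigmoid with plain stochastic gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def keep_frame(x, w, b, threshold=0.5):
    """Return True when the frame is predicted to be worth indexing."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) >= threshold


# Toy 2-D stand-ins for frame embeddings; label 1 = keep, 0 = discard (blurry).
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y = [1, 1, 0, 0]
w, b = train_linear_filter(X, y)
print(keep_frame([0.85, 0.15], w, b), keep_frame([0.15, 0.85], w, b))
```

Because the filter reuses embeddings that are already computed for indexing, it adds only one dot product per frame.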

burningion · 2y ago

Can you give details as to why not?

jsmith99 · 2y ago · 4 in thread

Immich (a self-hosted Google Photos alternative) has been using CLIP models for smart search for a while, and anecdotally it seems to work really well: it indexes fast and results are of similar quality to the giant SaaS providers.

eurekin · 2y ago

I learned it the hard way while trying to index a few TBs of photos. I couldn't get it to finish; it always got stuck after 15-ish hours.

diggan · 2y ago

> has been using CLIP models for smart search for a while and anecdotally seems to work really well [..] results are of similar quality to the giant SaaS providers

I'm not super familiar with how the results for the "giant SaaS providers" are, but the demo instance of Immich doesn't seem to do it very well.

Example query for "airplane": https://demo.immich.app/search?q=airplane&clip=true

Even the fourth result seems to rank higher than photos of actual airplanes, and most of the results aren't actually airplanes at all.

Again, not sure how that compares with other providers, but on Google Photos (as one example I am familiar with), searching for "airplane" shows me photos taken of airplanes, or photos taken from the inside of an airplane. Even Lego airplanes seem to show up correctly, and none of the photos are incorrectly shown as far as I can tell.

jsmith99 · 2y ago

I’ve just tried that and it’s true, although on my instance searching ‘airplane’ gives good results. I wonder if it’s due to an insufficient number of images in the demo? I also took the advice in the forums to tweak the exact model version used.

ninja3925 · 2y ago

We use CLIP internally (large US tech company) and it works very well at a large scale

clord · 2y ago · 3 in thread

Is anyone aware of a model that is trained to give photos a quality rating? I have decades of RAW files sitting on my server that I would love to pass over and tag those that are worth developing more. Would be nice to make a short list.

joshvm · 2y ago

Both Google and Apple have implemented models that aim to take the "best" picture from a video sequence, like a live photo.

Have a look at MUSIQ and NIMA.

https://github.com/google-research/google-research/tree/mast...

https://blog.research.google/2022/10/musiq-assessing-image-a...

twoWhlsGud · 2y ago

So I think some sort of hybrid between object recognition (like what is being discussed here as part of the workflow) and standard image processing could be helpful there. E.g. it's not absolute sharpness that you're looking for, it's the subject being sharp (and possibly sharper than in other photos of the same subject from the same time period).
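
One way to make this concrete is to measure sharpness only inside the subject's bounding box. The sketch below uses the variance of a Laplacian, a common sharpness heuristic that is swapped in here as an assumption; the box is assumed to come from some object detector, and the toy images are made up.

```python
def laplacian_variance(gray, box):
    """Variance of a 4-neighbour Laplacian inside box = (top, left, bottom, right).

    bottom and right are exclusive; the box must be at least 3x3 pixels.
    """
    top, left, bottom, right = box
    values = []
    # Skip the outermost pixels of the box so every neighbour lies inside it.
    for r in range(top + 1, bottom - 1):
        for c in range(left + 1, right - 1):
            lap = (gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1]
                   + gray[r][c + 1] - 4 * gray[r][c])
            values.append(lap)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Toy 6x6 grayscale images: a checkerboard (hard edges) and a smooth ramp.
sharp = [[255 if (r + c) % 2 == 0 else 0 for c in range(6)] for r in range(6)]
soft = [[20 * c for c in range(6)] for r in range(6)]
subject_box = (0, 0, 6, 6)  # hypothetical detector output covering the subject

print(laplacian_variance(sharp, subject_box) > laplacian_variance(soft, subject_box))
```

Comparing this score across a burst of the same subject would give the "sharper than the other photos" ranking, without penalising a deliberately blurred background.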

tlack · 2y ago

Try LAION Aesthetics: https://laion.ai/blog/laion-aesthetics/

GaggiX · 2y ago · 2 in thread

For indexing images, it is probably convenient to calculate the embeddings directly with the CLIP image encoder and retrieve them using the CLIP text encoder.
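
The retrieval side of that idea can be sketched without any model at all: once image and text embeddings live in the same space, search is a nearest-neighbour lookup by cosine similarity. The 3-D vectors below are toy stand-ins for real CLIP embeddings, and the file names are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec, index, top_k=2):
    """Rank indexed images by cosine similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]


# Toy vectors standing in for image-encoder output computed at index time.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.1, 0.9, 0.2],
    "lego_plane.jpg": [0.2, 0.8, 0.5],
}
# Stand-in for the text-encoder embedding of the query "airplane".
query = [0.0, 1.0, 0.1]
print(search(query, index))
```

In a real setup the linear scan would be replaced by a vector index such as pgvector, but the ranking principle is the same.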

speedgoose · 2y ago

Going through an LLM may improve the performance. From my experience working with Stable Diffusion 1.*, CLIP is not very intelligent and a 7B quantised LLM could help a lot.

kkielhofner · 2y ago

I second this. CLIP, BLIP, etc. alone are light but pretty dumb for captioning in the grand scheme of things.

CLIP is reasonable for reverse image search via embeddings, but many of the models in this class don't work very well for captioning because they're trained on COCO, etc., and they're pretty generic.

voiper1 · 2y ago · 2 in thread

Is there a state of the art for face matching? I love being able to put in a name and find all the photos they are in.

I don't even mind some training of "are these the same or not"

That's one of the conveniences that means I'm still using Google Photos...

tlack · 2y ago

I haven't used it for search, but I believe Insightface's embeddings can be used for this purpose. https://insightface.ai/
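
A minimal sketch of name lookup on top of face embeddings, assuming each face has already been turned into a vector by some face model. This does not use the InsightFace API; the vectors, names, and the 0.6 threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def identify(face_vec, known, threshold=0.6):
    """Return the best-matching name, or None if nothing clears the threshold."""
    best_name, best_score = None, threshold
    for name, ref in known.items():
        score = cosine(face_vec, ref)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy reference embeddings, one per person the user has labelled.
known = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}
print(identify([0.9, 0.1, 0.2], known), identify([0.5, 0.5, -0.7], known))
```

The "are these the same or not" confirmations mentioned above would map naturally onto adding or correcting entries in `known`.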

eurekin · 2y ago

One long-winded way could be to use Lightroom for that. It finds and groups faces. Also, maybe it can save that info into the file itself (with XMP).

reacharavindh · 2y ago · 1 in thread

Nice work. I’m thinking it could even be tinkered with further by incorporating location information, date and time, and even people (facial recognition) data from the photos, and having an LLM write one “metadata text” for every photo. This way one can query “person X traveling with Y to Norway about 7 years ago” and quickly get useful results.
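
As an illustration of the "metadata text" idea, the sketch below composes one searchable sentence per photo from a caption and its metadata. A fixed template stands in for the LLM here, and every field name and value is hypothetical.

```python
from datetime import date


def metadata_text(caption, taken, place=None, people=()):
    """Compose one searchable sentence from a caption and photo metadata."""
    parts = [caption.rstrip(".")]
    if people:
        parts.append("with " + " and ".join(people))
    if place:
        parts.append("in " + place)
    parts.append("taken on " + taken.isoformat())
    return ", ".join(parts) + "."


text = metadata_text(
    "Two people hiking above a fjord.",  # e.g. a caption from a vision model
    date(2017, 7, 9),
    place="Norway",
    people=["X", "Y"],
)
print(text)
```

The resulting sentence can be embedded with any text embedding model, so a query mentioning people, place, and rough date has something to match against.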

ssijak · 2y ago

That is exactly what I wanted to do for my Apple Photos library, but I have not yet had the time to spend on it. Apple Photos search is just bad, very bad.

behnamoh · 2y ago · 1 in thread

I'm still tryna understand the difference between multimodal models like LLaVA and projects like JARVIS that connect LLMs to other Hugging Face models (including object detection models) or CLIP. Is a multimodal model doing this under the hood?

michaelt · 2y ago

Object detection models have human-comprehensible outputs. You can feed in a picture and it'll tell you that there's a child and a cat, and it'll draw bounding boxes around them. You can pass that info into an LLM if you want.

The downside to that approach is the LLM can't tell whether the cat is standing in front of the child, or sitting on the child, or the child is holding the cat; the input just tells it there's a child, and a cat, and their bounding boxes overlap.
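
The limitation is easy to see if you write out what the LLM would actually receive. In this sketch the detector output is hard-coded, and the labels, boxes, and text format are all hypothetical.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def describe(detections):
    """Turn detector output into plain text that could be pasted into an LLM prompt."""
    lines = [f"{label} at {box}" for label, box in detections]
    for i, (label_a, box_a) in enumerate(detections):
        for label_b, box_b in detections[i + 1:]:
            if overlap_area(box_a, box_b) > 0:
                lines.append(f"{label_a} and {label_b} overlap")
    return "\n".join(lines)


detections = [("child", (40, 20, 200, 300)), ("cat", (120, 150, 260, 280))]
print(describe(detections))
```

The text says the boxes overlap, but nothing in it distinguishes a cat being held from a cat sitting on a lap or standing in front of the child.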

In contrast, LLaVA feeds the image into a visual encoder called 'CLIP', which doesn't output anything human-comprehensible; it just gives out a bunch of numbers that have something to do with the contents of the image. But the numbers can be fed into the LLM along with text, and they can train the image encoder and the LLM together.

If the training works right, and they have enough training data for the model to figure out the difference between a cat sitting on a lap and one being held, they end up with a model that can figure out that the child is holding the cat.
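
A toy sketch of the "numbers fed into the LLM along with text" step, assuming a learned projection maps each visual feature vector to the LLM's embedding size before it is concatenated with the prompt's token embeddings. All sizes and values below are made up, and real models use far larger dimensions.

```python
def project(features, weight):
    """Map each visual feature vector into the LLM embedding size (matrix product)."""
    return [[sum(f * w for f, w in zip(vec, col)) for col in zip(*weight)]
            for vec in features]


# Toy sizes: 2 image patches with 3-D visual features, LLM embeddings of size 2.
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]     # visual encoder output
projection = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]       # 3 x 2, learned in training
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 prompt tokens

image_tokens = project(image_features, projection)
llm_input = image_tokens + text_embeddings  # one sequence of 5 vectors
print(image_tokens, len(llm_input))
```

Because the image arrives as vectors rather than as a list of labels and boxes, the spatial relationships stay available for the model to learn from.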

viraptor · 2y ago

Since LLaVA is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through 3 embeddings (LLaVA internal, text, MiniLM), could you use a non-final layer of LLaVA as your vector? It would probably require a bit of fine-tuning though.

For pure text, that's kind of how e5-mistral works: https://huggingface.co/intfloat/e5-mistral-7b-instruct Or yeah, just use CLIP like another commenter suggests...

dmezzetti · 2y ago

Here is an example that builds a vector index of images using the CLIP model.

https://neuml.hashnode.dev/similarity-search-with-images

This allows queries with both text and images.

vladgur · 2y ago

This is pretty awesome, but I’m curious if it can be used to “enhance” the existing iCloud search, which is great at identifying people in my photos, even kids as they age.

I would not want to lose that functionality.

diggan · 2y ago

Slightly related: are there any good photo management alternatives to PhotoPrism that leverage more recent AI/ML technologies and provide a GUI for end users?

say_it_as_it_is · 2y ago

I really appreciate itch-scratching posts like these. The life story is as important as the workflow.
