[1] https://staff.fnwi.uva.nl/r.vandenboomgaard/IPCV/_downloads/...
[2] https://www.cs.utexas.edu/users/dana/Swain1.pdf
[3] http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03....
[4] http://www-inst.eecs.berkeley.edu/~cs294-6/fa06/papers/niste...
It's been a couple of years since my computer vision course (it was my favorite course at university), but isn't SIFT a bit '99? Aren't there better methods now, such as neural networks, for feature description?
You could then use classical text indexing on the text, perhaps with a topic model like LDA. An image with a plane in it would then be indexed under "plane" via the output of the neural network, but would also rank at or near the top for the query "flight" via the topic model.
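A minimal sketch of that idea: NN-generated tags go into a classical inverted index, and a hand-made related-terms map stands in for what a topic model like LDA would actually learn. All file names and tags here are hypothetical.

```python
# Tags produced by a (hypothetical) image-annotation network.
image_tags = {
    "img_001.jpg": ["plane", "sky"],
    "img_002.jpg": ["dog", "grass"],
}

# Classical inverted index: term -> set of images.
index = {}
for image, tags in image_tags.items():
    for tag in tags:
        index.setdefault(tag, set()).add(image)

# Stand-in for topic-model output: terms sharing a topic with the query.
related = {"flight": ["plane", "sky"]}

def search(query):
    # Expand the query with topically related terms, then union the postings.
    terms = [query] + related.get(query, [])
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits

print(search("flight"))  # the plane image surfaces via topic expansion
```

In a real system the `related` map would be derived from the topic-word distributions LDA learns over a text corpus, not written by hand.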
Ditto for word2vec or para2vec over those words; the benefit is that you can bring the relational knowledge contained in the textual training data (Wikipedia or something else) to bear on the problem. E.g. a golf club and a baseball glove might not be correlated in the neural network that annotated the images, but might be correlated in a text-based model trained on Wikipedia, so a query for "sport" might bring both images up.
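To make that concrete, here is a toy cosine-similarity sketch. The 3-d "word vectors" are hand-crafted stand-ins for pretrained word2vec embeddings, chosen so that "sport" sits near both pieces of gear even though the gear tags are only loosely related to each other.

```python
import math

# Hand-crafted stand-ins for pretrained word2vec embeddings.
VECS = {
    "sport":          [0.9, 0.1, 0.0],
    "golf_club":      [0.7, 0.7, 0.0],
    "baseball_glove": [0.7, 0.0, 0.7],
    "flower":         [0.0, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# A query for "sport" ranks both gear tags above an unrelated tag.
for tag in ("golf_club", "baseball_glove", "flower"):
    print(tag, round(cosine(VECS["sport"], VECS[tag]), 3))
```

With real embeddings (e.g. gensim's `Word2Vec` trained on Wikipedia) the geometry comes from co-occurrence statistics rather than hand-tuning, but the retrieval step is the same nearest-neighbor lookup.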
The big players like Google already have caption generation that's capturing relationships between objects.[1]
[0]http://scikit-image.org/docs/dev/auto_examples/features_dete... [1]https://arxiv.org/pdf/1411.4555.pdf
What is great about SIFT and the more modern ORB, BRISK, and AKAZE is that they are fast, and given an appropriate implementation they can work about as well as a neural network would. I haven't researched NN-based computer vision a whole lot, but it seems it might be slower at feature detection/description than the traditional approaches. If that's the case, then NNs won't work that well for live or near-live video processing.
You're right that bag of words with SIFT is not state of the art; deep learning dominates computer vision approaches these days.
[1] http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/ba...
https://www.elastic.co/elasticon/conf/2016/sf/opensource-con...
But with some of the machine vision APIs (Google Cloud etc.) you could extend this to other "features".
Details: http://blog.sandeepchivukula.com/posts/2016/03/06/photo-sear...
Bucketing means that you can't get the granularity of a specific shade of a color.
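A quick sketch of that granularity loss: quantizing each RGB channel into 4 buckets maps two visibly different shades of red into the same bin, so the index can no longer tell them apart.

```python
# Quantize each 0-255 channel into a small number of buckets.
def bucket(rgb, buckets=4):
    step = 256 // buckets
    return tuple(c // step for c in rgb)

crimson = (220, 20, 60)
bright_red = (250, 10, 40)
print(bucket(crimson), bucket(bright_red))  # same bucket, different shades
```

Finer buckets recover more shade detail at the cost of a larger, sparser index.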
http://sujitpal.blogspot.com/2016/06/comparison-of-image-sea...
I describe how I use transfer learning to generate image vectors for my butterfly images from the Caffe reference model trained on ImageNet. There are some other approaches too, but probably not as interesting to this audience.
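Once you have per-image vectors, the search itself is just nearest neighbors in feature space. In this sketch random vectors stand in for the CNN activations (in the post they would come from the Caffe reference model); only the lookup step is shown.

```python
import numpy as np

# Random stand-ins for CNN feature vectors, one 256-d vector per image.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 256))

def top_k(query_idx, vectors, k=5):
    # Cosine similarity: normalize rows, then a single matrix-vector product.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v[query_idx]
    return np.argsort(-sims)[:k]  # indices of the best-matching images

print(top_k(0, vectors))  # the query image itself ranks first
```

At larger scale you would swap the brute-force scan for an approximate nearest-neighbor index, but the interface stays the same.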
@GrantS - thanks for the intro and the links, and for pointing out that I was using BoVW incorrectly. Not an image person, trying to get into it from a search/NLP background.
@infinitone and @deckar01 - I measured the goodness of search by checking how many times the query butterfly was the #1 result. I agree it's pretty hard to distinguish butterflies from one another, but I needed something to work with, and that was as good as any. I did not want a test set that crosses subject areas, such as flowers and cars, where a search for a red flower might bring back a red car.
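That evaluation is precision at 1. A sketch of the computation, with made-up result lists for illustration:

```python
# Hypothetical ranked results: query butterfly -> returned butterflies.
results = {
    "monarch":     ["monarch", "viceroy", "queen"],
    "swallowtail": ["birdwing", "swallowtail"],
    "viceroy":     ["viceroy", "monarch"],
}

# Fraction of queries whose #1 result is the query itself.
p_at_1 = sum(1 for q, r in results.items() if r and r[0] == q) / len(results)
print(p_at_1)
```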
@chrischen - if you got rid of the bucketing portion of my pipeline, I think you might achieve what you are after. But if you are only looking for an exact color match, then hashing might be cheaper.
@rcarmo - thanks for the reference to pHash; I think it might work better for near-duplicate detection kinds of cases. Will check it out.
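For intuition, here is a sketch of the simpler cousin of pHash, an average hash (pHash proper uses a DCT; this version just block-averages). A slightly noisy copy of an image lands a small Hamming distance away, while an unrelated image lands far away.

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Block-average a 2-D grayscale array down to hash_size x hash_size,
    then threshold at the mean to get a 64-bit fingerprint."""
    h, w = img.shape
    img = img[:h - h % hash_size, :w - w % hash_size]
    blocks = img.reshape(hash_size, img.shape[0] // hash_size,
                         hash_size, img.shape[1] // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def hamming(a, b):
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64)).astype(float)
near_dup = img + rng.normal(0, 5, size=img.shape)   # slightly noisy copy
unrelated = rng.integers(0, 256, size=(64, 64)).astype(float)

print(hamming(average_hash(img), average_hash(near_dup)))    # small
print(hamming(average_hash(img), average_hash(unrelated)))   # large
```

pHash's DCT makes it more robust to scaling and compression than this sketch, which is why it is the usual choice for near-duplicate detection.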