NeuralTalk2: Efficient Image Captioning code in Torch, runs on GPU

(github.com)

66 points · rshaban · 10y ago · 23 comments

karpathy · 10y ago

People might be interested in this video by @kcimc, where Kyle runs the pretrained model forward in real time on his laptop while walking around the streets of Amsterdam:

https://vimeo.com/146492001

Something people don't fully appreciate about neural networks is that their performance is quite a strong function of their training data. In this case the training data is taken from the MS COCO dataset (http://mscoco.org/explore/). That's why, for example, when Kyle points the camera at himself the model says something along the lines of "man with a suit and tie" - there is a very strong correlation between that kind of an image in the data, and the presence of a suit and tie. With such a strong correlation the model doesn't have a chance to tease the two concepts apart. A similar problem would come up with an ImageNet model, where a similar image might be classified as "seatbelt", because there is no Person class there, and shots of people in that pose usually come from the seatbelt class. It happens to be the most similar concept in the data it has seen. Another example is if you pointed the model at trees it might hallucinate a giraffe, since the two are strongly correlated in the data. Or when Kyle points the camera at the ground I'm fully expecting it to say relatively random things, because I know that those kinds of images are very rare in the training data.

In other words, a lot of the "mistakes" are limitations of training data and its variety rather than something to do with the model itself, and it's easier to recognize this if you're familiar with the training data and its classes and distribution.

gabipurcaru · 10y ago

Is there a way to let the algorithm not make a guess if it doesn't find a strong match? I guess it would be better to not output anything than to output gibberish, right?

rasz_pl · 10y ago

This algorithm badly needs a temporal dimension, some kind of short-term memory that lets it interpret using context. At the very least it would filter out freaky readings of a train station when looking at the ground; in the best case it would enable building a deeper understanding of its surroundings. Maybe not even memory, but a Bayes filter to prime the next estimation. Then throw movies at it.

Even as it is, this could be adapted for the blind. I can imagine an app that simply builds a model of what it sees and answers questions or warns about stairs/walls/roads/other dangers. There isn't all that much needed to make it as clever as a guide dog.
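To make the Bayes filter idea concrete, here is a minimal sketch of a discrete filter over scene labels. It is only an illustration under assumptions: the function name `bayes_filter_step`, the `stay` parameter, the label names, and the choice to treat the per-frame model output as a likelihood are all hypothetical and not part of NeuralTalk2.

```python
def bayes_filter_step(prev_belief, frame_probs, stay=0.9):
    """One predict/update step of a discrete Bayes filter over scene labels.

    prev_belief: dict label -> probability (assumed to sum to 1).
    frame_probs: dict label -> per-frame score, used here as a likelihood.
    stay: assumed probability that the scene label does not change between frames.
    """
    labels = list(prev_belief)
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two labels")

    # Predict: keep the label with probability `stay`, otherwise move
    # uniformly to one of the other labels.
    predicted = {
        label: stay * prev_belief[label]
        + (1.0 - stay) * (1.0 - prev_belief[label]) / (n - 1)
        for label in labels
    }

    # Update: weight the prediction by the current frame's evidence.
    unnormalized = {l: predicted[l] * frame_probs.get(l, 0.0) for l in labels}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The frame carried no usable evidence; fall back to the prediction.
        return predicted
    return {l: v / total for l, v in unnormalized.items()}


# Three consistent frames followed by one outlier frame.
belief = {"street": 0.5, "train station": 0.5}
frames = [{"street": 0.8, "train station": 0.2}] * 3
frames.append({"street": 0.3, "train station": 0.7})
for frame in frames:
    belief = bayes_filter_step(belief, frame)
```

With these made-up numbers the single outlier frame lowers the belief in "street" but does not flip it, which is the kind of priming described above.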

chimtim · 10y ago

Thanks a lot for your hard work! This is amazing work. I was wondering if you had any sense of which one of the models works best (or when one is better than another) -- Stanford, Toronto (Show, Attend and Tell), and Google (Show and Tell)?

fchollet · 10y ago

Here's the MS COCO leaderboard: http://mscoco.org/dataset/#captions-leaderboard

Google's Show and Tell seems considerably superior to competing approaches.

ilurk · 10y ago

Is it your opinion that more data will solve this?

If so, I'm guessing Google, maybe Facebook too, has plenty of data. What else is holding them back?

karpathy · 10y ago

In image captioning specifically we are in dire need of data. To give you an idea, MS COCO is ~100K images. ImageNet Challenge is 1.2 million. The dataset is quite minuscule, which restricts the richness of models we can explore, and forces strong regularization concerns. The place I normally like to be is when my several hundred million parameter nets are underfitting - that's where neural nets really shine - and MS COCO is not that.

Also it's not only the size of the dataset, it's also the size/variety in the label space. ImageNet is quite comprehensive, with many varied labels. MS COCO is quite biased towards a narrow ~hundred classes.

I'd love to see a properly large dataset of images "from the wild", with no restrictions on content (unlike what is done in MS COCO), annotated with sentences. From my experience with adding data to models in these situations I'm quite certain this would work _significantly_ better.

comark · 10y ago

Could you upload the script "CameraImage" used in the video to do the captioning in real time? It's not present in the GitHub repository :(

karpathy · 10y ago

Apparently it's a "terrible hack" with text files: https://twitter.com/kcimc/status/668111582296190976

emcq · 10y ago

Are there any interesting approaches you've seen to help reduce overfitting with these class imbalance issues?

karpathy · 10y ago

It's not exactly overfitting. It's more that the model is seeing an image that is out of sample, but our approach _forces_ it to say something, so it will do its best but usually fail.

One simple low-hanging-fruit approach would be to include a large repository of additional data (e.g. all of ImageNet) and label it all as a "garbage" class. This way the model could at least learn to distinguish the kinds of images in its training data from the universe of images, and this could be used as one proxy of confidence.

Another simple proxy is to look at the probability of the generated sample, since usually the model tends to assign more diffuse probabilities in more uncertain cases. But this is also not a very clean approach for various reasons.
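As a rough sketch of that probability proxy: score a sampled caption by its average per-token log-probability and decline to output it below a cutoff. The names `mean_log_prob` and `caption_or_abstain` and the threshold value are assumptions for illustration; they are not part of NeuralTalk2, and a usable threshold would have to be tuned on held-out data.

```python
import math


def mean_log_prob(token_probs):
    """Average per-token log-probability of a sampled caption.

    token_probs: probabilities the model assigned to each emitted token,
    each assumed to lie in (0, 1].
    """
    if not token_probs:
        raise ValueError("caption has no tokens")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def caption_or_abstain(caption, token_probs, threshold=-1.5):
    """Return the caption if its length-normalized score clears the threshold,
    otherwise None (the threshold here is an arbitrary illustrative value)."""
    score = mean_log_prob(token_probs)
    return caption if score >= threshold else None


# A peaked sample is kept; a diffuse one is dropped.
kept = caption_or_abstain("a man in a suit", [0.9, 0.8, 0.9])
dropped = caption_or_abstain("a train station", [0.1, 0.05, 0.1])
```

Averaging over tokens keeps long captions from being penalized just for their length, though it does not address the other problems with this proxy.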

Another, and probably most appealing, approach would be something along the lines of Bayesian Neural Networks, ensembles, or approximations with dropout, where the disagreement between the predictions of all submodels can be used.
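One way to turn that disagreement into a number, sketched under assumptions: given one probability vector per submodel (ensemble member or dropout pass), take the entropy of the averaged prediction minus the average entropy of the individual predictions. The function names are hypothetical, and this particular measure is one common choice rather than the method used in the repository.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def disagreement(dists):
    """Entropy of the mean prediction minus the mean entropy of the members.

    dists: list of probability vectors, one per submodel, all the same length.
    The result is 0 when all members agree and grows as they diverge.
    """
    if not dists:
        raise ValueError("need at least one distribution")
    k = len(dists)
    n = len(dists[0])
    if any(len(d) != n for d in dists):
        raise ValueError("all distributions must have the same length")
    mean = [sum(d[i] for d in dists) / k for i in range(n)]
    return entropy(mean) - sum(entropy(d) for d in dists) / k


agree = disagreement([[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]])
split = disagreement([[1.0, 0.0], [0.0, 1.0]])
```

In the first call every member gives the same distribution, so the score is zero; in the second, two fully confident members contradict each other and the score reaches its maximum for two classes, log 2.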

mrdrozdov · 10y ago

I don't think that the seatbelt example is overfitting. It's a consequence of there not being a "correct" label available. In many ways I guess the seatbelt would be the best option.

stuff12344321 · 10y ago

Thanks as always :)

Do you plan to add beam search?

ram21 · 10y ago

Karpathy just added beam search.
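For readers unfamiliar with the term, here is a generic sketch of beam search over a next-token model. It is not the NeuralTalk2 implementation: `beam_search`, `toy_model`, the `<end>` token, and the toy probabilities are all made up for illustration.

```python
import math


def beam_search(next_log_probs, beam_size=2, max_len=10, end_token="<end>"):
    """Keep the `beam_size` best partial captions at each step.

    next_log_probs: callable taking a list of tokens and returning a dict
    token -> log-probability of that token coming next.
    Returns (tokens, total_log_prob) for the best hypothesis found.
    """
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Hypotheses that never emitted the end token still compete.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical next-token table in which the greedy first word is a trap."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"giraffe": 0.9, "tree": 0.1},
    }
    probs = table.get(tuple(prefix), {"<end>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2)
```

In this toy table, greedy decoding (beam size 1) commits to "a" and ends with probability 0.3, while a beam of 2 keeps "the" alive and finds "the giraffe" with probability 0.36.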

ram21 · 10y ago

Thank you. You mentioned that you plan on adding a re-ranker. Is that a re-ranker that encourages diversity? Just like what is done in this paper: http://arxiv.org/pdf/1510.03055.pdf
