Something people don't fully appreciate about neural networks is that their performance is quite a strong function of their training data. In this case the training data is taken from the MS COCO dataset (http://mscoco.org/explore/). That's why, for example, when Kyle points the camera at himself the model says something along the lines of "man with a suit and tie" - there is a very strong correlation between that kind of an image in the data, and the presence of a suit and tie. With such a strong correlation the model doesn't have a chance to tease the two concepts apart. A similar problem would come up with an ImageNet model, where a similar image might be classified as "seatbelt", because there is no Person class there, and shots of people in that pose usually come from the seatbelt class. It happens to be the most similar concept in the data it has seen. Another example is if you pointed the model at trees it might hallucinate a giraffe, since the two are strongly correlated in the data. Or when Kyle points the camera at the ground I'm fully expecting it to say relatively random things, because I know that those kinds of images are very rare in the training data.
In other words, a lot of the "mistakes" are limitations of training data and its variety rather than something to do with the model itself, and it's easier to recognize this if you're familiar with the training data and its classes and distribution.
This algorithm badly needs a temporal dimension: some kind of short-term memory that lets it interpret each frame in context. At the very least that would filter out freaky readings like "train station" when it is looking at the ground; in the best case it would enable a deeper understanding of its surroundings. Maybe not even memory, just a Bayes filter that uses the previous estimate to prime the next one. Then throw movies at it.
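To make the Bayes filter idea concrete, here is a minimal sketch, not anything from the actual model. It assumes the per-frame output can be reduced to a probability over a fixed set of labels, and that the scene usually stays the same between frames. The label names, the numbers, and the `stay_prob` parameter are all made up for illustration.

```python
def bayes_filter_step(prior, likelihood, stay_prob=0.9):
    """One predict/update step of a discrete Bayes filter over labels.

    prior      -- belief over labels after the previous frame (sums to 1)
    likelihood -- per-label scores from the model for the current frame
    stay_prob  -- assumed probability that the scene label does not change
    """
    n = len(prior)
    # Predict: keep the label with stay_prob, otherwise jump uniformly
    # to one of the other n - 1 labels.
    switch = (1.0 - stay_prob) / (n - 1)
    predicted = [stay_prob * p + switch * (1.0 - p) for p in prior]
    # Update: weight the prediction by the current frame's evidence.
    unnormalized = [pr * lk for pr, lk in zip(predicted, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


labels = ["ground", "train station"]  # hypothetical labels
# Hypothetical per-frame model outputs; the third frame is a spurious reading.
frames = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7], [0.8, 0.2]]

belief = [0.5, 0.5]  # start with no preference
history = []
for frame in frames:
    belief = bayes_filter_step(belief, frame)
    history.append(belief)
    print(labels[belief.index(max(belief))], [round(b, 3) for b in belief])
```

With these made-up numbers the single "train station" frame is not enough to flip the filtered estimate, which is the kind of smoothing the comment is asking for.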
Even as it is, this could be adapted for the blind. I can imagine an app that simply builds a model of what it sees and answers questions or warns about stairs, walls, roads, and other dangers. It wouldn't take all that much to make it as clever as a guide dog.
Google's Show and Tell seems considerably superior to competing approaches.
If so, I'm guessing Google, maybe Facebook too, has plenty of data. What else is holding them back?
Also it's not only the size of the dataset, it's also the size/variety in the label space. ImageNet is quite comprehensive, with many varied labels. MS COCO is quite biased towards a narrow ~hundred classes.
I'd love to see a properly large dataset of images "from the wild", with no restrictions on content (unlike what is done in MS COCO), annotated with sentences. From my experience with adding data to models in these situations I'm quite certain this would work _significantly_ better.
One simple low-hanging-fruit approach would be to include a large repository of additional data (e.g. all of ImageNet) and label it all as a "garbage" class. This way the model could at least learn to distinguish the kinds of images in its training data from the universe of images, and this could be used as one proxy of confidence.
Another simple proxy is to look at the probability of the generated sample, since usually the model tends to assign more diffuse probabilities in more uncertain cases. But this is also not a very clean approach for various reasons.
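A rough sketch of that proxy, assuming you can read off the probability the model assigned to each word it generated. The function name and the example probabilities are hypothetical. Averaging the log-probabilities over the sentence length is my own choice here, so that long captions are not penalized just for being long.

```python
import math


def sequence_confidence(token_probs):
    """Geometric mean of the per-token probabilities of a generated caption.

    token_probs -- probability the model assigned to each emitted token.
    Returns a value in (0, 1]; higher means the model was less diffuse.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_log_prob)


# Hypothetical per-token probabilities for two captions.
confident_caption = [0.9, 0.8, 0.9]
uncertain_caption = [0.3, 0.2, 0.4]

print(sequence_confidence(confident_caption))
print(sequence_confidence(uncertain_caption))
```

This only measures how peaked the model's own distribution was, so it inherits the problem mentioned above: a model can be confidently wrong on images unlike its training data.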
Another, and probably most appealing, approach would be something along the lines of Bayesian Neural Networks, ensembles, or approximations with dropout, where the disagreement between the predictions of all submodels can be used.
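One way to turn "disagreement between submodels" into a number, as a sketch only: take the entropy of the averaged prediction and subtract the average entropy of the individual predictions. This assumes each submodel (an ensemble member, or one stochastic forward pass with dropout left on) outputs a probability distribution over the same labels. The function names and example numbers are made up.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_disagreement(member_probs):
    """Entropy of the mean prediction minus the mean of member entropies.

    member_probs -- one probability distribution per submodel, all over the
                    same labels. The result is about 0 when every submodel
                    predicts the same distribution and grows as they diverge.
    """
    n_members = len(member_probs)
    n_labels = len(member_probs[0])
    mean = [sum(m[i] for m in member_probs) / n_members for i in range(n_labels)]
    mean_member_entropy = sum(entropy(m) for m in member_probs) / n_members
    return entropy(mean) - mean_member_entropy


# Hypothetical predictions from three submodels over two labels.
agreeing = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagreeing = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

print(ensemble_disagreement(agreeing))
print(ensemble_disagreement(disagreeing))
```

For a captioning model the outputs are whole sentences rather than a single label, so in practice you would have to apply something like this per generated word or to some summary of the caption; that part is left open here.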
Do you plan to add beamsearch?
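For anyone unfamiliar with the term, here is a toy sketch of beam search, not the project's code. Instead of always taking the single most likely next word, it keeps the `beam_width` best partial sentences at every step. The `TOY_MODEL` table and `toy_step` function stand in for a real model's next-word distribution and are entirely made up.

```python
import math

END = "<end>"

# Made-up next-word probabilities, keyed by the words generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"man": 0.4, "dog": 0.3, "cat": 0.3},
    ("the",): {"giraffe": 0.9, "tree": 0.1},
}


def toy_step(tokens):
    """Return next-word probabilities; unknown prefixes just end the sentence."""
    return TOY_MODEL.get(tuple(tokens), {END: 1.0})


def beam_search(step_fn, beam_width=3, max_len=10):
    """Return (log-probability, tokens) of the best sentence found."""
    beams = [(0.0, [])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next word.
        candidates = []
        for score, tokens in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == END:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len still count
    return max(finished, key=lambda c: c[0])


greedy_score, greedy_tokens = beam_search(toy_step, beam_width=1)
beam_score, beam_tokens = beam_search(toy_step, beam_width=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy table a beam width of 1 (plain greedy decoding) commits to "a" and ends up with a less probable sentence than a width of 2, which keeps "the" alive long enough to find "the giraffe".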