karpathy · 10y ago

It's not exactly overfitting. It's more that the model is seeing an image that is out of sample, but our approach _forces_ it to say something, so it will do its best but usually fail.

One simple low-hanging-fruit approach would be to include a large repository of additional data (e.g. all of ImageNet) and label it all as a "garbage" class. This way the model could at least learn to distinguish the kinds of images in its training data from the universe of images, and this could be used as one proxy of confidence.
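
As a rough illustration of the "garbage" class idea (a sketch, not code from the comment): the label name `GARBAGE` and both helper functions are hypothetical, and the classifier itself is left out. The point is only that the extra class gives you a P(garbage) output whose complement can serve as one confidence proxy.

```python
GARBAGE = "garbage"  # hypothetical label name for out-of-domain images


def add_garbage_class(in_domain, background):
    """Merge labeled in-domain examples with background images.

    in_domain:  list of (item, label) pairs from the real training set
    background: list of unlabeled items (e.g. images from another corpus),
                all of which receive the GARBAGE label
    """
    return list(in_domain) + [(item, GARBAGE) for item in background]


def in_domain_confidence(class_probs):
    """Confidence proxy: 1 - P(garbage) from the classifier's output dict."""
    return 1.0 - class_probs.get(GARBAGE, 0.0)


train = add_garbage_class([("a.jpg", "dog")], ["b.jpg", "c.jpg"])
score = in_domain_confidence({"dog": 0.2, GARBAGE: 0.8})
```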

Another simple proxy is to look at the probability of the generated sample, since usually the model tends to assign more diffuse probabilities in more uncertain cases. But this is also not a very clean approach for various reasons.
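
A minimal sketch of this proxy, assuming you can read off the probability the model assigned to each sampled token (the function names and the numbers are made up for illustration). It also shows one well-known wrinkle, which may or may not be among the reasons meant above: the raw sequence log-probability penalizes length, so a long confident caption can score below a short uncertain one unless you normalize.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of log-probabilities of the sampled tokens."""
    return sum(math.log(p) for p in token_probs)


def mean_token_log_prob(token_probs):
    """Length-normalized version: average log-probability per token."""
    return sequence_log_prob(token_probs) / len(token_probs)


long_confident = [0.9] * 10   # ten tokens, each sampled with p = 0.9
short_uncertain = [0.5]       # one token sampled with p = 0.5

# The raw score ranks the long confident caption lower ...
raw_prefers_short = (
    sequence_log_prob(long_confident) < sequence_log_prob(short_uncertain)
)
# ... while the per-token average ranks it higher.
normalized_prefers_long = (
    mean_token_log_prob(long_confident) > mean_token_log_prob(short_uncertain)
)
```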

Another, and probably most appealing, approach would be something along the lines of Bayesian Neural Networks, ensembles, or approximations with dropout, where the disagreement between the predictions of all submodels can be used.
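
One common way to turn "disagreement between submodels" into a number, sketched here under the assumption that each ensemble member (or each dropout forward pass) gives a probability vector over the same classes: take the entropy of the averaged prediction and subtract the average entropy of the individual predictions. This is a generic measure, not necessarily the one any particular paper uses, and the function names are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def disagreement(member_probs):
    """Entropy of the mean prediction minus the mean entropy of the members.

    Close to zero when all members predict the same distribution,
    and larger when confident members contradict each other.
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[i] for m in member_probs) / n for i in range(k)]
    expected = sum(entropy(m) for m in member_probs) / n
    return entropy(mean) - expected


agree = disagreement([[0.9, 0.1], [0.9, 0.1]])   # members agree
split = disagreement([[0.9, 0.1], [0.1, 0.9]])   # members contradict
```

Note that the averaged prediction in the second case is [0.5, 0.5], the same as what a single maximally unsure model would output; the subtraction is what separates "members disagree" from "every member is unsure".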

3 comments · 2 top-level

kastnerkyle · 10y ago · 1 in thread

X. Zhang and Y. LeCun very recently published an article [1] about using this technique for regularization. We aren't talking specifically about overfitting in this case, but rather about a lack of diversity / size in the training data; still, it seems like this kind of thing might help for a variety of tasks.

One more idea would be to have a (non-differentiable / REINFORCE?) penalty based on sentence likelihood using doc2vec or skip-thoughts to avoid the "blah and a blah on a blah" type errors that seem to be common in captioning.

One more would be to use TV / YouTube captions, but that data is extremely noisy, even more so than the COCO captions, unfortunately!

[1] http://arxiv.org/abs/1511.03719

emcq · 10y ago

I get excited whenever we can improve performance by manipulating the training set: data transformation tricks (inducing variances like rotations, translations, whitening [1], etc.), labeling tricks (such as those from LeCun or [2]), or including information/learnings/regularizations from external corpora, as with doc2vec.

It feels like getting something for free :) Of course there is a limit to how much signal you can extract from a noisy dataset, but the amount of time and human energy invested into creating and improving datasets can be quite large relative to finding another cool trick that can improve performance.

However, I wonder which will come first to make these perceptual systems "robust" for the average Joe's real-world uses: a large, well-labeled dataset, or more transformations and semi-supervised learning approaches?

[1] http://www.cs.stanford.edu/~acoates/papers/coatesleeng_aista...

[2] http://cseweb.ucsd.edu/~elkan/posonly.pdf

singhrac · 10y ago

Can you explain the various reasons for not using a probability baseline (i.e., assign a label only if the model is pretty confident, otherwise say nothing)? It's what we're being taught right now in ML classes, so I would like to fix my understanding.
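
For concreteness, the baseline being asked about can be sketched as below. The function name and the 0.8 threshold are arbitrary choices for illustration; the probabilities are assumed to come from a classifier's softmax output.

```python
def predict_or_abstain(probs, labels, threshold=0.8):
    """Return the top label if its probability clears the threshold, else None."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None


labels = ["cat", "dog"]
confident = predict_or_abstain([0.95, 0.05], labels)   # clears the threshold
unsure = predict_or_abstain([0.55, 0.45], labels)      # abstains
```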
