They train the model on only five frames, and then detect Batman in all the nearby frames. They say that with five frames they are able to get 500 new labelled frames.
This means 100 new frames per original frame. Because movies run at 24 frames per second, this in turn means that each original frame gives enough information for analyzing roughly 4 seconds of video on average.
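To spell out the arithmetic, here is a quick sketch. The only inputs are the numbers quoted above (5 seed frames, 500 new labelled frames, 24 fps); the variable names are mine.

```python
# Numbers quoted from the post; variable names are illustrative only.
labelled_seed_frames = 5
new_labelled_frames = 500
fps = 24  # standard cinema frame rate

frames_per_seed = new_labelled_frames / labelled_seed_frames
seconds_per_seed = frames_per_seed / fps

print(frames_per_seed)             # 100.0
print(round(seconds_per_seed, 2))  # 4.17
```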
As they show in the clips in the post, the short clips do indeed portray Batman in very similar positions, with similar shading, light, etc. The micro model is able to detect Batman because the frames are all very similar to one another. It is very likely that this micro model as is wouldn't be able to detect Batman in a completely different scene of the movie.
So the model is indeed overfitted, meaning that it is able to detect Batman only in a very specific set of data. Of course, overfitting can be done at different levels. They do not overfit to the point that they would be able to detect only the five original frames. The overfitting here stops while there is still scope for capturing new data with the overfitted model.
The smart idea of the authors is then to use these micro models to generate a lot of labeled data and "stitch" together the micro models, so that they end up with a much larger dataset to train on, and a much more general model.
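A toy sketch of how I picture the stitching step, not the authors' actual pipeline. Everything here is an assumption for illustration: each frame is reduced to a single made-up scalar feature, and a "micro model" just remembers the mean of its few seed frames and accepts anything close to it. The function names `train_micro_model` and `stitch_labels` are hypothetical.

```python
from statistics import mean

def train_micro_model(seed_frames, tolerance=0.1):
    """Toy micro model: accepts frames whose feature is within
    `tolerance` of the mean of the seed frames (hypothetical stand-in
    for a detector trained on a handful of labelled frames)."""
    centre = mean(seed_frames)
    return lambda frame: abs(frame - centre) <= tolerance

def stitch_labels(frames, micro_models):
    """Return indices of frames accepted by at least one micro model."""
    return [i for i, f in enumerate(frames)
            if any(model(f) for model in micro_models)]

# Hypothetical per-frame features: two distinct "scenes" plus an outlier.
frames = [0.10, 0.12, 0.15, 0.50, 0.52, 0.55, 0.90]

scene_a = train_micro_model([0.10, 0.12])  # seeds from the first scene
scene_b = train_micro_model([0.50, 0.52])  # seeds from the second scene

only_a = stitch_labels(frames, [scene_a])
labelled = stitch_labels(frames, [scene_a, scene_b])

print(only_a)    # frames covered by one micro model
print(labelled)  # frames covered after stitching both
```

Each micro model on its own only covers its own scene; the union of their outputs is the larger labelled set that a more general model could then be trained on.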
If overfitting is happening here then it wouldn't be beneficial. There is no reason to prefer that your model be better on the training set if you are going to use it to collect Batman images across a film. It would be better if your model wasn't overfit: if it performed better on your dataset, it would collect more images.
Huh? That's the opposite of the truth.
Compare https://en.wikipedia.org/wiki/Overfitting :
> In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".
> The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
Procedural concerns are not part of the concept. Conceptually, overfitting means including information in your model that isn't relevant to the prediction you're making, but that is helpful, by coincidence, in the data you're fitting the model to.
But since that can't be measured directly, you measure overfitting through performance: the model does well on the data it was fitted to and noticeably worse on data it hasn't seen.
Is that happening when you train these micromodels? If not, I have a hard time seeing how it's overfitting because the model is still performing well for the data you train it on and use it on. If that is happening, then I don't see the benefit of it. A model that wasn't overfit would just do better at the task of collecting additional training data.
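To make "measured through performance" concrete, here is a minimal sketch with made-up data. The dataset, the two noisy training labels, and both toy models are my own assumptions, not anything from the post. The idea is only to compare accuracy on the data a model was fitted to against accuracy on data it hasn't seen.

```python
def accuracy(model, data):
    """Fraction of (x, y) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Made-up data: the true rule is y = 1 when x > 0.5.
# The training labels at x = 0.3 and x = 0.7 are deliberately noisy.
train = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
         (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
held_out = [(0.12, 0), (0.22, 0), (0.32, 0), (0.45, 0),
            (0.55, 1), (0.68, 1), (0.85, 1), (0.95, 1)]

def memoriser(x):
    """1-nearest-neighbour: reproduces the training labels, noise included."""
    _, nearest_y = min(train, key=lambda p: abs(p[0] - x))
    return nearest_y

def threshold(x):
    """Simple rule that ignores the noisy labels."""
    return int(x > 0.5)

for name, model in [("memoriser", memoriser), ("threshold", threshold)]:
    gap = accuracy(model, train) - accuracy(model, held_out)
    print(name, accuracy(model, train), accuracy(model, held_out), gap)
```

A large positive gap (great on the training set, worse on held-out data) is the usual symptom of overfitting. Whether the micro models show such a gap on the frames they are actually used on is exactly the question being asked here.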
I think the approach you're talking about makes sense - create a simple model rapidly and leverage it to get more training data which you can then use to refine the model to be better still. I just don't think the term "overfitting" describes that process well - unless I'm misunderstanding something.
Couldn't you use this logic to say that AlphaGo is overfit because it can only play Go, not chess?
That being said, narrow training sets are a great idea and this application looks great.
Anyway, I’ve been trying to think of how this could be used for text data, specifically NER, which generally requires a lot more semantic understanding of the input. Sadly it seems like there might not be much room for the ‘micro’ part of the micro models.
However, it is hard to imagine an actual application of the process. If I understand it correctly, the author suggests using a set of micro-models for annotating a dataset which is then used to train another model. The latter model can actually detect Batman in a general environment, i.e., it can generalize. However, enriching a training dataset by adding adjacent frames depicting Batman from the same movie will likely have limited usefulness when training an actual Batman detection (non-micro!) model. Or do I get the final application wrong?
In theory, what you are saying is correct: this method annotates data that is correlated with the original set. In practice, though, it is still quite useful. Having more ground truth to work with gives a lot more flexibility with things like sampling, testing your model, randomization, and training more robust versions of your model.
Does it matter if it's technically overfitting or not if everyone understands what their "one specific thing" is and how to "stitch" them together to get accurate results over some real-world problem space? (Conversely, people have to recognize the limitations.) Also, for "micro-model" as a word, I appreciate having neutral vocabulary to talk about a model that doesn't solve the whole problem space, but does work for some of it. As opposed to "overfit model" or "incomplete model", which seem to cast negative connotations on a concept which is potentially useful when properly applied. (Though an eventual consensus on vocabulary is likely necessary as the space matures...)
Later parts of the article introduced kick-off, iteration, and prototyping time as concrete benefits. Interested to see a follow-up addressing how micro-models fit into a general problem-solving pipeline. What's next in terms of speeding up the assembly-line process? Where do they fit into data-oriented programming on the whole?
To be fair, most of the industry are amateurs, but most people don’t write Medium posts and continue to argue their ignorance on HN.