Micro-models: purposefully overfit models that are good at one specific thing

(eric-landau.medium.com)

72 points · ulrikhansen54 · 4y ago · 30 comments

ALittleLight · 4y ago · 9 in thread

This doesn't really seem like "overfitting" as I understand the concept. This is more like training a model to do a specific task rather than a more general task. Overfitting would be if your model started to memorize the training data - which doesn't seem to be what they are talking about and doesn't seem like it would be very useful.

ghego1 · 4y ago

The way I see it, this is overfitting.

They train the model on only five frames, and then detect all close frames. They say that with five frames they are able to get 500 new labelled frames.

This means 100 new frames for each original frame. Because movies run at 24 frames per second, each original frame gives enough information to analyze roughly 4 seconds of video (100 / 24 ≈ 4.2) on average.

As they show in the clips in the post, the short clips do indeed portray Batman in very similar positions, with similar shading, light, etc. The micro model is able to detect Batman because the frames are all very similar to one another. It is very likely that this micro model as is wouldn't be able to detect Batman in a completely different scene of the movie.

So the model is indeed overfitted, meaning that it is able to detect Batman in a very specific set of data. Of course overfitting can be done at different levels. They do not overfit to the point that they would be able to detect only the five original frames. The overfitting here stops while there is still scope for capturing new data with the overfitted model.

The smart idea of the authors is then to use these micro models to generate a lot of labeled data and "stitch" together the micro models, so that they end up with a much larger dataset to train on, and a much more general model.
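
To make the labelling-and-stitching idea concrete, here is a toy sketch in Python. It is not the article's method: the article trains image models on video frames, whereas this stand-in treats each frame as a hypothetical 2-D feature vector and uses a simple distance threshold as the "micro-model". `MicroModel`, `stitch_labels`, `radius`, and the synthetic scenes are all assumptions made for illustration.

```python
import numpy as np


class MicroModel:
    """Toy stand-in for a micro-model: fires on frames close to its few seed frames."""

    def __init__(self, seed_frames, radius):
        # "Training" here is just remembering the mean of the seed frames.
        self.center = np.mean(seed_frames, axis=0)
        self.radius = radius

    def predict(self, frames):
        # A frame is positive if it lies within `radius` of the seed centre.
        dists = np.linalg.norm(frames - self.center, axis=1)
        return dists <= self.radius


def stitch_labels(micro_models, frames):
    """Label a frame positive if any micro-model fires on it."""
    votes = np.stack([m.predict(frames) for m in micro_models])
    return votes.any(axis=0)


rng = np.random.default_rng(0)

# Two hypothetical "scenes" (tight clusters in feature space) plus background frames.
scene_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
scene_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(100, 2))
background = rng.normal(loc=[10.0, -10.0], scale=0.1, size=(100, 2))
frames = np.vstack([scene_a, scene_b, background])

# Each micro-model sees only five seed frames from its own scene.
model_a = MicroModel(scene_a[:5], radius=1.0)
model_b = MicroModel(scene_b[:5], radius=1.0)

only_a = model_a.predict(frames)                    # covers scene A only
labels = stitch_labels([model_a, model_b], frames)  # covers both scenes

print("frames labelled by model A alone:", int(only_a.sum()))
print("frames labelled after stitching:", int(labels.sum()))
```

Each toy model is useless outside its own scene, which mirrors the point above, but the union of their labels covers more of the data than either does alone. That larger labelled set is what a more general model would then be trained on.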

ALittleLight · 4y ago

I agree that is what the post describes and that it could be a useful process. I don't think "overfitting" describes that process though. Overfitting describes increasing your performance on the training set to the extent that your model performs worse on the data it is used on.

If overfitting is happening here then it wouldn't be beneficial. There is no reason to prefer that your model be better on the training set if you are going to use it to collect Batman images across a film. It would be better if your model wasn't overfit: if it performed better on your dataset, it would collect more images.

elandau25 · 4y ago

Hi everyone, I wrote the article. I do consider this overfitting because we are training on these frames far more times than would normally be advised for the size of the training set, such that the error is essentially zero for these frames. The model performs well "out-of-sample" here, but only on out-of-sample data that is semantically close to the original training set. Besides, overfitting is defined procedurally, not by how well it performs. You could have an overfit model that just happens to perform well on some stuff it was not trained on; that doesn't change the fact that the model was overfit.

thaumasiotes · 4y ago

> Besides, overfitting is defined procedurally, not by how well it performs.

Huh? That's the opposite of the truth.

Compare https://en.wikipedia.org/wiki/Overfitting :

> In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".

> The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

Procedural concerns are not part of the concept. Conceptually, overfitting means including information in your model that isn't relevant to the prediction you're making, but that is helpful, by coincidence, in the data you're fitting the model to.

But since that can't be measured, instead, you measure overfitting through performance.

ALittleLight · 4y ago

I would think a clear sign of overfitting is if your training loss is decreasing while your validation loss is increasing. That tells me your model is learning to perform better on the training data at a cost of performing worse on more general data.

Is that happening when you train these micromodels? If not, I have a hard time seeing how it's overfitting because the model is still performing well for the data you train it on and use it on. If that is happening, then I don't see the benefit of it. A model that wasn't overfit would just do better at the task of collecting additional training data.

I think the approach you're talking about makes sense - create a simple model rapidly and leverage it to get more training data which you can then use to refine the model to be better still. I just don't think the term "overfitting" describes that process well - unless I'm misunderstanding something.
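
The diagnostic described in this comment (training loss still falling while validation loss rises) can be sketched as a small check over per-epoch loss curves. This is only an illustration: `overfitting_onset`, its `patience` parameter, and the loss values are made up, and nothing here comes from the article's actual training runs.

```python
def overfitting_onset(train_loss, val_loss, patience=2):
    """Return the last epoch before training and validation loss diverge.

    Divergence means `patience` consecutive epochs in which training loss
    fell while validation loss rose. Returns None if that never happens.
    """
    if len(train_loss) != len(val_loss):
        raise ValueError("loss curves must have the same length")

    streak = 0
    for epoch in range(1, len(train_loss)):
        train_down = train_loss[epoch] < train_loss[epoch - 1]
        val_up = val_loss[epoch] > val_loss[epoch - 1]
        streak = streak + 1 if (train_down and val_up) else 0
        if streak >= patience:
            return epoch - patience
    return None


# Hypothetical curves: validation loss bottoms out at epoch 2, then climbs.
train = [1.00, 0.70, 0.50, 0.35, 0.20, 0.10]
val_diverging = [1.10, 0.80, 0.65, 0.70, 0.80, 0.95]
val_healthy = [1.10, 0.80, 0.65, 0.60, 0.58, 0.57]

print(overfitting_onset(train, val_diverging))  # 2
print(overfitting_onset(train, val_healthy))    # None
```

The result depends entirely on what the validation set contains. If it holds only frames near the training frames, a micro-model may never trigger this check, and if it holds frames from unrelated scenes, it may trigger almost immediately.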

elanning · 4y ago

As I understand it, overfitting is low bias, but high variance. It's perfectly fitting 5 linear data points with a complex polynomial, when the underlying function was a line. Thus the polynomial doesn't generalize well to more data points not in the training set. Your model seems to be fitting points in the training set and the evaluation set just fine. Of course if different Batmans were in the evaluation set, it would suddenly be doing terribly, but you can pretty much do that to every machine learning model. It would violate a lot of the underlying assumptions of statistics and machine learning, e.g. i.i.d. data and evaluation and training sets being drawn from the same distribution. Your definition of overfitting thus seems more like transfer learning, in some sense.
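
The polynomial example in this comment is easy to reproduce with NumPy. The sketch below assumes an underlying line y = 2x + 1 with a little Gaussian noise. The slope, intercept, noise level, and test range are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five noisy samples of an underlying straight line (assumed: y = 2x + 1).
x_train = np.linspace(0.0, 1.0, 5)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.1, size=x_train.size)

# Noise-free points from the same line, some outside the training range.
x_test = np.linspace(-0.5, 1.5, 41)
y_test = 2.0 * x_test + 1.0


def mse(coeffs, x, y):
    """Mean squared error of a polynomial (highest power first) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


line = np.polyfit(x_train, y_train, deg=1)     # matches the true model class
quartic = np.polyfit(x_train, y_train, deg=4)  # passes through all 5 points

train_mse_line = mse(line, x_train, y_train)
train_mse_quartic = mse(quartic, x_train, y_train)
test_mse_line = mse(line, x_test, y_test)
test_mse_quartic = mse(quartic, x_test, y_test)

print(f"train MSE  line={train_mse_line:.2e}  quartic={train_mse_quartic:.2e}")
print(f"test  MSE  line={test_mse_line:.2e}  quartic={test_mse_quartic:.2e}")
```

The degree-4 fit drives training error to essentially zero because it interpolates the noise (low bias), and it pays for that with much larger error on new points, especially outside the training range (high variance).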

MillenialMan · 4y ago

Is that overfitting? I thought overfitting was explicitly when your model fits itself so well to the training data that its ability to generalise to your target environment begins to decline. The size of the training set and the performance on it don't really matter, all that matters is that performance in the specific environment you intend to deploy the model in starts dropping. In your case, you're training the models for a very narrow deployment target, and they remain competent.

Couldn't you use this logic to say that AlphaGo is overfit because it can only play Go, not chess?

malux85 · 4y ago

This is not overfitting. A fully overfit model is one that would only detect positives on the 5 or so frames that it was trained on, but this is clearly generalising well to unseen frames - many unseen frames - so yeah, like you say, it’s not overfit.

manojlds · 4y ago

Calling these overfit is like saying a microservice for users can add only the specific 10 users and no one else.

brainwipe · 4y ago · 1 in thread

I'm not sure this is overfitting so much as a very narrow training set. It's still generalising to inputs it hasn't seen. If it were really overfitted then it wouldn't work for any unseen frames and it would be learning the "noise". It's not learning noise, or else you'd get lots of false positives, such as dark areas in the frame that look a bit like Batman but aren't. The main reason you want to generalise is noise rejection (no mention of this in the article). I think the S/N ratio in a video is exceptionally high as the dataset is directly repeatable so the source of truth is exceptionally accurate.

That being said, narrow training sets are a great idea and this application looks great.

catillac · 4y ago

Overfitting isn’t binary. Plenty of cases where something has overfit a bit and has a hard time generalizing to inputs in the distribution far from the limited training set, but is good at things near its distribution but unseen before. That’s what’s happening here.

l-lousy · 4y ago · 1 in thread

Interesting article, this also seems like a form of knowledge distillation. There have been a lot of examples of people distilling an ensemble into a single model, maybe you could try that here directly by taking out the middle man (match their outputs directly instead of labeling data).

Anyway, I’ve been trying to think of how this could be used for text data, specifically NER, which generally requires a lot more semantic understanding of the input. Sadly it seems like there might not be much room for the ‘micro’ part of the micro models.
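
For readers unfamiliar with the distillation idea raised in the first paragraph of this comment ("match their outputs directly instead of labeling data"), here is a minimal NumPy sketch. Everything in it is hypothetical: the three linear "teachers" stand in for an ensemble of micro-models, and the student is a plain logistic regression trained with cross-entropy against the ensemble's averaged soft outputs rather than hard labels.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))  # unlabelled inputs

# Hypothetical ensemble: three fixed linear teachers with slightly different weights.
teacher_weights = np.array([[2.0, -1.0], [1.8, -1.2], [2.2, -0.8]])
teacher_probs = sigmoid(X @ teacher_weights.T)  # shape (400, 3)
soft_targets = teacher_probs.mean(axis=1)       # averaged ensemble output

# Student: one logistic regression fitted to the soft targets by gradient descent.
w = np.zeros(2)
learning_rate = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)
    # Gradient of the mean cross-entropy between soft targets and student output.
    grad = X.T @ (p - soft_targets) / len(X)
    w -= learning_rate * grad

student_probs = sigmoid(X @ w)
mean_gap = float(np.mean(np.abs(student_probs - soft_targets)))
print("student weights:", w)
print("mean |student - ensemble| probability gap:", mean_gap)
```

No hard labels are produced at any point. The student only ever sees the ensemble's probabilities, which is the "take out the middle man" variant suggested above.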

elandau25 · 4y ago

That's a good idea. It will likely be trickier to apply a method like this for text. Decomposition of the problem is less obvious than with vision tasks.

jogundas · 4y ago · 1 in thread

A nice example of overfitting!

However, it is hard to imagine an actual application of the process. If I understand it correctly, the author suggests using a set of micro-models for annotating a dataset which is then used to train another model. The latter model can actually detect Batman in a general environment, i.e., can generalize. However, enriching a training dataset by adding adjacent frames depicting Batman from the same movie will likely have limited usefulness when training an actual Batman detection (non-micro!) model. Or do I get the final application wrong?

elandau25 · 4y ago

Thanks! You have the application correct, but there are many ways in which we use this. One example is if you are trying to build models that require sequentially annotated images (like action recognition). Another is creating many micro-models that each only detect one type of object even though your general model will have to detect multiple objects.

In general, the theory of what you are saying is correct that this method annotates data that is correlated with the original set, but practically it is still quite useful. Having more ground truth to work with gives a lot more practical flexibility with things like sampling, testing your model, randomization, and training more robust versions of your model.

tomrod · 4y ago · 1 in thread

Neat concept. Suggestion to the author: show the out of sample fit stats and how the interpolation versus extrapolation regions are determined.

elandau25 · 4y ago

Thanks! I might do a follow-up article on the topic and will think about how to incorporate this in!

ruinar50 · 4y ago

Putting aside the detailed discussions on what exactly is "overfitting" for the moment, interested to hear more about the utility of micro-models in actual value delivery pipelines.

Does it matter if it's technically overfitting or not if everyone understands what their "one specific thing" is and how to "stitch" them together to get accurate results over some real-world problem space? (Conversely, people have to recognize the limitations.) Also, for "micro-model" as a word, appreciate having neutral vocabulary to talk about a model that doesn't solve the whole problem space, but does work for some of it. As opposed to "overfit model" or "incomplete model", which seem to cast negative connotations on a concept which is potentially useful when properly applied. (Though an eventual consensus on vocabulary is likely necessary as the space matures...)

Later parts of the article introduced kick-off, iteration, and prototyping time as concrete benefits. Interested to see a follow-up addressing how micro-models fit into general problem-solving pipeline. What's next in terms of speeding up the assembly-line process? Where do they fit into data-oriented programming on the whole?

robojoker · 4y ago

This is an approach that I have used when doing attribution. Given error signals of a larger system, I couldn’t get great performance attributing the errors to a particular broken component in the system. However, when I broke down that component into its set of particular issues and built a classifier per issue, I was able to get great performance. With the lightweight models we used, it was straightforward to automate most of the training / validation of these component-issue specific models and decommission them when the issue no longer existed (a fix was put in).

underaxon · 4y ago

I may be wrong but I think this is what kernel methods (e.g. SVM) do, right? So this looks like a (deep) SVM where the kernels are small NNs.

klysm · 4y ago

I think the important piece missing from the headline is that these micro models are combined in ensemble-like fashion. Because of that I wouldn’t really call it overfitting per se - more of a very restricted space to care about.

abz10 · 4y ago

Nothing new was discovered here and the key terminology is used incorrectly.

To be fair, most of the industry are amateurs, but most people don’t write Medium posts and continue to argue their ignorance on HN.
