undefined | Better HN

0 pointssnowstormsun2y ago0 comments

I didn't mean to insult anyone. The idea of not knowing the actual performance of the model just intuitively seems to me like it's a bit of a gamble. I have only trained models in a scientific context before, where this was never an option.

0 comments

4 comments · 1 top-level

DougBTX2y ago· 3 in thread

Here's another way to look at it. The test set is an approximation for how the model will perform against production data, but the actual performance of the model is how it performs for actual end-users. So real _actual_ results are always unknown util after the fact. Given that, if the metrics from training clearly show that more data == better model, and there's no reason to expect that trend to reverse, then the logical thing to do is maximise the data used for training to get the best results for actual production data.

Doing this does complicate decisions for releasing subsequent model updates, as the production model can't be directly compared against new iterations any more. Instead a pre-production model would need to be used, that has not seen the test set. However, if data drift is likely, then re-using the old test set wouldn't be useful anyway.

lumost2y ago

Another way of thinking about it. If training on all the data yields a model which is functionally 5% better in online metrics, which would not be uncommon in a pareto distributed traffic pattern - then any subsequent partitioned model would likely perform worse than the prod model.

More complication arises when users expect that things which worked previously in one way - continue working in this way. Users don't really care that their traffic was in the test set. In an even more extreme case, many industrial problems have a high correlation between the traffic today and the traffic next week, An optimal solution for such a situation would be to complete a full memorization today's traffic and use that for next week. In many cases, an overfit model can effectively perform this memorization task with fewer parameters/infrastructure than an actual dictionary lookup.

nightski2y ago

You act like training is this pre-set process you just "do". That's not the case, you train until you reach desired performance on the test set. If you don't have a test set how do you know when to stop training and avoid overfitting?

baobabKoodaa2y ago

You're confusing training epochs with dataset size.

I'm simplifying now, but you can think of epochs as "how many times we train over the entire dataset? 1 time? 10 times?"

Correspondingly, you can think of dataset size as "how many Wikipedia pages we include in the dataset? 1 million? 10 million?"

Now let's think about overfitting.

What happens when you increase epochs is the model is more likely to overfit your data.

What happens when you increase dataset size is the model is less likely to overfit your data.

1 more reply

j / k navigate · click thread line to collapse