
0 points · djsbshek · 5y ago · 0 comments

My main concerns with the imbalance are that the negative class is undersampled relative to the positive class, and that performance on the test splits is overestimated. I can buy that you may want to train on a balanced dataset, but the test condition should reflect the true case distribution as closely as possible.
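To put a rough number on the overestimation concern, here is a minimal sketch. The 90% sensitivity and specificity and the 1% prevalence are made-up illustrative values, not figures from the study being discussed, and `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Precision (PPV) when a fraction `prevalence` of the cases is positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumed, illustrative classifier: 90% sensitivity and 90% specificity.
balanced = precision_at_prevalence(0.9, 0.9, 0.5)    # 50/50 test split
realistic = precision_at_prevalence(0.9, 0.9, 0.01)  # assumed 1% true prevalence

print(f"precision on a balanced test set: {balanced:.3f}")   # 0.900
print(f"precision at 1% prevalence:       {realistic:.3f}")  # 0.083
```

Sensitivity and specificity are the same in both cases; only the class mix of the test set changes, and that alone moves precision from 0.9 to roughly 0.08.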

I agree that you would not want to use only the class priors for prediction. However, I do not think it is clear that you would want to throw that information out. I am also not sure I agree with the statement that a neural network has “no memory” of the prior class distribution. That is a strong claim to make about something as opaque as a neural network.
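One standard way to keep the prior information without predicting from it alone is to rescale the model's output with Bayes' rule. The sketch below assumes the model outputs a reasonably calibrated probability and that only the class prior differs between training and deployment; `adjust_for_prior` is a hypothetical helper name and the numbers are illustrative.

```python
def adjust_for_prior(p_train, train_prior, true_prior):
    """Rescale a positive-class probability from the training prior to the true prior.

    Assumes the class-conditional feature distributions are the same in
    training and deployment, so only the class mix has shifted.
    """
    pos = p_train * true_prior / train_prior
    neg = (1.0 - p_train) * (1.0 - true_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


# A model trained on balanced data says 0.9; the assumed true prevalence is 1%.
adjusted = adjust_for_prior(0.9, train_prior=0.5, true_prior=0.01)
print(f"adjusted probability: {adjusted:.3f}")  # 0.083
```

When the two priors are equal the function returns its input unchanged, so the correction only matters when the training mix differs from the deployment mix.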



nil-sec · 5y ago

They could have used all negative samples for testing (and even for training, if they had done it better), yes. But once your test set is large enough, whatever that means, it's not that relevant anymore. They are "undersampling" anyway by not recording data from every human who is negative right now.

And no, it's not a strong claim to make. Of course the network learns the distribution of your training set; that's why you want it balanced. But across successive inference calls the weights do not change: it has no state. So it cannot, for example, remember that it just predicted 90% negative and decide that it is time for a positive prediction again.
