
0 points · gitgud · 2y ago · 0 comments

Is it possible that some LLMs are trained on these benchmarks? That would mean they’re overfitting and incorrectly ranked. Or am I misunderstanding these benchmarks?…


24 comments · 8 top-level

FanaHOVA · 2y ago · 7 in thread

Presented with no comment :) https://twitter.com/chhillee/status/1635790330854526981?s=46...

lumost · 2y ago

Having worked on ML products, there is sometimes debate on whether you should train on the test partition prior to prod deployment - after all, why would you ship a worse model to prod? Obviously, once you do that you can't tell whether the model generalizes better than an alternative technique, and you also incur some overfitting risk. But many industrial problems are solvable through memorization.

sangnoir · 2y ago

> after all, why would you ship a worse model to prod?

...because you need a control to evaluate how well your product is doing? I know it's a young field, but boy, do some folk love removing the "science" from "data science"

baobabKoodaa · 2y ago

You can evaluate a version of the model that has been trained on one set of data, and ship to production a different model that has been trained on the complete set of data. In many cases one can reasonably infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

I'm not claiming that's what happened here, nor am I interested in nitpicking "what counts as 'science'". I'm just saying this is a reasonable thing to do.
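
A minimal sketch of the evaluate-then-refit workflow described above, assuming a toy one-dimensional least-squares model and synthetic data. The helpers `fit` and `mse`, the 150/50 split, and the data itself are illustrative assumptions, not anything taken from the thread.

```python
import random

def fit(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of a fitted (a, b) model on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in xs]
train, test = data[:150], data[150:]

# Step 1: the evaluated model never sees the test partition.
eval_model = fit(train)
reported_error = mse(eval_model, test)

# Step 2: the shipped model uses the same recipe on all of the data.
# No unseen data is left, so its own error cannot be measured offline.
ship_model = fit(data)

print(f"held-out MSE of evaluated model: {reported_error:.3f}")
print(f"shipped model (slope, intercept): {ship_model}")
```

The reported number belongs to `eval_model`; the claim that `ship_model` is at least as good is an inference, not a measurement, which is the point of disagreement in this subthread.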


janalsncm · 2y ago

There are offline metrics and online metrics. Offline metrics might be something like AUROC on a test set. Once you’ve pushed the model online, you can check the online metrics. Ultimately the online metrics are more important; they’re the whole reason the model exists in the first place.

Your control in an online environment is the current baseline. You don’t need to keep the test set held out anymore; you can push the model online and test it directly.
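
For readers unfamiliar with the offline metric mentioned above, here is a small sketch of AUROC computed from its pairwise definition. The function name `auroc` and the example labels and scores are made up for illustration; a real project would more likely use a library implementation.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test-set labels and model scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]
print(auroc(labels, scores))
```

This quadratic version is only meant to show what the number means; it compares every positive against every negative, so it is too slow for large test sets.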

snowstormsun · 2y ago

Why would you want to ship an untested model? That's insane.

baobabKoodaa · 2y ago

This is a common approach, for example, in data science competitions. Why? Well, if you want to maximize the model's abilities, this is what you have to do. (Not saying Llama 2 is released like this; it probably isn't)


sundarurfriend · 2y ago

Nitter link: https://nitter.net/chhillee/status/1635790330854526981/

famouswaffles · 2y ago · 4 in thread

Test leakage is not impossible for some benchmarks. But researchers try to avoid/mitigate that as much as possible for obvious reasons.

pclmulqdq · 2y ago

Given all of the times OpenAI has trained on people's examples of "bad" prompts, I am sure they are fine-tuning on these benchmarks. It's the natural thing to do if you are trying to position yourself as the "most accurate" AI.

famouswaffles · 2y ago

Assuming they were doing that, fine-tuning on benchmarks isn't the same as test leakage/testing on training data. No researcher is intentionally training on test data.

If it performs about as well on instances it has never seen before (test set), then it's not overfit to the test.

nightski · 2y ago

I'm confused: fine-tuning is training. How is that not leakage? I'm hesitant to call them researchers; they are employees of a for-profit company trying to meet investor expectations.


TX81Z · 2y ago

“No researcher is intentionally training on test data.”

Citation Needed.

bbor · 2y ago · 4 in thread

It would be a bit of a scandal, and IMO too much hassle to sneak in. These models are trained on massive amounts of text - specifically anticipating which metrics people will care about and generating synthetic data just for them seems extra.

But I'm not an expert or OP!

stu2b50 · 2y ago

I don't think it's a scandal; it's a natural thing that happens when iterating on models. OP doesn't mean they literally train on those tests, but that as a meta-consequence of using those tests as benchmarks, you will adjust the model and hyperparameters in ways that perform better on those tests.

For a particular model you try to minimize this by keeping separate validation and test sets, but on a meta-meta level, it's easy to see it happening.
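
A rough simulation of that meta-level effect, under a deliberately extreme assumption: every model variant has no real skill and answers each question by coin flip. The constants `N_QUESTIONS` and `N_VARIANTS` and the helper `random_guesser_score` are hypothetical, chosen only to make the selection effect visible.

```python
import random

rng = random.Random(0)
N_QUESTIONS = 100   # size of the public benchmark (assumption)
N_VARIANTS = 200    # hyperparameter settings tried against it (assumption)

def random_guesser_score(n_questions):
    """Accuracy of a variant with no real skill: each answer is a coin flip."""
    return sum(rng.random() < 0.5 for _ in range(n_questions)) / n_questions

# Every variant is scored on the same benchmark and the best one is kept.
benchmark_scores = [random_guesser_score(N_QUESTIONS) for _ in range(N_VARIANTS)]
best_on_benchmark = max(benchmark_scores)

# The kept variant is still a coin flipper, so on unseen questions
# its accuracy is just another draw centered on 50%.
fresh_score = random_guesser_score(10_000)

print(f"best benchmark accuracy after selection: {best_on_benchmark:.2f}")
print(f"accuracy of that variant on fresh questions: {fresh_score:.2f}")
```

Nothing here trains on the benchmark, yet the selected score sits well above chance. Real variants do have skill, so the inflation is smaller in practice, but the mechanism is the same.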

jasonfarnon · 2y ago

You don't see an engineer at an extremely PR-conscious company at least checking how their model performs on popular benchmarks before rolling it out? And if its performance is lackluster, do you really see them doing nothing about it? It probably doesn't make a huge difference anyway. I know those old vision models were overfitted to the standard image library benchmarks, but they were still very impressive.

fbdab103 · 2y ago

Famously, some of the image models were so overtrained they could still yield impressive results if the colors were removed.

lumost · 2y ago

This wasn't so much overtraining, as the models learning something different than what we expected. If you look at a pixel by pixel representation of an image, textures tend to be more significant/unique patterns than shapes. There are some funny studies from the mid 2010s exploring this.

moneywoes · 2y ago · 1 in thread

How would it even be possible to verify that?

mdp2021 · 2y ago

"Verify" is quite a demand.

To "corroborate", you find queries of the same difficulty that a well-performing model would answer satisfactorily but a faulty, overfitted model would fail.

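One way to picture that corroboration step is to compare accuracy on the published benchmark items with accuracy on fresh items of the same kind. This is only a toy sketch: the two "models", the arithmetic questions, and the name `contamination_gap` are all hypothetical stand-ins.

```python
def accuracy(model, items):
    """Fraction of (question, answer) pairs the model answers exactly."""
    return sum(model(q) == a for q, a in items) / len(items)

# Published benchmark items and fresh items of the same difficulty (made up).
benchmark = [("2+3", "5"), ("7+1", "8"), ("4+4", "8"), ("6+2", "8")]
fresh = [("3+2", "5"), ("1+7", "8"), ("5+3", "8"), ("2+6", "8")]

memorized = dict(benchmark)

def memorizing_model(question):
    """Stand-in for an overfitted model: it only knows the benchmark items."""
    return memorized.get(question, "?")

def adding_model(question):
    """Stand-in for a model that actually learned the task."""
    left, right = question.split("+")
    return str(int(left) + int(right))

def contamination_gap(model):
    """Benchmark accuracy minus fresh accuracy; a large gap is a warning sign."""
    return accuracy(model, benchmark) - accuracy(model, fresh)

print(contamination_gap(memorizing_model))
print(contamination_gap(adding_model))
```

A large gap corroborates contamination without proving it, since the fresh items could simply be harder than intended.
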
stevefan1999 · 2y ago

Unfortunately, Goodhart's law applies to most kinds of tests:

> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

iambateman · 2y ago

This is SAT-prep in a nutshell. :)

sp332 · 2y ago

Yeah, it happens. https://hitz-zentroa.github.io/lm-contamination/blog/

option · 2y ago

That’s why OpenAI didn’t release any details on GPT-4's training data blend ;)
