Sure, but your sibling comment is essentially in agreement with what I said about the difficulties of deduplication, and this is in direct contention with your initial response to me:
> many serious submissions from serious groups for benchmarks like this check for contamination to specifically avoid the problem you’re suggesting
This despite my explicitly linking a serious work (HumanEval is still a commonly used dataset[0]), plus a second work demonstrating that roughly half of LAION consists of duplicates. Together they show that hashing and exact dedupe aren't a great way to actually get the desired outcome.
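To make that concrete with a toy example of my own (not from either paper): exact-match dedupe is brittle against even the most trivial edits.

    import hashlib

    def exact_hash(code):
        # Exact dedupe: hash the raw bytes, so any difference at all
        # yields a different hash.
        return hashlib.sha256(code.encode()).hexdigest()

    a = "return [x for x in strings if substring in x]"
    b = "return [s for s in strings if substring in s]"
    print(exact_hash(a) == exact_hash(b))  # False, yet clearly the same code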
I'm not sure I actually buy your strong claim about "serious groups," considering what we find in works like GPT-4.
Here's a quote from the GPT-4 technical report (Appendix C): https://arxiv.org/abs/2303.08774
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
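Read literally, that check amounts to something like this (my own sketch of the described procedure; the names and the exact normalization are assumptions on my part):

    import random
    import re

    def normalize(text):
        # "removing all spaces and symbols, keeping only characters (including numbers)"
        return re.sub(r'[^A-Za-z0-9]', '', text)

    def is_contaminated(eval_example, train_example, k=3, n=50):
        ev, tr = normalize(eval_example), normalize(train_example)
        if len(ev) <= n:
            return ev in tr  # use the whole example if shorter than 50 chars
        # sample three 50-char substrings; any hit flags the example
        starts = random.sample(range(len(ev) - n + 1), min(k, len(ev) - n + 1))
        return any(ev[s:s + n] in tr for s in starts)

Anything beyond a verbatim copy (renamed variables, a translated docstring, light paraphrase) sails straight through a check like that.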
This does not seem like a serious decontamination effort, and it should not give anyone confidence that the data is actually properly deduplicated/decontaminated. Especially when it is followed by:
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
Not to mention the explicit reasons given by your sibling comment and in my original comment. So I hope this clarifies what I mean by "handwavy trust claims," especially when I'm asking for explicit information about how this particular niche tends to do things. I'm left unconvinced that there is any model that doesn't include significant amounts of contamination.
But I'll give credit to the authors of this work: there is a comment in the main thread (@deepseekfake) that acknowledges the spoilage. Though this is not mentioned in the repo, and there is, as you may notice, no paper linked.
[0] My gripe with HumanEval is that it is mostly an OpenAI work, and the claim is that there is no contamination because they wrote the questions by hand. But you can literally take snippets from the dataset, paste them into GitHub search, limit the search to before 2021, and find results.
Here's a blame from 9 years ago: https://github.com/bertomartin/stat4701/blame/ec2b64f629cbbf...
Matching question: #4 https://huggingface.co/datasets/openai_humaneval?row=4
Are these exact? No. Are they the same? Yes. Granted, GitHub search isn't very good, but I didn't write the paper and this was only a very quick search. We should be unsurprised to find dupes given the nature of the questions, and anyone who writes a line like that and thinks no one else has ever written the exact same line is fooling themselves. If you cared about variable names, you'd use something crazily unique to guarantee it.

But if we're talking about semantic similarity (which we should be, considering the augmentations that go into training data), then yeah, I think it is worth questioning someone's legitimacy if they think snippets like these are nowhere to be found on GitHub: "return number % 1.0 ", "return [x for x in strings if substring in x]", "return ' '.join([str(x) for x in range(n + 1)])", "return len(set(string.lower()))", "return len(string)", "for i in reversed(range(n)): if n % i == 0: return i", "return ''.join(strings)". I'm not even really cherry-picking, other than looking for short answers so I can easily post them here.
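And catching these "not exact, but the same" cases doesn't require anything fancy. A toy canonicalization that just renames identifiers positionally already matches them (again my own sketch, not a method from any of the linked papers):

    import io
    import keyword
    import tokenize

    def canonicalize(code):
        # Replace each identifier with a positional placeholder so snippets
        # differing only in variable names compare equal. Real near-dup
        # detection (n-grams, MinHash, AST matching) goes further than this.
        names, out = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                out.append(names.setdefault(tok.string, "v%d" % len(names)))
            elif tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING):
                out.append(tok.string)
        return " ".join(out)

    a = "for i in reversed(range(n)):\n    if n % i == 0: return i\n"
    b = "for j in reversed(range(num)):\n    if num % j == 0: return j\n"
    print(canonicalize(a) == canonicalize(b))  # True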