Sure, but your sibling comment is essentially in agreement with what I said about the difficulties of deduplication, and this is in direct contention with your initial response to me:
> many serious submissions from serious groups for benchmarks like this check for contamination to specifically avoid the problem you’re suggesting
This despite my explicitly linking a serious work (HumanEval is still a commonly used dataset[0]), plus a second work demonstrating that roughly half of LAION consists of duplicates. Together they show that hashing and exact dedupe aren't a great way to actually get the desired outcome.
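To make that concrete with a toy example of my own (not from either paper): exact-match dedupe is brittle against even the most trivial edits.

    import hashlib

    def exact_hash(code):
        # Exact dedupe: hash the raw bytes, so any difference at all
        # yields a different hash.
        return hashlib.sha256(code.encode()).hexdigest()

    a = "return [x for x in strings if substring in x]"
    b = "return [s for s in strings if substring in s]"
    print(exact_hash(a) == exact_hash(b))  # False, yet clearly the same code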
I'm not sure I actually buy your strong claim about "serious groups," considering what we find in works like GPT-4.
Here's a quote from the GPT-4 technical report (Appendix C): https://arxiv.org/abs/2303.08774
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
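Read literally, that check amounts to something like this (my own sketch of the described procedure; the names and the exact normalization are assumptions on my part):

    import random
    import re

    def normalize(text):
        # "removing all spaces and symbols, keeping only characters (including numbers)"
        return re.sub(r'[^A-Za-z0-9]', '', text)

    def is_contaminated(eval_example, train_example, k=3, n=50):
        ev, tr = normalize(eval_example), normalize(train_example)
        if len(ev) <= n:
            return ev in tr  # use the whole example if shorter than 50 chars
        # sample three 50-char substrings; any hit flags the example
        starts = random.sample(range(len(ev) - n + 1), min(k, len(ev) - n + 1))
        return any(ev[s:s + n] in tr for s in starts)

Anything beyond a verbatim copy (renamed variables, a translated docstring, light paraphrase) sails straight through a check like that.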
This does not seem like a serious decontamination effort, and it should not give anyone confidence that the data is actually properly deduplicated/decontaminated. Especially when it is followed by:
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
Not to mention the explicit reasons given by your sibling comment and in my original comment. So I hope this clarifies what I mean by "handwavy trust claims," especially when I'm asking for explicit information about how this particular niche tends to do things. I'm left unconvinced that there is any model that doesn't include significant amounts of contamination.
But I'll give credit to the authors of this work: there is a comment in the main thread (@deepseekfake) that acknowledges the spoilage. Though this is not mentioned in the repo, and there is, as you may notice, no paper linked.
[0] My gripe with HumanEval is that it is mostly an OpenAI work, and the claim is that there is no contamination because they wrote the questions by hand. But you can literally take snippets from the dataset, paste them into GitHub search, limit the search to before 2021, and find results.
Here's a blame from 9 years ago: https://github.com/bertomartin/stat4701/blame/ec2b64f629cbbf...
Matching question: #4 https://huggingface.co/datasets/openai_humaneval?row=4
Are these exact? No. Are they the same? Yes. Granted, GitHub search isn't very good, but I didn't write the paper and this was only a very quick search. We should be unsurprised to find dupes given the nature of the questions, and anyone who writes a line like that and thinks no one else has ever written the exact same line is fooling themselves. If you cared about variable names, you'd use something crazily unique to guarantee it.

But if we're talking about semantic similarity (which we should be, considering the augmentations that go into training data), then yeah, I think it is worth questioning someone's legitimacy if they think snippets like these are nowhere to be found on GitHub: "return number % 1.0 ", "return [x for x in strings if substring in x]", "return ' '.join([str(x) for x in range(n + 1)])", "return len(set(string.lower()))", "return len(string)", "for i in reversed(range(n)): if n % i == 0: return i", "return ''.join(strings)". I'm not even really cherry-picking, other than looking for short answers so I can easily post them here.
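And catching these "not exact, but the same" cases doesn't require anything fancy. A toy canonicalization that just renames identifiers positionally already matches them (again my own sketch, not a method from any of the linked papers):

    import io
    import keyword
    import tokenize

    def canonicalize(code):
        # Replace each identifier with a positional placeholder so snippets
        # differing only in variable names compare equal. Real near-dup
        # detection (n-grams, MinHash, AST matching) goes further than this.
        names, out = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                out.append(names.setdefault(tok.string, "v%d" % len(names)))
            elif tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING):
                out.append(tok.string)
        return " ".join(out)

    a = "for i in reversed(range(n)):\n    if n % i == 0: return i\n"
    b = "for j in reversed(range(num)):\n    if num % j == 0: return j\n"
    print(canonicalize(a) == canonicalize(b))  # True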