>To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.
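The quoted filtering rule can be sketched roughly as follows. This is a hypothetical reconstruction, not the paper's actual code: the tokenization (whitespace split) and all function names are my assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts, n=10, min_n=3):
    """Collect 10-grams from benchmark texts; keep short (3-9 gram) texts whole."""
    index, short_exact = set(), set()
    for text in benchmark_texts:
        tokens = text.split()
        if len(tokens) >= n:
            index |= ngrams(tokens, n)
        elif len(tokens) >= min_n:
            # Benchmark texts shorter than 10 grams but >= 3 grams: exact match.
            short_exact.add(tuple(tokens))
    return index, short_exact

def is_contaminated(segment, index, short_exact, n=10):
    """True if the segment shares any 10-gram with a benchmark, or exactly
    contains one of the short benchmark strings."""
    tokens = segment.split()
    if ngrams(tokens, n) & index:
        return True
    return any(
        tuple(tokens[i:i + len(s)]) == s
        for s in short_exact
        for i in range(len(tokens) - len(s) + 1)
    )
```

Any training segment for which `is_contaminated` returns `True` would be dropped from the corpus.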
However, detecting benchmark contamination is difficult, and n-gram matching is often insufficient. See https://arxiv.org/pdf/2311.04850.pdf for some examples of how this approach can fail.
In general, if a benchmark is available online before a model's dataset is collected, I put very little stock in that model's performance on that benchmark. It's just too hard to know what's a true improvement and what's contamination. That's especially true for a paper like this that specifically hunts down MATH-like data.
Unfortunately I'm just not aware of any metric that can adequately quantify meaningful similarity between data. Curse of dimensionality, I suppose. Personally I try not to lean too hard on benchmark results, not only because of the aforementioned spoilage, but also because of metric limitations. I think our progress has outpaced our ability to properly measure it, and it feels like we've only become more reliant on benchmarks rather than more nuanced in our evaluations (am I alone in this?). I wonder if this will create a stall or plateau (or even reversal) in practical performance as our measurements become less meaningful while quality increases. I'm in vision, so a good example is how common it is to treat the L2 distance between features at some layer of a classification network (even one better than InceptionNet) as an accurate measurement of visual fidelity. Or to think we have such metrics even in special cases (I guess PSNR and SSIM are the closest, but those are more accurately described as measures of reconstruction quality).
Btw, I think you might like the second paper I linked. It's a Meta/Stanford paper and mostly deals with vision (LAION) but also a bit with C4. The short of it is that they can prune about 40% of LAION and still get good "zero-shot" ImageNet accuracy. I actually found the results for random pruning quite enlightening, especially around all the toy datasets (Fig. A4).
Zero-shot in quotes because it's pretty dubious to call ImageNet out of distribution (same with COCO) when a model is trained on LAION, considering all the classes are present (at least an abstracted version of each class, since LAION captions are more specific, i.e. ImageNet _distribution_ ⊂ LAION _distribution_).
Another pet peeve of mine is arxiv links that go straight to the PDF ;)
Interesting what's unsupported:
- In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
- For military use in any way;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
- To generate or disseminate inappropriate content subject to applicable regulatory requirements;
- To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
- To defame, disparage or otherwise harass others;
- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
As a dumb Army guy, if I were doing military research I would just keep it on my private military internet that does not exist for non-military users.
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
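The quoted procedure is simple enough to sketch. Again, this is a hypothetical reconstruction under my own naming and assumptions (regex-based normalization, a seeded RNG), not OpenAI's actual code:

```python
import random
import re

def normalize(text):
    """Remove all spaces and symbols, keeping only letters and digits,
    as the quoted procedure describes."""
    return re.sub(r'[^0-9A-Za-z]', '', text)

def sample_substrings(example, k=3, length=50, rng=None):
    """Randomly select k substrings of `length` characters from the
    normalized example (or the whole example if it's shorter)."""
    rng = rng or random.Random(0)
    norm = normalize(example)
    if len(norm) <= length:
        return [norm]
    starts = [rng.randrange(len(norm) - length + 1) for _ in range(k)]
    return [norm[s:s + length] for s in starts]

def matches_training(example, training_docs_normalized, **kw):
    """A match: any sampled evaluation substring occurs inside a
    processed (normalized) training document."""
    subs = sample_substrings(example, **kw)
    return any(s in doc for s in subs for doc in training_docs_normalized)
```

Evaluation examples for which `matches_training` returns `True` would be flagged as contaminated, discarded, and the evaluation rerun. Note that the random sampling means a contaminated example can be missed if the sampled substrings happen to fall on unmatched spans.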
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
These are not great at building confidence that OpenAI does not have spoilage. Given what we know about the dedupe process (even from early 2023), this is not enough to purge contamination. Exact string matching has been the de facto method for quite some time, and for quite some time we've known that it has issues. It's just that five years ago these issues weren't as critical as they are today, because performance was much lower back then.
I think this is self-conflicting. If the evaluation is proprietary then it is most certainly not reputable. We'd want open metrics where we can analyze the limitations. Of course, we'd need open data too, but that's exceptionally rare these days. Plus, a metric isn't really going to tell us whether we have spoilage or not. You can get some evidence of spoilage through a trained model, but it is less direct, fuzzier, and tells us more about what information the model was able to memorize than about whether the data was spoiled.