Proving Test Set Contamination in Black-Box Language Models

Authors: Yonatan Oren, Nicole Meister, Niladri S. Chatterji, Faisal Ladhak, Tatsunori Hashimoto

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit four popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination. In this work, we demonstrate that our test is effective for detecting many common forms of test set contamination. We begin by training a 1.4 billion parameter language model on a corpus consisting of both Wikipedia and a known collection of exchangeable test sets. These canaries serve as positive controls for our test, and our goal will be to flag as many of these as possible. Having validated the test in a setting with known contamination, we then explore its use with existing open models. (An illustrative sketch of this ordering test appears after the table.)
Researcher Affiliation | Academia | Yonatan Oren (Stanford University), Nicole Meister (Stanford University), Niladri Chatterji (Stanford University), Faisal Ladhak (Columbia University), Tatsunori B. Hashimoto (Stanford University); yonatano@cs.stanford.edu, {nmeist, niladric, thashim}@stanford.edu, faisal@cs.columbia.edu
Pseudocode | Yes | Algorithm 1: Sharded Rank Comparison Test. (A hedged sketch of a sharded ordering test follows the table below.)
Open Source Code | Yes | To encourage the development of new provable guarantees for test set contamination, we release our pretrained models as a benchmark for developing future statistical tests (https://github.com/tatsu-lab/test_set_contamination).
Open Datasets | Yes | We derive 10 test sets from numerous standard datasets (BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), OpenBookQA (Mihaylov et al., 2018b), MNLI (Williams et al., 2018), Natural Questions (Kwiatkowski et al., 2019a), TruthfulQA (Lin et al., 2022), PIQA (Bisk et al., 2019), MMLU (Hendrycks et al., 2021)).
Dataset Splits | No | The paper mentions training a 1.4 billion parameter GPT-2 model from scratch, but does not explicitly state that a validation split was used during this training. A validation set is used to tune a baseline method (Min-k% Prob), but not for the authors' own model's training or evaluation.
Hardware Specification | Yes | We trained the model using Levanter on a v3-128 TPU instance on Google Cloud for 1.5 days (Hall et al., 2023).
Software Dependencies | No | The paper mentions 'Levanter' and the 'GPT-2 architecture' but does not specify version numbers for these or other software components.
Experiment Setup | Yes | We use a GPT-2 architecture with 1.4B parameters, with the architecture hyperparameters given by a hidden dimension of 1536, 24 heads, 48 layers, and a sequence length of 2048. The training batch size was 256. Based on the number of training tokens, sequence length, and training batch size, we trained this model for 46000 steps so as to consume the tokens in our mixture datasets exactly once. The model was optimized using AdamW with a learning rate of 1e-4 and weight decay of 0.1. (The reported settings are collected in a configuration sketch after the table.)
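
The ordering-based test summarized in the Research Type row can be illustrated with a short permutation test. The sketch below is illustrative only, not the authors' released code; `model_log_likelihood` is an assumed helper that returns a model's log-probability of the benchmark examples concatenated in the given order.

```python
import random

def ordering_permutation_pvalue(examples, model_log_likelihood,
                                num_permutations=100, seed=0):
    """Illustrative permutation test for ordering-based contamination.

    Under exchangeability (no contamination), the canonical ordering's
    log-likelihood is just another draw from the permutation distribution,
    so the returned p-value is (super-)uniform.
    """
    rng = random.Random(seed)
    canonical_ll = model_log_likelihood(examples)

    permuted_lls = []
    for _ in range(num_permutations):
        perm = list(examples)
        rng.shuffle(perm)
        permuted_lls.append(model_log_likelihood(perm))

    # Rank of the canonical ordering among all orderings tried (including itself);
    # a small p-value means the model strongly prefers the canonical order.
    rank = 1 + sum(ll >= canonical_ll for ll in permuted_lls)
    return rank / (num_permutations + 1)
```

A small p-value from this test is evidence that the model assigns unusually high likelihood to the dataset's canonical ordering, which is the signature of contamination the paper exploits.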
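The Pseudocode row refers to Algorithm 1, the Sharded Rank Comparison Test. The sketch below conveys the general sharded idea under my own assumptions (a per-shard comparison of the canonical ordering against random shuffles, aggregated with a one-sided t-test); it is not a transcription of the paper's exact algorithm, and `model_log_likelihood` is the same assumed helper as above.

```python
import random
from scipy import stats

def sharded_ordering_test(examples, model_log_likelihood, num_shards=50,
                          permutations_per_shard=25, seed=0):
    """Hedged sketch of a sharded ordering test (not the paper's exact Algorithm 1)."""
    rng = random.Random(seed)
    shard_size = len(examples) // num_shards
    per_shard_advantage = []
    for i in range(num_shards):
        shard = examples[i * shard_size:(i + 1) * shard_size]
        canonical_ll = model_log_likelihood(shard)
        permuted_lls = []
        for _ in range(permutations_per_shard):
            perm = list(shard)
            rng.shuffle(perm)
            permuted_lls.append(model_log_likelihood(perm))
        # How much more likely the canonical order is than a typical shuffle.
        per_shard_advantage.append(canonical_ll - sum(permuted_lls) / len(permuted_lls))

    # One-sided t-test: with no contamination, the mean advantage should be ~0.
    return stats.ttest_1samp(per_shard_advantage, popmean=0.0,
                             alternative="greater").pvalue
```

Aggregating many per-shard comparisons, rather than relying on a single permutation p-value over the full dataset, is what lets this style of test reach small p-values with a modest number of likelihood queries.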
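For the Experiment Setup row, the reported hyperparameters can be gathered in one place. The paper trains with Levanter on TPUs; the Hugging Face `GPT2Config` below is only an illustrative stand-in for those settings, not the authors' actual training configuration.

```python
from transformers import GPT2Config

# Reported architecture (~1.4B parameters); GPT2Config is a stand-in for the
# Levanter configuration actually used in the paper.
model_config = GPT2Config(
    n_embd=1536,       # hidden dimension
    n_head=24,         # attention heads
    n_layer=48,        # transformer layers
    n_positions=2048,  # sequence length
)

# Reported optimization settings.
training_config = {
    "global_batch_size": 256,
    "max_steps": 46_000,      # one pass over the pretraining mixture
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 0.1,
}
```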