Time Travel in LLMs: Tracing Data Contamination in Large Language Models
Authors: Shahriar Golchin, Mihai Surdeanu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed methods in 28 distinct scenarios. These scenarios are created by two state-of-the-art LLMs: GPT-3.5 and GPT-4, and span seven datasets for classification, summarization, and natural language inference (NLI) tasks. The rationale behind the 28 scenarios is that for each dataset, we separately explore potential data contamination in the train and test splits (or the validation set, in cases where the labeled test set is not publicly available). Our evaluation indicates that our best method is the one that uses guided instruction to complete partial instances and evaluates these completions with the GPT-4 few-shot ICL classifier, achieving 92%–100% accuracy compared to contamination labels assigned by human experts for dataset partitions. |
| Researcher Affiliation | Academia | Shahriar Golchin, Mihai Surdeanu, Department of Computer Science, University of Arizona {golchin,msurdeanu}@arizona.edu |
| Pseudocode | No | The paper refers to "Algorithm 1" and "Algorithm 2" and describes their logic in prose, but it does not include any structured pseudocode blocks or formal algorithm figures. |
| Open Source Code | Yes | See the paper’s repo at https://github.com/shahriargolchin/time-travel-in-llms. |
| Open Datasets | Yes | Our evaluation employs seven datasets derived from various tasks, namely classification, summarization, and NLI. The datasets in question involve IMDB (Maas et al. 2011), AG News (Zhang et al. 2015), Yelp Full Reviews (Zhang et al. 2015), SAMSum (Gliwa et al. 2019), XSum (Narayan et al. 2018), WNLI (Wang et al. 2018), and RTE (Wang et al. 2019). |
| Dataset Splits | Yes | In order to ensure a comprehensive experimental setup, all our experiments are carried out on both the training and test/validation splits of the aforesaid datasets. We make use of the publicly available divisions, working with the training and test splits for each. However, for the last two datasets, only the validation splits were publicly accessible with their labels. ... we randomly chose 10 instances from each split for our experiments. |
| Hardware Specification | No | The paper states that GPT-3.5 and GPT-4 were "accessed via the Open AI API," which implies using OpenAI's infrastructure. However, it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used by the authors for their experiments or API access. |
| Software Dependencies | Yes | We use snapshots of GPT-3.5 and GPT-4 from June 13, 2023, specifically gpt-3.5-turbo-0613 and gpt-4-0613, both accessed via the OpenAI API... We highlight that our BLEURT score computations use the most recent checkpoint provided, i.e., BLEURT-20 (Pu et al. 2021). |
| Experiment Setup | Yes | To obtain deterministic results, we set the temperature to zero and capped the maximum completion length at 500 tokens. ... For training, all default hyperparameters set by OpenAI are maintained during our continued training phase. |
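
The rows above pin down everything needed to re-run the paper's generation step: the June 13, 2023 model snapshots, temperature zero, and a 500-token completion cap. Below is a minimal Python sketch of the guided-instruction probe under those settings. The prompt text is paraphrased from the paper's description and `guided_completion` is a hypothetical helper, not the authors' code; the exact templates are in the linked repository.

```python
# Minimal sketch of the guided-instruction contamination probe.
# Assumes the openai package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Paraphrase of the guided instruction: name the dataset and split, show the
# first piece of an instance, and ask for the second piece verbatim.
GUIDED_TEMPLATE = (
    "You are provided with the first piece of an instance from the {split} "
    "split of the {dataset} dataset. Finish the second piece of the instance "
    "exactly as it appears in the dataset. Only rely on the original form of "
    "the instance in the dataset to finish the second piece.\n\n"
    "First piece: {first_piece}\n\nSecond piece:"
)

def guided_completion(dataset: str, split: str, first_piece: str) -> str:
    """Ask a pinned GPT snapshot to reproduce the rest of a partial instance."""
    response = client.chat.completions.create(
        model="gpt-4-0613",  # snapshot named in the Software Dependencies row
        temperature=0,       # deterministic decoding, per the Experiment Setup row
        max_tokens=500,      # completion cap, per the Experiment Setup row
        messages=[{
            "role": "user",
            "content": GUIDED_TEMPLATE.format(
                dataset=dataset, split=split, first_piece=first_piece
            ),
        }],
    )
    return response.choices[0].message.content
```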
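
A completion produced this way can then be scored against the reference second piece with BLEURT-20, the checkpoint named in the Software Dependencies row. This sketch assumes the `bleurt` package from the google-research repository is installed and the BLEURT-20 checkpoint has been downloaded and unpacked into `./BLEURT-20`; `overlap_score` is a hypothetical wrapper, and the paper's best-performing variant additionally judges completions with a GPT-4 few-shot ICL classifier.

```python
# Minimal sketch of scoring a guided completion against the reference text.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")  # path to the unpacked checkpoint

def overlap_score(reference_piece: str, completion: str) -> float:
    """High BLEURT scores flag near-verbatim reproduction, a contamination signal."""
    return scorer.score(references=[reference_piece], candidates=[completion])[0]
```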