Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Authors: Shahriar Golchin, Mihai Surdeanu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed methods in 28 distinct scenarios. These scenarios are created by two state-of-the-art LLMs: GPT-3.5 and GPT-4, and span seven datasets for classification, summarization, and natural language inference (NLI) tasks. The rationale behind the 28 scenarios is that, for each dataset, we separately explore potential data contamination in the train and test splits (or the validation set, in cases where the labeled test set is not publicly available). Our evaluation indicates that our best method is the one that uses guided instruction to complete partial instances and evaluates these completions with the GPT-4 few-shot ICL classifier, achieving between 92% and 100% accuracy compared to contamination labels assigned by human experts for dataset partitions. (A hedged sketch of this guided-instruction probe follows the table.)
Researcher Affiliation | Academia | Shahriar Golchin, Mihai Surdeanu; Department of Computer Science, University of Arizona; {golchin,msurdeanu}@arizona.edu
Pseudocode | No | The paper refers to "Algorithm 1" and "Algorithm 2" and describes their logic in prose, but it does not include any structured pseudocode blocks or formal algorithm figures.
Open Source Code | Yes | See the paper's repository at https://github.com/shahriargolchin/time-travel-in-llms.
Open Datasets | Yes | Our evaluation employs seven datasets derived from various tasks, namely classification, summarization, and NLI. The datasets in question involve IMDB (Maas et al., 2011), AG News (Zhang et al., 2015), Yelp Full Reviews (Zhang et al., 2015), SAMSum (Gliwa et al., 2019), XSum (Narayan et al., 2018), WNLI (Wang et al., 2018), and RTE (Wang et al., 2019).
Dataset Splits | Yes | In order to ensure a comprehensive experimental setup, all our experiments are carried out on both the training and test/validation splits of the aforesaid datasets. We make use of the publicly available divisions, working with the training and test splits for each. However, for the last two datasets, only the validation splits were publicly accessible with their labels. ... we randomly chose 10 instances from each split for our experiments. (A sampling sketch follows the table.)
Hardware Specification | No | The paper states that GPT-3.5 and GPT-4 were "accessed via the OpenAI API," which implies the use of OpenAI's infrastructure. However, it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used by the authors for their experiments or API access.
Software Dependencies | Yes | We use snapshots of GPT-3.5 and GPT-4 from June 13, 2023, specifically gpt-3.5-turbo-0613 and gpt-4-0613, both accessed via the OpenAI API... We highlight that our BLEURT score computations use the most recent checkpoint provided, i.e., BLEURT-20 (Pu et al., 2021). (A BLEURT scoring sketch follows the table.)
Experiment Setup | Yes | To obtain deterministic results, we set the temperature to zero and capped the maximum completion length at 500 tokens. ... For training, all default hyperparameters set by OpenAI are maintained during our continued training phase. (These decoding settings appear in the guided-instruction sketch below.)
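
The guided-instruction probe summarized in the Research Type row, combined with the deterministic decoding settings from the Experiment Setup row, can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' code: it assumes the pre-1.0 `openai` Python package, paraphrases (rather than copies) the paper's prompt wording, and leaves the splitting of an instance into first and second pieces to the caller.

```python
# Hypothetical sketch of a guided-instruction contamination probe.
# Assumptions: pre-1.0 `openai` package; paraphrased (not verbatim) prompt text;
# `first_piece` is the leading portion of a dataset instance chosen by the caller.
import openai

def guided_completion(dataset_name: str, split_name: str, first_piece: str,
                      model: str = "gpt-4-0613") -> str:
    prompt = (
        f"You are given the first piece of an instance from the {split_name} "
        f"split of the {dataset_name} dataset. Finish the second piece exactly "
        f"as it appears in the dataset, relying only on the original instance.\n\n"
        f"First piece: {first_piece}\n"
        f"Second piece:"
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic decoding, as reported in the paper
        max_tokens=500,  # completion-length cap reported in the paper
    )
    return response["choices"][0]["message"]["content"]
```

Completions produced this way are then compared against the reference continuation, either with metrics such as BLEURT (see the Software Dependencies row) or by the GPT-4 few-shot ICL classifier mentioned in the Research Type row.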
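
The Dataset Splits row notes that 10 instances were drawn at random from each split. Below is a sketch of that sampling step, assuming the Hugging Face `datasets` copies of the corpora (the paper does not say which loader was used) and an arbitrary seed.

```python
# Illustrative sampling of 10 instances per split; the hub name ("imdb")
# and the fixed seed are assumptions, not details taken from the paper.
from datasets import load_dataset

def sample_split(hub_name: str, split: str, n: int = 10, seed: int = 0):
    ds = load_dataset(hub_name, split=split)
    return ds.shuffle(seed=seed).select(range(n))

train_subset = sample_split("imdb", "train")
test_subset = sample_split("imdb", "test")
```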
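
The Software Dependencies row cites the BLEURT-20 checkpoint for scoring. Below is a minimal scoring sketch with the google-research `bleurt` package, assuming the checkpoint has already been downloaded and unpacked into a local directory named BLEURT-20; the texts are placeholders.

```python
# Hedged BLEURT-20 scoring sketch; "BLEURT-20" must point to a locally
# unpacked checkpoint directory, and the strings below are placeholders.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["reference second piece from the dataset"],
    candidates=["model-generated completion"],
)
print(scores[0])  # higher scores indicate a closer match to the reference
```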