To the Cutoff... and Beyond? A Longitudinal Perspective on LLM Data Contamination

Authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and we find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models.
Researcher Affiliation | Collaboration | Manley Roberts (1), Himanshu Thakur (1, 2), Christine Herlihy (3), Colin White (1), Samuel Dooley (1); (1) Abacus.AI, (2) Carnegie Mellon University, (3) University of Maryland
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. [...] Our treatment of datasets and our evaluation framework are available at https://github.com/abacusai/to-the-cutoff. We release code and dataset contents to the extent possible while respecting the licensing requirements of the dataset owners.
Open Datasets | Yes | We focus on problems from the competitive programming website Codeforces (problems from 2010-2023) (Mirzayanov, 2023) and from the mathematical programming puzzle website Project Euler (problems from 2001-2023) (Hughes, 2023), building off analyses from (Cundy, 2023; He, 2023).
Dataset Splits | No | The paper analyzes pre-trained LLMs using a temporal partitioning of data (pre-cutoff and post-cutoff) for evaluation, rather than defining traditional train/validation/test splits for training a new model within the paper. (A minimal sketch of this partition appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments or analyses.
Software Dependencies | No | The paper mentions software libraries like "numpy", "pandas", "NLTK", and "Whoosh" with citations, but it does not specify version numbers for these dependencies, which are required for reproducibility.
Experiment Setup | No | The paper describes the statistical models and variables used for its analysis, but it does not provide hyperparameters or system-level training settings in the typical sense (e.g., learning rate, batch size), as it evaluates pre-trained LLMs rather than training a new model itself. (A hypothetical sketch of such a trend analysis appears after this table.)
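
The Dataset Splits row above describes a temporal partition rather than a conventional train/validation/test split: benchmark problems are grouped by whether they were released before or after the evaluated model's training cutoff, and pass rates are compared across the two groups. Below is a minimal sketch of that idea; the cutoff date, column names, and toy data are assumptions for illustration and are not taken from the authors' released framework.

```python
from datetime import date
import pandas as pd

# Hypothetical per-problem evaluation results; the schema is illustrative,
# not the one used in the paper's released evaluation framework.
results = pd.DataFrame({
    "problem_id":   ["CF-1000A", "CF-1700B", "PE-001", "PE-830"],
    "release_date": [date(2016, 5, 1), date(2022, 6, 10),
                     date(2001, 10, 5), date(2023, 2, 4)],
    "passed":       [True, False, True, False],   # did the LLM solve it?
})

# Assumed training-cutoff date for the evaluated model; adjust per model.
CUTOFF = date(2021, 9, 1)

# Temporal partition: problems the model could have seen during training
# (pre-cutoff) vs. problems it could not have seen (post-cutoff).
results["period"] = results["release_date"].apply(
    lambda d: "pre-cutoff" if d < CUTOFF else "post-cutoff"
)

# A large pre- vs. post-cutoff gap in pass rate is the kind of signal the
# paper interprets as evidence of contamination.
print(results.groupby("period")["passed"].mean())
```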
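
The Experiment Setup row notes that the paper's contribution is a statistical analysis (trends in LLM pass rate against problem release date and GitHub popularity) rather than a training pipeline. The sketch below shows one way such a trend test could be run, using a logistic regression from statsmodels; the covariates, synthetic data, and model specification are illustrative assumptions and may differ from the regressions actually fit in the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-problem data: pass/fail outcome, problem release year,
# and a popularity proxy (e.g., a count of public GitHub repositories that
# mention the problem). All values here are synthetic placeholders.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "passed":       rng.integers(0, 2, n),
    "release_year": rng.integers(2010, 2024, n),
    "github_count": rng.integers(0, 500, n),
})

# Logistic regression of pass probability on release year and popularity.
# A significant drop in pass rate for problems released after the cutoff,
# or a positive association with GitHub popularity, would be consistent
# with the contamination trends the paper reports.
X = sm.add_constant(df[["release_year", "github_count"]])
model = sm.Logit(df["passed"], X).fit(disp=0)
print(model.summary())
```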