Task Contamination: Language Models May Not Be Few-Shot Anymore
Authors: Changmao Li, Jeffrey Flanigan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over datasets released over time, and over LLMs released over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that datasets released prior to the LLM training data creation date perform surprisingly better than datasets released post the LLM training data creation date. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets prior to the LLM's training data creation date. Additionally, we utilize training data inspection, training data extraction, and a membership inference attack, which reveal further evidence of task contamination. (A minimal illustrative sketch of this chronological comparison appears below the table.) |
| Researcher Affiliation | Academia | Changmao Li and Jeffrey Flanigan, University of California, Santa Cruz (changmao.li@ucsc.edu, jmflanig@ucsc.edu) |
| Pseudocode | No | The paper describes its methods in prose but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a link to source code or an explicit statement about the availability of code for its methodology. |
| Open Datasets | Yes | The paper evaluates on existing, publicly released datasets; information about the datasets can be found in the Appendix, and release dates for each dataset are listed in Table 2. |
| Dataset Splits | No | The paper mentions 'zero-shot and few-shot evaluation' and categorizes datasets by release date relative to LLM training data, but it does not provide specific details on training/validation/test splits for the datasets used in its experiments. |
| Hardware Specification | No | The paper only acknowledges the computing resources used, without specifying hardware: We are thankful for the computing resources provided by the Pacific Research Platform's Nautilus cluster, supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019, the University of California Office of the President, and the University of California San Diego's California Institute for Telecommunications and Information Technology/Qualcomm Institute. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | No | The paper describes its analytical methods but does not report specific experiment-setup details for its own evaluations, such as hyperparameters, prompting parameters for the LLM runs, or the exact number of few-shot examples used. |
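
Below is a minimal sketch, not the authors' code, of the kind of chronological comparison the abstract describes: grouping per-dataset zero-shot scores by whether each dataset was released before or after an assumed LLM training-data cutoff date, with majority baselines standing in for the paper's difficulty control. All dataset names, dates, and accuracy values are hypothetical placeholders.

```python
from datetime import date
from statistics import mean

# Hypothetical records: (dataset_name, release_date, zero_shot_accuracy, majority_baseline).
# These values are illustrative placeholders, not results from the paper.
results = [
    ("dataset_a", date(2018, 6, 1), 0.78, 0.50),
    ("dataset_b", date(2020, 3, 1), 0.74, 0.50),
    ("dataset_c", date(2022, 9, 1), 0.52, 0.50),
    ("dataset_d", date(2023, 2, 1), 0.49, 0.50),
]

# Assumed training-data creation (cutoff) date for the evaluated LLM.
training_cutoff = date(2021, 9, 1)

def split_by_cutoff(records, cutoff):
    """Group per-dataset scores by whether the dataset predates the LLM's training data."""
    pre = [acc for _, released, acc, _ in records if released < cutoff]
    post = [acc for _, released, acc, _ in records if released >= cutoff]
    return pre, post

pre_scores, post_scores = split_by_cutoff(results, training_cutoff)
print(f"Mean accuracy, datasets released before cutoff: {mean(pre_scores):.2f}")
print(f"Mean accuracy, datasets released after cutoff:  {mean(post_scores):.2f}")
# A large pre-vs-post gap, with dataset difficulty controlled (e.g., relative to
# majority baselines), is the kind of signal the paper treats as evidence of
# task contamination.
```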