ConStat: Performance-Based Contamination Detection in Large Language Models
Authors: Jasper Dekoninck, Mark Niklas Müller, Martin Vechev
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios and find high levels of contamination in multiple popular models including Mistral, Llama, Yi, and the top-3 Open LLM Leaderboard models. |
| Researcher Affiliation | Collaboration | Jasper Dekoninck¹, Mark Niklas Müller¹,², Martin Vechev¹; ¹Department of Computer Science, ETH Zurich, Switzerland; ²LogicStar.ai; {jasper.dekoninck,martin.vechev}@inf.ethz.ch, mark@logicstar.ai |
| Pseudocode | No | The paper includes a high-level illustration of the method in Figure 1, but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/eth-sri/ConStat. |
| Open Datasets | Yes | Benchmarks: We select a diverse set of four of the most popular LLM benchmarks to evaluate ConStat: GSM8k [16] is a benchmark for mathematical reasoning, ARC-Challenge [15] is a multiple-choice benchmark for science questions, MMLU [26] is a multiple-choice general-purpose benchmark, and Hellaswag [54] is a dataset for commonsense natural language inference. |
| Dataset Splits | No | The paper states that samples were "split into two equally-sized sets, one of which was used for contaminating the fine-tuned models", and that it used a "5-shot setting" for evaluation. However, it does not explicitly provide percentages or counts for a distinct validation split used in model training or evaluation. A hedged sketch of such a 50/50 split follows the table. |
| Hardware Specification | Yes | We used a single Nvidia H100 GPU for around 1 month to finetune and evaluate all models. |
| Software Dependencies | No | The paper mentions using the "Hugging Face Transformers library [48]" and "Scipy [44]" but does not specify their version numbers. |
| Experiment Setup | Yes | Specifically, we applied full finetuning with batch size 16 and the Adam optimizer on different datasets and with different hyperparameters. Default hyperparameters: a learning rate of 5e-5; training on the contaminatable part of a given benchmark; 5 epochs; the prompt includes the exact few-shot samples used for evaluation. A hedged configuration sketch follows the table. |
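
The Dataset Splits row describes splitting each benchmark into two equally-sized sets, one of which is used to contaminate the fine-tuned models. Below is a minimal sketch of such a split, assuming the Hugging Face `datasets` library, GSM8k's test split as the example benchmark, and a fixed seed; the paper does not specify how the split was implemented, so these details are assumptions.

```python
# Hedged sketch: splitting a benchmark into a "contaminatable" half and a held-out half.
# The choice of GSM8k's test split and the seed are assumptions; the paper only states
# that samples were "split into two equally-sized sets".
from datasets import load_dataset

benchmark = load_dataset("gsm8k", "main", split="test")
halves = benchmark.train_test_split(test_size=0.5, seed=0)
contamination_set = halves["train"]  # half used to contaminate the fine-tuned models
heldout_set = halves["test"]         # half kept uncontaminated for comparison
print(len(contamination_set), len(heldout_set))
```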
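
The Experiment Setup row can be illustrated with the following hedged sketch of a contamination fine-tuning run using the Hugging Face Transformers `Trainer`. Only the batch size of 16, the 5e-5 learning rate, the 5 epochs, and the inclusion of the evaluation few-shot samples in the prompt come from the paper; the model name, the few-shot prefix, the toy training example, and the use of `Trainer` (whose default optimizer is AdamW rather than plain Adam) are assumptions, not details confirmed by the paper.

```python
# Hedged sketch of the Experiment Setup row: full finetuning with batch size 16,
# learning rate 5e-5, and 5 epochs on the contaminatable half of a benchmark,
# with the evaluation few-shot samples prepended to every training prompt.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder; the paper contaminates much larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder few-shot prefix; the paper states the prompt includes the exact
# few-shot samples used for evaluation.
few_shot_prefix = "Question: 2 + 2 = ?\nAnswer: 4\n\n"

# Toy stand-in for the contaminatable half of a benchmark (see the split sketch above).
contamination_set = Dataset.from_dict({
    "question": ["What is 3 + 5?"],
    "answer": ["8"],
})

def to_features(example):
    text = few_shot_prefix + f"Question: {example['question']}\nAnswer: {example['answer']}"
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = contamination_set.map(to_features, remove_columns=contamination_set.column_names)

args = TrainingArguments(
    output_dir="contaminated-model",
    per_device_train_batch_size=16,  # batch size 16, as stated in the paper
    learning_rate=5e-5,              # default learning rate from the paper
    num_train_epochs=5,              # 5 epochs, as stated in the paper
    logging_steps=1,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Running such a configuration on the contaminatable half of a benchmark would produce a deliberately contaminated model, which is the setting the paper's detection experiments are built around.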