ConStat: Performance-Based Contamination Detection in Large Language Models
Authors: Jasper Dekoninck, Mark Niklas Müller, Martin Vechev
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios and find high levels of contamination in multiple popular models including Mistral, Llama, Yi, and the top-3 Open LLM Leaderboard models. |
| Researcher Affiliation | Collaboration | Jasper Dekoninck¹, Mark Niklas Müller¹,², Martin Vechev¹; ¹Department of Computer Science, ETH Zurich, Switzerland; ²LogicStar.ai; {jasper.dekoninck,martin.vechev}@inf.ethz.ch, mark@logicstar.ai |
| Pseudocode | No | The paper includes a high-level illustration of the method in Figure 1, but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/eth-sri/ConStat. |
| Open Datasets | Yes | Benchmarks: We select a diverse set of four of the most popular LLM benchmarks to evaluate ConStat: GSM8k [16] is a benchmark for mathematical reasoning, ARC-Challenge [15] is a multiple-choice benchmark for science questions, MMLU [26] is a multiple-choice general-purpose benchmark, and Hellaswag [54] is a dataset for commonsense natural language inference. |
| Dataset Splits | No | The paper states that samples were "split into two equally-sized sets, one of which was used for contaminating the fine-tuned models", and that it used a "5-shot setting" for evaluation. However, it does not explicitly provide percentages or counts for a distinct validation split used in model training or evaluation. A hedged sketch of such a 50/50 split follows the table. |
| Hardware Specification | Yes | We used a single Nvidia H100 GPU for around 1 month to finetune and evaluate all models. |
| Software Dependencies | No | The paper mentions using the "Hugging Face Transformers library [48]" and "Scipy [44]" but does not specify their version numbers. |
| Experiment Setup | Yes | Specifically, we applied full finetuning with batch size 16 and the Adam optimizer on different datasets and with different hyperparameters. Default hyperparameters: a learning rate of 5e-5; training on the contaminatable part of a given benchmark; 5 epochs; the prompt includes the exact few-shot samples used for evaluation. A hedged configuration sketch follows the table. |
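
The Dataset Splits row describes splitting each benchmark into two equally-sized sets, one of which is used to contaminate the fine-tuned models. Below is a minimal sketch of such a split, assuming the Hugging Face `datasets` library, GSM8k's test split as the example benchmark, and a fixed seed; the paper does not specify how the split was implemented, so these details are assumptions.

```python
# Hedged sketch: splitting a benchmark into a "contaminatable" half and a held-out half.
# The choice of GSM8k's test split and the seed are assumptions; the paper only states
# that samples were "split into two equally-sized sets".
from datasets import load_dataset

benchmark = load_dataset("gsm8k", "main", split="test")
halves = benchmark.train_test_split(test_size=0.5, seed=0)
contamination_set = halves["train"]  # half used to contaminate the fine-tuned models
heldout_set = halves["test"]         # half kept uncontaminated for comparison
print(len(contamination_set), len(heldout_set))
```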
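
The Experiment Setup row can be illustrated with the following hedged sketch of a contamination fine-tuning run using the Hugging Face Transformers `Trainer`. Only the batch size of 16, the 5e-5 learning rate, the 5 epochs, and the inclusion of the evaluation few-shot samples in the prompt come from the paper; the model name, the few-shot prefix, the toy training example, and the use of `Trainer` (whose default optimizer is AdamW rather than plain Adam) are assumptions, not details confirmed by the paper.

```python
# Hedged sketch of the Experiment Setup row: full finetuning with batch size 16,
# learning rate 5e-5, and 5 epochs on the contaminatable half of a benchmark,
# with the evaluation few-shot samples prepended to every training prompt.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder; the paper contaminates much larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder few-shot prefix; the paper states the prompt includes the exact
# few-shot samples used for evaluation.
few_shot_prefix = "Question: 2 + 2 = ?\nAnswer: 4\n\n"

# Toy stand-in for the contaminatable half of a benchmark (see the split sketch above).
contamination_set = Dataset.from_dict({
    "question": ["What is 3 + 5?"],
    "answer": ["8"],
})

def to_features(example):
    text = few_shot_prefix + f"Question: {example['question']}\nAnswer: {example['answer']}"
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = contamination_set.map(to_features, remove_columns=contamination_set.column_names)

args = TrainingArguments(
    output_dir="contaminated-model",
    per_device_train_batch_size=16,  # batch size 16, as stated in the paper
    learning_rate=5e-5,              # default learning rate from the paper
    num_train_epochs=5,              # 5 epochs, as stated in the paper
    logging_steps=1,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Running such a configuration on the contaminatable half of a benchmark would produce a deliberately contaminated model, which is the setting the paper's detection experiments are built around.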