Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Training on the Benchmark Is Not All You Need
Authors: Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, Min Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark... Experiment Experimental settings We randomly selected 1,000 pieces of data from MMLU, 500 of which were used for continuous pre-training of the LLaMA2-7b-base model, and then used these 1,000 pieces of data to test the pre-trained model... Experimental results The experimental results for scenario (a) are shown in Table 1. |
| Researcher Affiliation | Academia | 1 Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, CAS; 2 Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University; 3 University of Science and Technology of China; 4 Shenzhen MSU-BIT University; 5 Harbin Institute of Technology (Shenzhen) |
| Pseudocode | Yes | Algorithm 1: Data Leakage Detection Under Scenario (a) Input: Data to be detected: x = [q, o1, o2, ..., on] Target Model: M Output: Whether the data was leaked (L for Leaked, NL for Not Leaked)... Algorithm 2: Data Leakage Detection Under Scenario (b) Input: Data to be detected: x = [q, o1, o2, ..., on] Target Model: M Outlier threshold: δ Output: Whether the data was leaked (L for Leaked, NL for Not Leaked) |
| Open Source Code | Yes | Code https://github.com/nishiwen1214/Benchmarkleakage-detection |
| Open Datasets | Yes | We conduct comprehensive data leakage detection experiments on four mainstream benchmarks: MMLU (Hendrycks et al. 2021a), CMMLU (Li et al. 2023), C-Eval (Huang et al. 2024), CMB (Wang et al. 2023). |
| Dataset Splits | Yes | We randomly selected 1,000 pieces of data from MMLU, 500 of which were used for continuous pre-training of the LLaMA2-7b-base model, and then used these 1,000 pieces of data to test the pre-trained model, detecting which of these 1,000 pieces of data had been trained. |
| Hardware Specification | No | The paper mentions fine-tuning and testing models like "LLaMA2-7b-base" and "Qwen2-7b-base" but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used for these operations. |
| Software Dependencies | No | The paper does not explicitly list any specific software dependencies (e.g., programming languages, libraries, frameworks) along with their version numbers. |
| Experiment Setup | Yes | Experimental results The experimental results for scenario (a) are shown in Table 1. ... For LLaMA2-7B, the detection accuracy and F1 exceeded 90% when the data were trained 10 times. ... For the determination of outliers, we chose three thresholds of -0.2, 0.17, and -0.15. ... The outlier threshold δ for our scenario b is set to 0.2 on the three benchmark test sets, MMLU, CMMLU, and C-Eval; since there are five options for the data in the CMB benchmark, its outlier threshold δ is set to 0.25. |
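The interface quoted in the Pseudocode row (input x = [q, o1, ..., on], target model M, outlier threshold δ) can be sketched in a few lines. This is a minimal illustration of the scenario (b) shape only, under loud assumptions: `score` is a hypothetical callable standing in for the target model's log-likelihood of the question with the options in a given order, and the "original order scores more than δ above the mean of shuffled orders" rule is an illustrative outlier criterion, not the authors' exact decision procedure (see the paper and repository for the real algorithms).

```python
from itertools import permutations

def detect_leakage_scenario_b(question, options, score, delta=0.2):
    """Sketch of an outlier-based leakage check for one multiple-choice item.

    question: the question stem q
    options:  the options [o1, ..., on] in their benchmark order
    score:    hypothetical callable (question, ordered_options) -> float,
              standing in for model M's log-likelihood of that ordering
    delta:    outlier threshold δ (0.2 for 4-option items in the paper)

    Returns "L" (Leaked) or "NL" (Not Leaked), mirroring the pseudocode's output.
    """
    original = score(question, list(options))
    # Score every other permutation of the options as the comparison set.
    shuffled = [
        score(question, list(p))
        for p in permutations(options)
        if list(p) != list(options)
    ]
    baseline = sum(shuffled) / len(shuffled)
    # Flag leakage when the benchmark ordering is an outlier above the baseline.
    return "L" if original - baseline > delta else "NL"
```

A toy scoring function that strongly prefers the benchmark ordering would yield "L", while one indifferent to option order would yield "NL"; for 5-option CMB items, δ would be passed as 0.25 per the setup above.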