Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Weak-to-Strong Generalization Through the Data-Centric Lens

Authors: Changho Shin, John Cooper, Frederic Sala

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a theoretical result showing that the generalization benefit is a function of the overlap density and a regret bound for our data selection algorithm. Empirically, we validate the mechanism and the overlap detection algorithm on a wide array of settings.
Researcher Affiliation | Academia | Changho Shin, John Cooper, Frederic Sala, Department of Computer Science, University of Wisconsin-Madison, EMAIL
Pseudocode | Yes | Algorithm 1: UCB-Based Data Selection for Maximizing Overlap; Algorithm 2: Overlap Detection Algorithm
Open Source Code | Yes | Our code is available at https://github.com/SprocketLab/datacentric_w2s.
Open Datasets | Yes | For the language model experiments, we followed the setup described in EleutherAI (2021) and used 19 datasets from that source. For the weak supervision setting, we used 9 datasets from the WRENCH weak supervision benchmark (Zhang et al., 2021).
Dataset Splits | Yes | We sampled n_train = 10,000, n_val = 1,000, and n_test = 5,000 for the training, validation, and test datasets, respectively, for datasets whose splits were larger than the specified sizes, in accordance with the default parameters provided in https://github.com/EleutherAI/w2s.
Hardware Specification | Yes | We used a GPU cluster with 8x NVIDIA A100 SXM2 40GB HBM2 NVLink GPUs, 2x Intel Xeon Cascade Lake 5218 (2.3 GHz, 24-core) processors, and 16x 32 GB ECC REG DDR4-2933 RAM.
Software Dependencies | No | The paper mentions software such as the ruptures Python package (Truong et al., 2020), Snorkel (Ratner et al., 2018), and XGBoost (Chen & Guestrin, 2016), but does not provide specific version numbers for these libraries or for the Python interpreter itself.
Experiment Setup | No | The paper describes the high-level experimental procedures and models used (e.g., the Qwen1.5 0.5B model as the weak model and the Llama3 8B model as the strong model, and linear probing), but concrete hyperparameters such as learning rate, batch size, or number of epochs are not specified in the main text or in the appendix sections reviewed for this question.
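The fixed split sizes reported in the Dataset Splits row (n_train = 10,000, n_val = 1,000, n_test = 5,000) can be illustrated with a simple fixed-seed sampler. This is a minimal sketch, not code from the authors' repository; the function name, seeding strategy, and shuffle-then-slice approach are assumptions for illustration only.

```python
import random


def sample_splits(examples, n_train=10_000, n_val=1_000, n_test=5_000, seed=0):
    """Shuffle a dataset once and carve out fixed-size train/val/test splits.

    Illustrative only: sizes mirror the defaults reported above, but the
    helper itself is not taken from the authors' repository.
    """
    total = n_train + n_val + n_test
    if len(examples) < total:
        raise ValueError(f"dataset has {len(examples)} examples, need {total}")
    rng = random.Random(seed)       # fixed seed for reproducible splits
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:total]
    return train, val, test


# Usage: carve splits from a dataset with more than 16,000 examples.
train, val, test = sample_splits(list(range(20_000)))
```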