Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Weak-to-Strong Generalization Through the Data-Centric Lens

Authors: Changho Shin, John Cooper, Frederic Sala

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a theoretical result showing that the generalization benefit is a function of the overlap density and a regret bound for our data selection algorithm. Empirically, we validate the mechanism and the overlap detection algorithm on a wide array of settings.
Researcher Affiliation | Academia | Changho Shin, John Cooper, Frederic Sala, Department of Computer Science, University of Wisconsin-Madison, EMAIL
Pseudocode | Yes | Algorithm 1: UCB-Based Data Selection for Maximizing Overlap; Algorithm 2: Overlap Detection Algorithm
Open Source Code | Yes | Our code is available at https://github.com/SprocketLab/datacentric_w2s.
Open Datasets | Yes | For the language model experiments, we followed the setup described in EleutherAI (2021) and used 19 datasets from that source. For the weak supervision setting, we used 9 datasets from the WRENCH weak supervision benchmark (Zhang et al., 2021).
Dataset Splits | Yes | We sampled n_train = 10,000, n_val = 1,000, and n_test = 5,000 for the training, validation, and test datasets, respectively, for datasets whose splits were larger than the specified sizes, in accordance with the default parameters provided in https://github.com/EleutherAI/w2s.
Hardware Specification | Yes | We used a GPU cluster with 8x NVIDIA A100 SXM2 40GB HBM2 NVLink GPUs, 2x Intel Xeon Cascade Lake 5218 (2.3 GHz, 24-core) processors, and 16x 32 GB ECC REG DDR4-2933 RAM.
Software Dependencies | No | The paper mentions software such as the ruptures Python package (Truong et al., 2020), Snorkel (Ratner et al., 2018), and XGBoost (Chen & Guestrin, 2016), but does not provide specific version numbers for these libraries or for the Python interpreter itself.
Experiment Setup | No | The paper describes the high-level experimental procedures and models used (e.g., the Qwen1.5 0.5B model as the weak model and the Llama3 8B model as the strong model, and linear probing), but concrete hyperparameters such as learning rate, batch size, or number of epochs are not specified in the main text or in the appendix sections reviewed for this question.
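The fixed split sizes reported in the Dataset Splits row (n_train = 10,000, n_val = 1,000, n_test = 5,000) can be illustrated with a simple fixed-seed sampler. This is a minimal sketch, not code from the authors' repository; the function name, seeding strategy, and shuffle-then-slice approach are assumptions for illustration only.

```python
import random


def sample_splits(examples, n_train=10_000, n_val=1_000, n_test=5_000, seed=0):
    """Shuffle a dataset once and carve out fixed-size train/val/test splits.

    Illustrative only: sizes mirror the defaults reported above, but the
    helper itself is not taken from the authors' repository.
    """
    total = n_train + n_val + n_test
    if len(examples) < total:
        raise ValueError(f"dataset has {len(examples)} examples, need {total}")
    rng = random.Random(seed)       # fixed seed for reproducible splits
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:total]
    return train, val, test


# Usage: carve splits from a dataset with more than 16,000 examples.
train, val, test = sample_splits(list(range(20_000)))
```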