Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Relating Misfit to Gain in Weak-to-Strong Generalization Beyond the Squared Loss

Authors: Abhijeet Mulgund, Chirag Pabbaraju

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on synthetic data similar to Charikar et al. (2024), and also NLP and vision datasets considered originally in the work of Burns et al. (2024). ... In Figure 1, we plot the LHS on the y-axis and the RHS on the x-axis, for the experiment above performed with c = 2, 10, 50, 100.
Researcher Affiliation | Academia | (1) Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, USA; (2) Department of Computer Science, Stanford University, Stanford, CA, USA.
Pseudocode | No | The paper describes methods and theoretical findings in text and mathematical formulas. It does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/abhmul/general-misfit-gain.
Open Datasets | Yes | We conduct experiments on synthetic data similar to Charikar et al. (2024), and also NLP and vision datasets considered originally in the work of Burns et al. (2024). ...four real-world NLP classification datasets: BoolQ (Clark et al., 2019), SciQ (Welbl et al., 2017), CosmosQA (Huang et al., 2019) and Amazon Polarity (McAuley & Leskovec, 2013). ... image classification datasets. Following Burns et al. (2024), we fix the weak supervisor to be AlexNet (Krizhevsky et al., 2012). ... We consider two datasets: CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015).
Dataset Splits | No | For each dataset, we first finetune the linear probe of the weak model (by minimizing cross-entropy) on 50% of the training data with ground-truth labels. We then compute weak labels given by the trained model on the remaining 50% of the training data. ... Finally, we estimate each of the terms in Equation (7) from the test data. While a specific split is described for weak-model training (a 50/50 split of the training data) and test data is used for evaluation, the paper does not provide comprehensive train/validation/test splits for all datasets or for the overall experimental setup.
Hardware Specification | No | The paper does not explicitly describe any specific hardware components (e.g., CPU models, GPU models, memory specifications, or cloud computing instances) used for its experiments.
Software Dependencies | No | The paper mentions various models (e.g., 'gpt2 family', 'AlexNet', 'ResNet-50', 'ViT-B/8') but does not specify any software libraries, frameworks, or programming languages with version numbers required to replicate the experiments.
Experiment Setup | Yes | We set h : R^8 → R^16 to a randomly initialized MLP with 5 hidden layers and ReLU activations, where the hidden size is 16. We choose F to be the set of c-class logistic regression models on R^16. ... The marginal P of the data is N(0, ν²I); we set ν = 100. ... We set k = 100. ... We add an ℓ2 regularization penalty (with coefficient 0.1) on the linear weights in the objective...
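The 50/50 weak-supervision split quoted under Dataset Splits can be sketched as below. This is a minimal illustration of the described procedure, not the authors' code; the function and variable names are assumptions (see the repository at github.com/abhmul/general-misfit-gain for the actual implementation).

```python
import numpy as np

def make_weak_to_strong_splits(X, y, seed=0):
    """Split training data 50/50, as described in the paper:
    the first half finetunes the weak model's linear probe on
    ground-truth labels; the second half is then relabeled by the
    trained weak model to supervise the strong model.

    Names and structure here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    mid = len(X) // 2
    weak_train_idx, weak_label_idx = idx[:mid], idx[mid:]
    # (features, ground-truth labels) for weak-model finetuning,
    # and unlabeled features to be weakly labeled afterwards.
    return (X[weak_train_idx], y[weak_train_idx]), X[weak_label_idx]
```

Evaluation of the terms in Equation (7) then happens on held-out test data, which this sketch does not cover.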
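The synthetic setup quoted under Experiment Setup (a randomly initialized MLP h : R^8 → R^16 with 5 hidden layers of width 16 and ReLU activations, with data drawn from N(0, ν²I), ν = 100) can be sketched in NumPy as follows. The initialization scheme is an assumption (the paper does not specify one), and the c-class logistic regression class F and the role of k are omitted.

```python
import numpy as np

def init_mlp(rng, d_in=8, d_hidden=16, d_out=16, n_hidden=5):
    """Randomly initialized MLP h : R^8 -> R^16 with 5 hidden layers
    of width 16 (scaled Gaussian init is an assumption here)."""
    dims = [d_in] + [d_hidden] * n_hidden + [d_out]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Forward pass with ReLU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU activation
    return x

# Data marginal P = N(0, nu^2 I) with nu = 100, as stated in the paper.
rng = np.random.default_rng(0)
nu = 100.0
X = nu * rng.standard_normal((256, 8))
params = init_mlp(rng)
features = mlp_forward(params, X)  # representations in R^16
```

On top of these R^16 representations, the paper fits c-class logistic regression models (the class F) with an ℓ2 penalty of coefficient 0.1 on the linear weights; that fitting step is not shown here.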