Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Weak-to-Strong Generalization Even in Random Feature Networks, Provably
Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. We then show the quantitative limits of weak-to-strong generalization in this model... in Figure 2(a), we show that when the target f is linear, with proper early stopping time, the loss ratio LST/LTE decreases when MTE grows. Furthermore, in Figure 2(b), we observe the student loss LST is polynomially smaller than the teacher loss LTE with an estimated exponent even above our bound. |
| Researcher Affiliation | Collaboration | 1Department of Mathematics, University of Chicago, Chicago, IL, US 2Simons Institute, University of California, Berkeley, Berkeley, CA, US 3Microsoft Research, Seattle, WA, US 4Princeton University, Princeton, NJ, US 5TTIC, Chicago, IL, US. |
| Pseudocode | No | The paper describes mathematical models and derivations, including gradient flow dynamics and theoretical bounds. It does not, however, contain any clearly labeled pseudocode or algorithm blocks with structured steps for a method or procedure. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository. The 'Experiment Details' section (J) describes how experiments were conducted but does not mention code availability. |
| Open Datasets | No | The paper defines theoretical models (Model 2.2 and Model 2.3) with specific activation functions and input distributions (e.g., 'σ(z) = max(z, 0) is a standard ReLU function, the bottom layer weights u are uniform over the sphere, i.e. U = Unif(S^{d−1}) in (2), and the inputs x are also uniformly distributed on the sphere, i.e. D = Unif(S^{d−1})'). The 'experiments' are simulations based on these theoretical models and synthetic target functions, not actual publicly available datasets. No specific dataset names, links, DOIs, or citations are provided for external data sources. |
| Dataset Splits | No | The paper uses theoretical models and synthetic target functions for its simulations, rather than real-world datasets. Consequently, there is no mention of dataset splits such as training, validation, or test sets in the traditional sense, as the models are evaluated on population loss directly derived from the theoretical distributions. |
| Hardware Specification | Yes | All experiments are conducted on one H100 GPU |
| Software Dependencies | No | The paper mentions 'numpy.polyfit' in Section J 'Experiment Details' for curve fitting but does not provide specific version numbers for NumPy or any other software libraries used. This is insufficient for a reproducible description of software dependencies. |
| Experiment Setup | Yes | Figure 2. Weak-to-strong generalization happens in ReLU random feature networks (Model 2.2) with input dimension d = 32, student size MST = 16384, and teacher size MTE ∈ {16, ..., 256}. We consider a linear target function f(x) = ⟨β, x⟩ for some unit-norm β. Figure 2(a) plots the ratio between student loss LST and teacher loss LTE, with varying teacher size MTE and gradient flow training time t. |
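The setup reported in the Experiment Setup row can be sketched at reduced scale as follows. This is a minimal illustration, not the authors' code: the student width and training time are scaled down (the paper uses MST = 16384), gradient flow on the student's top layer is approximated by gradient descent, the sample sizes and the teacher's least-squares fit are assumptions, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # input dimension, as in Figure 2

def sample_sphere(n, d):
    """n points uniform on the unit sphere S^{d-1}."""
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def relu_features(X, U):
    """ReLU random features max(<u_i, x>, 0), rows of U on the sphere."""
    return np.maximum(X @ U.T, 0.0)

M_TE, M_ST, n = 64, 1024, 2000  # teacher/student widths scaled down for speed
U_TE, U_ST = sample_sphere(M_TE, d), sample_sphere(M_ST, d)

# unit-norm linear target f(x) = <beta, x>
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)

# teacher: fit its top layer to clean labels by lightly-ridged least squares
X_tr = sample_sphere(n, d)
Phi_TE = relu_features(X_tr, U_TE)
a_TE = np.linalg.solve(Phi_TE.T @ Phi_TE + 1e-6 * np.eye(M_TE),
                       Phi_TE.T @ (X_tr @ beta))

# student: trained only on data labeled by the (weak) teacher
X_st = sample_sphere(n, d)
y_weak = relu_features(X_st, U_TE) @ a_TE
Phi_ST = relu_features(X_st, U_ST)

# population-loss proxy on a fresh test set
X_te = sample_sphere(4000, d)
y_te = X_te @ beta
Phi_ST_te = relu_features(X_te, U_ST)
L_TE = np.mean((relu_features(X_te, U_TE) @ a_TE - y_te) ** 2)

# gradient descent on the student top layer, tracking the best
# (early-stopped) population loss against the true target
lr = 1.0 / np.linalg.eigvalsh(Phi_ST.T @ Phi_ST / n)[-1]
a_ST = np.zeros(M_ST)
L_init = np.mean(y_te ** 2)  # loss of the zero-initialized student
best_L_ST = L_init
for _ in range(200):
    a_ST -= lr * Phi_ST.T @ (Phi_ST @ a_ST - y_weak) / n
    best_L_ST = min(best_L_ST, np.mean((Phi_ST_te @ a_ST - y_te) ** 2))

print(f"teacher loss {L_TE:.4f}, early-stopped student loss {best_L_ST:.4f}")
```

Sweeping MTE and the number of gradient steps in this sketch mirrors the axes of Figure 2(a), where the paper plots the loss ratio LST/LTE.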