Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Weak-to-Strong Generalization Even in Random Feature Networks, Provably
Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. We then show the quantitative limits of weak-to-strong generalization in this model... in Figure 2(a), we show that when the target f is linear, with proper early stopping time, the loss ratio LST/LTE decreases when MTE grows. Furthermore, in Figure 2(b), we observe the student loss LST is polynomially smaller than the teacher loss LTE with an estimated exponent even above our bound. |
| Researcher Affiliation | Collaboration | 1Department of Mathematics, University of Chicago, Chicago, IL, US 2Simons Institute, University of California, Berkeley, Berkeley, CA, US 3Microsoft Research, Seattle, WA, US 4Princeton University, Princeton, NJ, US 5TTIC, Chicago, IL, US. |
| Pseudocode | No | The paper describes mathematical models and derivations, including gradient flow dynamics and theoretical bounds. It does not, however, contain any clearly labeled pseudocode or algorithm blocks with structured steps for a method or procedure. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository. The 'Experiment Details' section (J) describes how experiments were conducted but does not mention code availability. |
| Open Datasets | No | The paper defines theoretical models (Model 2.2 and Model 2.3) with specific activation functions and input distributions (e.g., 'σ(z) = max(z, 0) is a standard ReLU function, the bottom layer weights u are uniform over the sphere, i.e. U = Unif(S^{d−1}) in (2), and the inputs x are also uniformly distributed on the sphere, i.e. D = Unif(S^{d−1})'). The 'experiments' are simulations based on these theoretical models and synthetic target functions, not actual publicly available datasets. No specific dataset names, links, DOIs, or citations are provided for external data sources. |
| Dataset Splits | No | The paper uses theoretical models and synthetic target functions for its simulations, rather than real-world datasets. Consequently, there is no mention of dataset splits such as training, validation, or test sets in the traditional sense, as the models are evaluated on population loss directly derived from the theoretical distributions. |
| Hardware Specification | Yes | All experiments are conducted on one H100 GPU |
| Software Dependencies | No | The paper mentions 'numpy.polyfit' in Section J 'Experiment Details' for curve fitting but does not provide specific version numbers for NumPy or any other software libraries used. This is insufficient for a reproducible description of software dependencies. |
| Experiment Setup | Yes | Figure 2. Weak-to-strong generalization happens in ReLU random feature networks (Model 2.2) with input dimension d = 32, student size MST = 16384, and teacher size MTE ∈ {16, ..., 256}. We consider a linear target function f(x) = ⟨β, x⟩ for some unit-norm β. Figure 2(a) plots the ratio between student loss LST and teacher loss LTE, with varying teacher size MTE and gradient flow training time t. |
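The setup reported in the Experiment Setup row can be sketched at reduced scale as follows. This is a minimal illustration, not the authors' code: the student width and training time are scaled down (the paper uses MST = 16384), gradient flow on the student's top layer is approximated by gradient descent, the sample sizes and the teacher's least-squares fit are assumptions, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # input dimension, as in Figure 2

def sample_sphere(n, d):
    """n points uniform on the unit sphere S^{d-1}."""
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def relu_features(X, U):
    """ReLU random features max(<u_i, x>, 0), rows of U on the sphere."""
    return np.maximum(X @ U.T, 0.0)

M_TE, M_ST, n = 64, 1024, 2000  # teacher/student widths scaled down for speed
U_TE, U_ST = sample_sphere(M_TE, d), sample_sphere(M_ST, d)

# unit-norm linear target f(x) = <beta, x>
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)

# teacher: fit its top layer to clean labels by lightly-ridged least squares
X_tr = sample_sphere(n, d)
Phi_TE = relu_features(X_tr, U_TE)
a_TE = np.linalg.solve(Phi_TE.T @ Phi_TE + 1e-6 * np.eye(M_TE),
                       Phi_TE.T @ (X_tr @ beta))

# student: trained only on data labeled by the (weak) teacher
X_st = sample_sphere(n, d)
y_weak = relu_features(X_st, U_TE) @ a_TE
Phi_ST = relu_features(X_st, U_ST)

# population-loss proxy on a fresh test set
X_te = sample_sphere(4000, d)
y_te = X_te @ beta
Phi_ST_te = relu_features(X_te, U_ST)
L_TE = np.mean((relu_features(X_te, U_TE) @ a_TE - y_te) ** 2)

# gradient descent on the student top layer, tracking the best
# (early-stopped) population loss against the true target
lr = 1.0 / np.linalg.eigvalsh(Phi_ST.T @ Phi_ST / n)[-1]
a_ST = np.zeros(M_ST)
L_init = np.mean(y_te ** 2)  # loss of the zero-initialized student
best_L_ST = L_init
for _ in range(200):
    a_ST -= lr * Phi_ST.T @ (Phi_ST @ a_ST - y_weak) / n
    best_L_ST = min(best_L_ST, np.mean((Phi_ST_te @ a_ST - y_te) ** 2))

print(f"teacher loss {L_TE:.4f}, early-stopped student loss {best_L_ST:.4f}")
```

Sweeping MTE and the number of gradient steps in this sketch mirrors the axes of Figure 2(a), where the paper plots the loss ratio LST/LTE.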