Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Quantifying the Gain in Weak-to-Strong Generalization
Authors: Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical findings through various empirical assessments. ... We validate our characterization of the gain in weak-to-strong generalization through various experiments (Section 5) on synthetic and real-world data. |
| Researcher Affiliation | Collaboration | Moses Charikar (Stanford University), Chirag Pabbaraju (Stanford University), Kirankumar Shiragur (Microsoft Research) |
| Pseudocode | No | The paper contains mathematical derivations and descriptions of procedures but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | A Python notebook representative of our main experiment (Figure 2a) is available at https://github.com/chogba/wtsg-regression. |
| Open Datasets | Yes | We consider three regression datasets: ESOL, FreeSolv and Lipop. These datasets are part of the MoleculeNet [WRF+18] benchmark suite, and have been curated into train, test and validation splits by ChemBench [Wan20]. |
| Dataset Splits | Yes | These datasets are part of the MoleculeNet [WRF+18] benchmark suite, and have been curated into train, test and validation splits by ChemBench [Wan20]. |
| Hardware Specification | Yes | All our synthetic experiments were run on a personal MacBook Pro 2021 with an Apple M1 Pro Chip (10 CPU cores) and no GPUs. ... The experiments on MolBERT used 2 GPUs with 8 GB memory on an internal GPU cluster. |
| Software Dependencies | No | The paper mentions using the Adam optimizer [KB14] and models like BERT [DCLT18] but does not specify version numbers for these software components or other libraries used for the experiments. |
| Experiment Setup | Yes | All the gradient descent optimization procedures (pretraining tasks to obtain $h_w$, $h_s$, weak model finetuning, strong model finetuning on weak labels) used the Adam optimizer [KB14], with a batch size of 32, learning rate of $10^{-3}$ and 1000 epochs. |
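
To make the Experiment Setup row concrete, the following is a minimal sketch of a training loop that uses the reported hyperparameters (Adam, batch size 32, learning rate 1e-3, 1000 epochs). It is not the authors' code: the names `TrainConfig` and `adam_fit_linear` are hypothetical, the one-dimensional linear model stands in for the paper's actual models, and the Adam moment coefficients are the common defaults, which the table does not state.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the Experiment Setup row.
    batch_size: int = 32
    learning_rate: float = 1e-3
    epochs: int = 1000
    # Assumption: common Adam defaults; the table does not report these.
    beta1: float = 0.9
    beta2: float = 0.999
    eps: float = 1e-8


def adam_fit_linear(xs, ys, cfg, seed=0):
    """Fit y ~ w * x + b with minibatch Adam on squared error (illustrative)."""
    rng = random.Random(seed)
    params = [0.0, 0.0]  # [w, b]
    m = [0.0, 0.0]       # first-moment estimates
    v = [0.0, 0.0]       # second-moment estimates
    step = 0
    order = list(range(len(xs)))
    for _ in range(cfg.epochs):
        rng.shuffle(order)
        for start in range(0, len(order), cfg.batch_size):
            batch = order[start:start + cfg.batch_size]
            # Gradient of the mean squared error over the minibatch.
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = params[0] * xs[i] + params[1] - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            grads = [grad_w / len(batch), grad_b / len(batch)]
            step += 1
            for k in range(2):
                m[k] = cfg.beta1 * m[k] + (1.0 - cfg.beta1) * grads[k]
                v[k] = cfg.beta2 * v[k] + (1.0 - cfg.beta2) * grads[k] ** 2
                # Bias-corrected moment estimates.
                m_hat = m[k] / (1.0 - cfg.beta1 ** step)
                v_hat = v[k] / (1.0 - cfg.beta2 ** step)
                params[k] -= cfg.learning_rate * m_hat / (math.sqrt(v_hat) + cfg.eps)
    return params[0], params[1]
```

In a weak-to-strong setting, `ys` would hold the weak model's predictions rather than ground-truth labels; here any list of floats works, and `TrainConfig(epochs=...)` can be lowered for quick checks.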