Quantifying the Gain in Weak-to-Strong Generalization

Authors: Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our theoretical findings through various empirical assessments. ... We validate our characterization of the gain in weak-to-strong generalization through various experiments (Section 5) on synthetic and real-world data.
Researcher Affiliation | Collaboration | Moses Charikar (Stanford University, moses@cs.stanford.edu), Chirag Pabbaraju (Stanford University, cpabbara@cs.stanford.edu), Kirankumar Shiragur (Microsoft Research, kshiragur@microsoft.com)
Pseudocode | No | The paper contains mathematical derivations and descriptions of procedures but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | A Python notebook representative of our main experiment (Figure 2a) is available at https://github.com/chogba/wtsg-regression.
Open Datasets | Yes | We consider three regression datasets: ESOL, FreeSolv and Lipop. These datasets are part of the MoleculeNet [WRF+18] benchmark suite, and have been curated into train, test and validation splits by ChemBench [Wan20].
Dataset Splits | Yes | These datasets are part of the MoleculeNet [WRF+18] benchmark suite, and have been curated into train, test and validation splits by ChemBench [Wan20].
Hardware Specification | Yes | All our synthetic experiments were run on a personal MacBook Pro 2021 with an Apple M1 Pro chip (10 CPU cores) and no GPUs. ... The experiments on MolBERT used 2 GPUs with 8 GB memory on an internal GPU cluster.
Software Dependencies | No | The paper mentions using the Adam optimizer [KB14] and models like BERT [DCLT18] but does not specify version numbers for these software components or other libraries used for the experiments.
Experiment Setup | Yes | All the gradient descent optimization procedures (pretraining tasks to obtain h_w, h_s, weak model finetuning, strong model finetuning on weak labels) used the Adam optimizer [KB14], with a batch size of 32, a learning rate of 10^-3, and 1000 epochs.
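The Open Datasets and Dataset Splits rows name ESOL, FreeSolv and Lipop from the MoleculeNet benchmark, with splits curated by ChemBench. As an illustrative sketch only, the snippet below pulls the same three datasets through DeepChem's MoleculeNet loaders; DeepChem and its random splitter are assumptions made here for illustration, not the ChemBench pipeline the paper used.

```python
# Hedged sketch: load the three regression datasets named in the paper
# (ESOL, FreeSolv, Lipophilicity) via DeepChem's MoleculeNet loaders.
# The paper itself relies on splits curated by ChemBench [Wan20]; DeepChem
# is used here only to illustrate obtaining train/valid/test splits.
import deepchem as dc

loaders = {
    "ESOL": dc.molnet.load_delaney,       # aqueous solubility (ESOL)
    "FreeSolv": dc.molnet.load_freesolv,  # hydration free energy
    "Lipop": dc.molnet.load_lipo,         # lipophilicity (logD)
}

for name, load_fn in loaders.items():
    # Each loader returns (task names, (train, valid, test), transformers).
    tasks, (train, valid, test), transformers = load_fn(
        featurizer="ECFP", splitter="random"
    )
    print(name, len(train), len(valid), len(test))
```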
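The Software Dependencies row notes that no library versions are reported. Below is a minimal sketch of recording the environment alongside results; the package names listed (torch, transformers, numpy) are assumptions about a typical setup, not packages confirmed by the paper.

```python
# Hedged sketch: dump interpreter, platform, and package versions to JSON so a
# run can be reproduced later. The package list is an assumption, not taken
# from the paper.
import importlib.metadata as md
import json, platform, sys

def pkg_version(name: str) -> str:
    try:
        return md.version(name)
    except md.PackageNotFoundError:
        return "not installed"

packages = ["torch", "transformers", "numpy"]
env = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {p: pkg_version(p) for p in packages},
}
print(json.dumps(env, indent=2))
```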
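The Experiment Setup row fixes the optimization hyperparameters (Adam, batch size 32, learning rate 10^-3, 1000 epochs). The sketch below wires those values into a generic PyTorch regression loop; the toy MLP and random data are placeholders, not the paper's weak or strong models.

```python
# Hedged sketch of the quoted optimization setup: Adam, batch size 32,
# learning rate 1e-3, 1000 epochs. The tiny MLP and the random regression
# data are placeholders standing in for the paper's models and datasets.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(512, 16)   # placeholder features
y = torch.randn(512, 1)    # placeholder regression targets
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(1000):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```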