Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Authors: Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeffrey Wu

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization." |
| Researcher Affiliation | Industry | Collin Burns\*, Pavel Izmailov\*, Jan Hendrik Kirchner\*, Bowen Baker\*, Leo Gao\*, Leopold Aschenbrenner\*, Yining Chen\*, Adrien Ecoffet\*, Manas Joglekar\*, Jan Leike, Ilya Sutskever, Jeff Wu\* (\*primary authors). Superalignment Generalization Team, OpenAI. Correspondence to: generalization@openai.com. |
| Pseudocode | No | The paper describes its methods in prose and equations but does not contain any structured pseudocode or labeled algorithm blocks. |
| Open Source Code | Yes | "Code is available at github.com/openai/weak-to-strong" |
| Open Datasets | Yes | "We consider 22 popular NLP classification datasets covering ethics, commonsense reasoning, natural language inference, sentiment analysis, etc. ... See Appendix Table 1 for a full list of the datasets and their sources. ... Chess puzzles. We use the dataset originally introduced in Schwarzschild et al. (2021b), which contains chess puzzles from the lichess.org website." URL: https://database.lichess.org/#puzzles |
| Dataset Splits | Yes | "To produce the weak labels, we split the original dataset in half. We ensure that related datapoints, e.g. datapoints that share the same question or premise, are always grouped together into the same half. Then, we train the weak supervisor model on the first half of the dataset, and use its predictions on the other half as the weak labels. ... In the weak-to-strong generalization experiments, we early stop training based on the accuracy with respect to the weak labels on a held-out validation set. ... We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles." |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for the experiments. It only mentions using "pretrained language models from the GPT-4 family (OpenAI, 2023)". |
| Software Dependencies | No | The paper mentions software components through citations (e.g., the "Adam optimizer (Kingma & Ba, 2014)" and the "sklearn.linear_model.LogisticRegression class (Pedregosa et al., 2011)"), but it does not specify version numbers for these dependencies, making it difficult to reproduce the exact software environment. |
| Experiment Setup | Yes | "We finetune all models for 2 epochs using a batch size of 32. ... We train (finetune) all models for 5 epochs using a batch size of 32. ... We train for 1 epoch with a batch size of 220. ... For training the linear probes, we use a batch size of 128, Adam optimizer (Kingma & Ba, 2014) and a learning rate of 10^-3. We run 20 epochs of training for ResNet-50 and 5 epochs for ViT-B/8." |
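The core experiment summarized in the Research Type row can be sketched schematically: label a training set with a weak supervisor, finetune a strong model on those weak labels, and compare teacher and student accuracy on gold labels. This is a framework-free sketch under the assumption that models are plain callables mapping an input to a predicted label; the actual experiments finetune GPT-4-family models, and the function names here are illustrative.

```python
# Schematic sketch of naive weak-to-strong finetuning (names are illustrative).
def accuracy(model, examples):
    """Fraction of examples where the model's prediction matches the gold label."""
    return sum(model(ex["x"]) == ex["y"] for ex in examples) / len(examples)

def weak_to_strong_run(weak_model, finetune_strong, train_inputs, test_set):
    """Label the training inputs with the weak supervisor, finetune the strong
    model on those weak labels, then report teacher vs. student gold accuracy."""
    weak_labeled = [{"x": x, "y": weak_model(x)} for x in train_inputs]
    strong_student = finetune_strong(weak_labeled)
    return accuracy(weak_model, test_set), accuracy(strong_student, test_set)
```

Weak-to-strong generalization is the observation that the second number (the student's gold accuracy) consistently exceeds the first (the teacher's), despite the student only ever seeing the teacher's imperfect labels.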
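The split protocol quoted in the Dataset Splits row (halve the dataset while keeping datapoints that share a question or premise in the same half, train the weak model on one half, and relabel the other half with its predictions) can be sketched as follows. The grouping key and the dict-based example format are assumptions for illustration, not the paper's actual data schema.

```python
# Minimal sketch of the weak-label split protocol (data schema is assumed).
from collections import defaultdict
import random

def split_by_group(examples, key, seed=0):
    """Split examples into two halves such that all examples sharing the same
    group key (e.g. the same question or premise) land in the same half."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    half = len(keys) // 2
    first = [ex for k in keys[:half] for ex in groups[k]]
    second = [ex for k in keys[half:] for ex in groups[k]]
    return first, second

def make_weak_labels(weak_model, held_out_half):
    """Replace gold labels on the held-out half with the weak model's predictions."""
    return [{**ex, "label": weak_model(ex)} for ex in held_out_half]
```

Grouping before shuffling prevents leakage: if two examples built from the same premise were split across the halves, the weak supervisor would effectively be labeling data it had trained on.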
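The Dataset Splits row also quotes the stopping rule: weak-to-strong finetuning is early-stopped on accuracy against the *weak* labels of a held-out validation set, since gold labels are assumed unavailable to the strong student. A minimal sketch of that rule, with an assumed patience-based stopping criterion (the paper's quote does not specify the exact criterion) and illustrative names:

```python
# Sketch of early stopping on weak-label validation accuracy
# (patience-based criterion and names are assumptions).
def weak_label_accuracy(model, val_set):
    """Fraction of held-out examples where the model agrees with the weak label."""
    return sum(model(ex) == ex["weak_label"] for ex in val_set) / len(val_set)

def train_with_early_stopping(model, train_step, val_set, max_steps, patience=3):
    """Run train_step repeatedly; stop once weak-label validation accuracy has
    not improved for `patience` consecutive evaluations. Returns best accuracy."""
    best_acc, bad_evals = -1.0, 0
    for _ in range(max_steps):
        train_step(model)
        acc = weak_label_accuracy(model, val_set)
        if acc > best_acc:
            best_acc, bad_evals = acc, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best_acc
```

Note the subtlety this rule inherits from the setup: the stopping signal itself is computed against imperfect weak labels, so "validation accuracy" here measures agreement with the supervisor, not correctness.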