Pretraining Language Models with Human Preferences
Authors: Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning a distribution over tokens conditional on their human preference scores given by a reward model. (A minimal sketch of conditional training follows the table.) |
| Researcher Affiliation | Collaboration | University of Sussex, New York University, FAR AI, Northeastern University, Anthropic. |
| Pseudocode | No | The paper describes the different pretraining objectives mathematically but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code and datasets accompanying the paper are available at github.com/tomekkorbak/pretraining-with-human-feedback |
| Open Datasets | Yes | For toxicity and PII, we prepared training data by subsampling 1.95M documents (totaling 3.32B tokens) from the Pile (Gao et al., 2020). For code generation, we subsampled 1.5M Python files (again totaling 3.32B tokens) from a cleaned and filtered version of the GitHub dataset from Google BigQuery released by Tunstall et al. (2022). |
| Dataset Splits | Yes | We sweep hyperparameters for each GLUE task based on the toxicity MLE-pretrained LM's dev set scores. ... We train each LM for each GLUE task for a maximum of 6 epochs with early stopping based on dev scores. |
| Hardware Specification | No | The paper mentions running experiments and references the compute-optimal scaling laws, but it does not specify any particular hardware components like GPU or CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions various software tools used (e.g., Detoxify, SpaCy, Scrubadub, pycodestyle) but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We keep the original hyperparameters of gpt2-small except for learning rate and batch size, which we tune for each task-objective pair based on train loss. If an objective has its own hyperparameters (e.g. t, α or β), we tune learning rate and batch size separately for each (t, α, β) configuration considered and then choose the best (t, α, β) configuration based on the misalignment score of LM samples and the KL divergence from GPT-3 (Section 4.1). See Appendix B for hyperparameters used in experiments and ablations on them. (A dominance-based selection sketch follows the table.) |
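The conditional training objective highlighted under Research Type lends itself to a short illustration. The sketch below is a hypothetical rendering of the idea, not the authors' implementation: the threshold, control-token names, and toy reward function are assumptions standing in for the paper's reward models (e.g., Detoxify for toxicity).

```python
# Illustrative sketch of conditional training: score each pretraining
# document with a reward model, prepend a control token based on that
# score, then train with ordinary MLE on the annotated corpus.
# The token names, threshold, and toy reward are assumptions.
from typing import Callable, Iterable, Iterator

GOOD_TOKEN = "<|good|>"  # hypothetical control-token names
BAD_TOKEN = "<|bad|>"

def annotate_corpus(
    docs: Iterable[str],
    reward_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Prefix each document with a control token reflecting its reward score."""
    for doc in docs:
        token = GOOD_TOKEN if reward_fn(doc) >= threshold else BAD_TOKEN
        yield token + doc

if __name__ == "__main__":
    # Toy reward model: penalize a blocklisted word (stand-in for e.g. Detoxify).
    toy_reward = lambda text: 0.0 if "toxic" in text else 1.0
    corpus = ["a perfectly fine sentence", "a toxic sentence"]
    for annotated in annotate_corpus(corpus, toy_reward):
        print(annotated)
    # Standard MLE pretraining then runs on the annotated corpus; at inference,
    # generation is conditioned on GOOD_TOKEN to steer toward preferred text.
```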
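The Experiment Setup row mentions choosing the best (t, α, β) configuration based on the misalignment score of LM samples and the KL divergence from GPT-3. One natural reading, consistent with the paper's Pareto-frontier framing, is dominance-based selection over these two objectives; the sketch below assumes that reading and is not the authors' stated procedure.

```python
# Hypothetical selection sketch: treat each (t, alpha, beta) configuration
# as a point (misalignment score, KL from GPT-3) and keep the configurations
# not dominated by any other. Dominance-based selection is an assumption
# consistent with the paper's Pareto framing, not its documented procedure.
from typing import List, Tuple

Point = Tuple[float, float]  # (misalignment score, KL from GPT-3); lower is better

def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the configurations not dominated by any other configuration."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    configs = [(0.02, 1.3), (0.05, 0.9), (0.02, 1.5), (0.08, 0.8)]
    print(pareto_front(configs))  # [(0.02, 1.3), (0.05, 0.9), (0.08, 0.8)]
```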