Training language models to follow instructions with human feedback

Authors: Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
Researcher Affiliation | Industry | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Primary authors. This was a joint project of the OpenAI Alignment team. RL and JL are the team leads. Corresponding author: lowe@openai.com. Work done while at OpenAI. Current affiliations: AA: Anthropic; PC: Alignment Research Center.
Pseudocode | No | The paper describes the methodology in text and with a diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | No | Our prompt dataset consists primarily of text prompts submitted to a commercial language model API, as well as a small number of labeler-written prompts. These prompts are very diverse and include generation, question answering, dialog, summarization, extraction, and other natural language tasks (see Appendix A). Our dataset is over 96% English. We heuristically deduplicate prompts, and ensure that the validation and test sets contain no data from users whose data is in the training set. We also filter prompts containing personally identifiable information (PII).
Dataset Splits | No | We heuristically deduplicate prompts, and ensure that the validation and test sets contain no data from users whose data is in the training set. A sketch of this kind of user-level split appears below the table.
Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]; we provide some info on the amount of compute used in the Discussion section.
Software Dependencies | No | The paper describes the algorithms and models used (e.g., PPO, GPT-3) but does not list specific software dependencies with version numbers (e.g., programming language versions, library versions) required for reproduction.
Experiment Setup | Yes | Supervised fine-tuning (SFT). We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. [...] Reward modeling (RM). [...] we have labelers rank between K = 4 and K = 9 responses, and train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element, for computational efficiency (see Appendix D). [...] Reinforcement learning (RL). Again following Stiennon et al. (2020), we fine-tuned the SFT model using PPO (Schulman et al., 2017). [...] In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model. The value function is initialized from the RM.
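
As referenced in the Dataset Splits row, the paper deduplicates prompts heuristically and keeps validation/test data free of any user who appears in training. The sketch below is a minimal illustration of that idea, not the authors' pipeline; the record fields (`user_id`, `text`), the normalization heuristic, and the split fractions are all assumptions.

```python
# Minimal sketch (assumed field names, not the paper's code): heuristic
# near-duplicate removal followed by a user-level train/validation/test split,
# so no user contributes prompts to more than one split.
import hashlib
import random
from collections import defaultdict

def normalize(text: str) -> str:
    # Crude normalization for heuristic deduplication (assumption):
    # lowercase and collapse whitespace before hashing.
    return " ".join(text.lower().split())

def dedup_and_split(prompts, val_frac=0.05, test_frac=0.05, seed=0):
    """prompts: list of dicts like {"user_id": ..., "text": ...}."""
    # 1) Heuristic deduplication by normalized-text hash.
    seen, unique = set(), []
    for p in prompts:
        h = hashlib.sha256(normalize(p["text"]).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(p)

    # 2) Group prompts by user and split at the user level, so the
    #    validation/test sets share no users with the training set.
    by_user = defaultdict(list)
    for p in unique:
        by_user[p["user_id"]].append(p)
    users = sorted(by_user)
    random.Random(seed).shuffle(users)

    n_val = int(len(users) * val_frac)
    n_test = int(len(users) * test_frac)
    val_users = set(users[:n_val])
    test_users = set(users[n_val:n_val + n_test])

    splits = {"train": [], "valid": [], "test": []}
    for user, ps in by_user.items():
        if user in val_users:
            splits["valid"].extend(ps)
        elif user in test_users:
            splits["test"].extend(ps)
        else:
            splits["train"].extend(ps)
    return splits
```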
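
The Experiment Setup row summarizes the three-stage recipe: SFT, a reward model trained on all $\binom{K}{2}$ pairwise comparisons per prompt, and PPO fine-tuning with a per-token KL penalty against the SFT model. The PyTorch sketch below illustrates those two loss/reward pieces under stated assumptions; it is not the paper's implementation, and the tensor shapes, the `beta` value, and the convention that responses arrive ordered best-to-worst are illustrative.

```python
# Minimal PyTorch sketch of the RM ranking loss and the KL-shaped PPO reward.
# Assumptions: `rewards` holds scalar RM scores for K ranked responses to one
# prompt (best first); log-probs are per-token values for the sampled tokens
# under the PPO policy and the frozen SFT model.
import itertools
import torch
import torch.nn.functional as F

def rm_ranking_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise cross-entropy loss over all C(K, 2) comparisons for one prompt.

    rewards: shape (K,), RM scores ordered from most to least preferred,
    so rewards[i] is the labeler-preferred response in every pair (i, j), i < j.
    """
    pairs = list(itertools.combinations(range(rewards.shape[0]), 2))
    losses = [-F.logsigmoid(rewards[i] - rewards[j]) for i, j in pairs]
    # Average over the C(K, 2) comparisons from this prompt (one batch element).
    return torch.stack(losses).mean()

def kl_shaped_rewards(rm_score: torch.Tensor,
                      policy_logprobs: torch.Tensor,
                      sft_logprobs: torch.Tensor,
                      beta: float = 0.02) -> torch.Tensor:
    """Per-token rewards for PPO: a KL penalty against the SFT model at every
    token, with the scalar RM score added at the final token.

    policy_logprobs, sft_logprobs: shape (T,) log-probs of the sampled tokens.
    beta: KL coefficient (the default here is an arbitrary placeholder).
    """
    per_token = -beta * (policy_logprobs - sft_logprobs)
    per_token[-1] = per_token[-1] + rm_score
    return per_token

# Example usage with dummy tensors:
if __name__ == "__main__":
    scores = torch.randn(4)                      # K = 4 ranked responses
    print(rm_ranking_loss(scores))
    lp_pi, lp_sft = torch.randn(10), torch.randn(10)
    print(kl_shaped_rewards(torch.tensor(1.5), lp_pi, lp_sft))
```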