Learning to summarize with human feedback

Authors: Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts [63] and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles [22], producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. (See the reward-model loss sketch after the table.)
Researcher Affiliation | Industry | Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. This was a joint project of the OpenAI Reflection team.
Pseudocode | No | The paper describes the procedure in prose and with diagrams (Figure 2), but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide inference code for our 1.3B models and baselines, as well as a model card and our human feedback dataset with over 64k summary comparisons, here.
Open Datasets | Yes | We use the TL;DR summarization dataset [63], which contains ~3 million posts from reddit.com across a variety of topics (subreddits), as well as summaries of the posts written by the original poster (TL;DRs). ... We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores).
Dataset Splits | Yes | Our final filtered dataset contains 123,169 posts, and we hold out ~5% as a validation set.
Hardware Specification | No | One limitation of our work is the time and cost required to produce our final models. Notably, fine-tuning our 6.7B model with RL required approximately 320 GPU-days.
Software Dependencies | No | The paper does not provide specific software dependencies (library names with version numbers) used for the experiments.
Experiment Setup | Yes | We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token. We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy π^RL_φ with parameters φ and this original supervised model π^SFT, as previously done in [25]. The full reward R can be written as: R(x, y) = r_θ(x, y) − β log[π^RL_φ(y|x) / π^SFT(y|x)] ... For the PPO value function, we use a Transformer with completely separate parameters from the policy. ... In our final human evaluations, we use T=0 to sample from all models... (See the KL-penalized reward sketch after the table.)
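
The "Research Type" row above quotes the core method: train a reward model on pairwise human comparisons between summaries, then use it as the reward for RL fine-tuning. Below is a minimal sketch of the pairwise preference loss that this implies, assuming a PyTorch setup; the tiny `RewardModel` is an illustrative stand-in for the paper's GPT-style Transformer with a scalar head, and all names are hypothetical rather than taken from the released code.

```python
# Minimal sketch of a pairwise reward-model loss: train the model to give a
# higher scalar score to the human-preferred summary than to the rejected one.
# The bag-of-embeddings "RewardModel" below is a stand-in for the paper's
# Transformer with a scalar head; names and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, 1)  # scalar reward head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) ids for the post + summary concatenated
        h = self.embed(tokens).mean(dim=1)  # crude pooling stand-in for a Transformer
        return self.head(h).squeeze(-1)     # (batch,) scalar scores r_theta(x, y)


def preference_loss(model: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on the score gap: -log sigmoid(r(preferred) - r(rejected))."""
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    # Fake batch of tokenized (post + summary) pairs, for illustration only.
    preferred = torch.randint(0, 1000, (4, 32))
    rejected = torch.randint(0, 1000, (4, 32))
    loss = preference_loss(model, preferred, rejected)
    loss.backward()
    print(f"pairwise preference loss: {loss.item():.4f}")
```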
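
The "Experiment Setup" row gives the KL-penalized reward R(x, y) = r_θ(x, y) − β log[π^RL_φ(y|x) / π^SFT(y|x)]. The sketch below shows one plausible way to assemble that quantity per BPE token, assuming per-token log-probabilities from the RL policy and the frozen supervised model are already available; the function name and the β value are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the KL-penalized reward used for PPO fine-tuning:
# R(x, y) = r_theta(x, y) - beta * log[pi_RL(y|x) / pi_SFT(y|x)],
# accumulated per BPE token, with the reward-model score added at the final
# token. Inputs are assumed to come from the policy, the frozen supervised
# model, and the reward model; beta here is an illustrative KL coefficient,
# not the paper's exact value.
import torch


def kl_penalized_rewards(
    policy_logprobs: torch.Tensor,      # (batch, T) log pi_RL(y_t | x, y_<t)
    sft_logprobs: torch.Tensor,         # (batch, T) log pi_SFT(y_t | x, y_<t)
    reward_model_scores: torch.Tensor,  # (batch,) r_theta(x, y) for whole summaries
    beta: float = 0.05,                 # illustrative KL coefficient
) -> torch.Tensor:
    """Per-token rewards: -beta * (log pi_RL - log pi_SFT), plus r_theta at the end."""
    per_token = -beta * (policy_logprobs - sft_logprobs)
    per_token[:, -1] += reward_model_scores
    return per_token  # (batch, T), fed to the PPO advantage estimator


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, T = 2, 8
    policy_lp = torch.randn(batch, T).clamp(max=0)  # fake log-probs for illustration
    sft_lp = torch.randn(batch, T).clamp(max=0)
    scores = torch.randn(batch)                     # fake reward-model scores
    print(kl_penalized_rewards(policy_lp, sft_lp, scores))
```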