Learning to summarize with human feedback
Authors: Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts [63] and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles [22], producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. |
| Researcher Affiliation | Industry | Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. This was a joint project of the OpenAI Reflection team. |
| Pseudocode | No | The paper describes the procedure in prose and with diagrams (Figure 2), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide inference code for our 1.3B models and baselines, as well as a model card and our human feedback dataset with over 64k summary comparisons, here. |
| Open Datasets | Yes | We use the TL;DR summarization dataset [63], which contains ~3 million posts from reddit.com across a variety of topics (subreddits), as well as summaries of the posts written by the original poster (TL;DRs). ... We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores). |
| Dataset Splits | Yes | Our final filtered dataset contains 123,169 posts, and we hold out ~5% as a validation set. |
| Hardware Specification | No | One limitation of our work is the time and cost required to produce our final models. Notably, fine-tuning our 6.7B model with RL required approximately 320 GPU-days. |
| Software Dependencies | No | The paper does not provide specific software dependencies (library names with version numbers) used for the experiments. |
| Experiment Setup | Yes | We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token. We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy $\pi^{\mathrm{RL}}_\phi$ with parameters $\phi$ and this original supervised model $\pi^{\mathrm{SFT}}$, as previously done in [25]. The full reward $R$ can be written as: $R(x, y) = r_\theta(x, y) - \beta \log\left[\pi^{\mathrm{RL}}_\phi(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right]$ ... For the PPO value function, we use a Transformer with completely separate parameters from the policy. ... In our final human evaluations, we use T=0 to sample from all models... |
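
The Research Type row above summarizes the pipeline: collect human comparisons between summaries, train a reward model to predict which summary a human prefers, and optimize a policy against that reward model. A minimal sketch of the comparison-based reward-model objective, i.e. a logistic loss on the score difference between the preferred and the non-preferred summary (not the authors' code; tensor names are ours):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the human-preferred summary receives the
    higher reward-model score: -log sigmoid(r(x, y_pref) - r(x, y_rej)),
    averaged over the batch."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```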
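
The Experiment Setup row gives the KL-penalized reward $R(x, y) = r_\theta(x, y) - \beta \log\left[\pi^{\mathrm{RL}}_\phi(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right]$ that is maximized with PPO. A minimal sketch of that computation, assuming per-example summary log-probabilities (summed over BPE tokens) from the RL policy and the frozen supervised model; the function and argument names are hypothetical and the $\beta$ value is illustrative, not the paper's setting:

```python
import torch

def kl_penalized_reward(
    reward_model_score: torch.Tensor,  # r_theta(x, y), shape [batch]
    policy_logprob: torch.Tensor,      # log pi_RL(y|x), summed over tokens, shape [batch]
    sft_logprob: torch.Tensor,         # log pi_SFT(y|x), summed over tokens, shape [batch]
    beta: float = 0.05,                # KL coefficient (illustrative value)
) -> torch.Tensor:
    # R(x, y) = r_theta(x, y) - beta * (log pi_RL(y|x) - log pi_SFT(y|x))
    return reward_model_score - beta * (policy_logprob - sft_logprob)
```

The KL term discourages the RL policy from drifting too far from the supervised fine-tuned model that it was initialized from.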