Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Learning to summarize with human feedback
Authors: Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts [63] and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles [22], producing summaries nearly as good as the human reference without any news-specific fine-tuning.2 We conduct extensive analyses to understand our human feedback dataset and fine-tuned models.3 |
| Researcher Affiliation | Industry | Nisan Stiennon Long Ouyang Jeff Wu Daniel M. Ziegler Ryan Lowe Chelsea Voss Alec Radford Dario Amodei Paul Christiano This was a joint project of the Open AI Reflection team. |
| Pseudocode | No | The paper describes the procedure in prose and with diagrams (Figure 2), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide inference code for our 1.3B models and baselines, as well as a model card and our human feedback dataset with over 64k summary comparisons, here. |
| Open Datasets | Yes | We use the TL;DR summarization dataset [63], which contains ~3 million posts from reddit.com across a variety of topics (subreddits), as well summaries of the posts written by the original poster (TL;DRs). ... We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores). |
| Dataset Splits | Yes | Our final filtered dataset contains 123,169 posts, and we hold out ~5% as a validation set. |
| Hardware Specification | No | One limitation of our work is the time and cost required to produce our final models. Notably, fine-tuning our 6.7B model with RL required approximately 320 GPU-days. |
| Software Dependencies | No | The paper does not provide specific software dependencies (library names with version numbers) used for the experiments. |
| Experiment Setup | Yes | We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token.8 We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy πRL φ with parameters φ and this original supervised model πSFT, as previously done in [25]. The full reward R can be written as: R(x, y) = rθ(x, y) β log[πRL φ (y|x)/πSFT(y|x)] ... For the PPO value function, we use a Transformer with completely separate parameters from the policy. ... In our final human evaluations, we use T=0 to sample from all models... |