Learning to summarize with human feedback
Authors: Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts [63] and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles [22], producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. |
| Researcher Affiliation | Industry | Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. This was a joint project of the OpenAI Reflection team. |
| Pseudocode | No | The paper describes the procedure in prose and with diagrams (Figure 2), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide inference code for our 1.3B models and baselines, as well as a model card and our human feedback dataset with over 64k summary comparisons, here. |
| Open Datasets | Yes | We use the TL;DR summarization dataset [63], which contains ~3 million posts from reddit.com across a variety of topics (subreddits), as well as summaries of the posts written by the original poster (TL;DRs). ... We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores). |
| Dataset Splits | Yes | Our final filtered dataset contains 123,169 posts, and we hold out ~5% as a validation set. |
| Hardware Specification | No | One limitation of our work is the time and cost required to produce our final models. Notably, fine-tuning our 6.7B model with RL required approximately 320 GPU-days. |
| Software Dependencies | No | The paper does not provide specific software dependencies (library names with version numbers) used for the experiments. |
| Experiment Setup | Yes | We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token. We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy $\pi^{\mathrm{RL}}_\phi$ with parameters $\phi$ and this original supervised model $\pi^{\mathrm{SFT}}$, as previously done in [25]. The full reward $R$ can be written as: $R(x, y) = r_\theta(x, y) - \beta \log\left[\pi^{\mathrm{RL}}_\phi(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right]$ ... For the PPO value function, we use a Transformer with completely separate parameters from the policy. ... In our final human evaluations, we use T=0 to sample from all models... |
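
The Research Type row above summarizes the pipeline: collect human comparisons between summaries, train a reward model to predict which summary a human prefers, and optimize a policy against that reward model. A minimal sketch of the comparison-based reward-model objective, i.e. a logistic loss on the score difference between the preferred and the non-preferred summary (not the authors' code; tensor names are ours):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the human-preferred summary receives the
    higher reward-model score: -log sigmoid(r(x, y_pref) - r(x, y_rej)),
    averaged over the batch."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```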
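
The Experiment Setup row gives the KL-penalized reward $R(x, y) = r_\theta(x, y) - \beta \log\left[\pi^{\mathrm{RL}}_\phi(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right]$ that is maximized with PPO. A minimal sketch of that computation, assuming per-example summary log-probabilities (summed over BPE tokens) from the RL policy and the frozen supervised model; the function and argument names are hypothetical and the $\beta$ value is illustrative, not the paper's setting:

```python
import torch

def kl_penalized_reward(
    reward_model_score: torch.Tensor,  # r_theta(x, y), shape [batch]
    policy_logprob: torch.Tensor,      # log pi_RL(y|x), summed over tokens, shape [batch]
    sft_logprob: torch.Tensor,         # log pi_SFT(y|x), summed over tokens, shape [batch]
    beta: float = 0.05,                # KL coefficient (illustrative value)
) -> torch.Tensor:
    # R(x, y) = r_theta(x, y) - beta * (log pi_RL(y|x) - log pi_SFT(y|x))
    return reward_model_score - beta * (policy_logprob - sft_logprob)
```

The KL term discourages the RL policy from drifting too far from the supervised fine-tuned model that it was initialized from.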