Nash Learning from Human Feedback

Authors: Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J Mankowitz, Doina Precup, Bilal Piot

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, ²ENSAE Paris; now at Cohere. Correspondence to: Remi Munos <remi.munos@inria.fr>, Michal Valko <michal.valko@inria.fr>, Daniele Calandriello <dcalandriello@google.com>, Bilal Piot <piot@google.com>. |
| Pseudocode | No | The paper describes algorithms mathematically and textually but does not include formal pseudocode blocks or sections labeled "Algorithm". |
| Open Source Code | No | The paper does not provide a statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In our experiments, we use the summarization dataset described in (Stiennon et al., 2020) that has been built from the TL;DR dataset (Völske et al., 2017). |
| Dataset Splits | Yes | We train our preference and reward models on the train set D_Train, that contains 92820 examples, and evaluate them on a test set of high-confidence data D_Test. |
| Hardware Specification | No | The paper mentions models like T5X-XL and PaLM 2 Large, but it does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions models and frameworks (e.g., T5X, PaLM 2), but it does not list specific software dependencies with their version numbers required for reproducibility. |
| Experiment Setup | Yes | We conducted a sweep across a set of values {0.01, 0.02, 0.05, 0.1, 0.2} for the parameter τ of the KL-regularization. The value τ = 0.05 has been selected for the pairwise comparison table below. ... All models are trained for 10000 steps. (A minimal sweep sketch follows the table.) |
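To make the Experiment Setup row concrete, here is a minimal, hypothetical Python sketch of the reported sweep over the KL-regularization strength τ with 10000 training steps per run. The function `train_with_kl_regularization` and its return value are placeholders for illustration only; the paper does not release code, and this is not the authors' implementation.

```python
# Hypothetical sketch of the tau sweep described in the Experiment Setup row.
# Assumptions: `train_with_kl_regularization` is a placeholder standing in for one
# KL-regularized training run; only the tau values, the selected tau = 0.05, and the
# 10000-step budget come from the paper's reported setup.

TAU_VALUES = [0.01, 0.02, 0.05, 0.1, 0.2]  # sweep over the KL-regularization parameter tau
NUM_STEPS = 10_000                          # all models are trained for 10000 steps


def train_with_kl_regularization(tau: float, num_steps: int) -> dict:
    """Placeholder for a single training run; returns a record of its configuration."""
    return {"tau": tau, "steps": num_steps}


if __name__ == "__main__":
    runs = [train_with_kl_regularization(tau, NUM_STEPS) for tau in TAU_VALUES]
    # tau = 0.05 is the value the authors report selecting for their pairwise comparison table.
    selected = next(run for run in runs if run["tau"] == 0.05)
    print(selected)
```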