Nash Learning from Human Feedback

Authors: Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J Mankowitz, Doina Precup, Bilal Piot

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, ²ENSAE Paris; now at Cohere. Correspondence to: Remi Munos <remi.munos@inria.fr>, Michal Valko <michal.valko@inria.fr>, Daniele Calandriello <dcalandriello@google.com>, Bilal Piot <piot@google.com>. |
| Pseudocode | No | The paper describes algorithms mathematically and textually but does not include formal pseudocode blocks or sections labeled "Algorithm". |
| Open Source Code | No | The paper does not provide a statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In our experiments, we use the summarization dataset described in (Stiennon et al., 2020) that has been built from the TL;DR dataset (Völske et al., 2017). |
| Dataset Splits | Yes | We train our preference and reward models on the train set D_Train, that contains 92820 examples, and evaluate them on a test set of high-confidence data D_Test. |
| Hardware Specification | No | The paper mentions models like T5X-XL and PaLM 2 Large, but it does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions models and frameworks (e.g., T5X, PaLM 2), but it does not list specific software dependencies with their version numbers required for reproducibility. |
| Experiment Setup | Yes | We conducted a sweep across a set of values {0.01, 0.02, 0.05, 0.1, 0.2} for the parameter τ of the KL-regularization. The value τ = 0.05 has been selected for the pairwise comparison table below. ... All models are trained for 10000 steps. (A minimal sweep sketch follows the table.) |
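To make the Experiment Setup row concrete, here is a minimal, hypothetical Python sketch of the reported sweep over the KL-regularization strength τ with 10000 training steps per run. The function `train_with_kl_regularization` and its return value are placeholders for illustration only; the paper does not release code, and this is not the authors' implementation.

```python
# Hypothetical sketch of the tau sweep described in the Experiment Setup row.
# Assumptions: `train_with_kl_regularization` is a placeholder standing in for one
# KL-regularized training run; only the tau values, the selected tau = 0.05, and the
# 10000-step budget come from the paper's reported setup.

TAU_VALUES = [0.01, 0.02, 0.05, 0.1, 0.2]  # sweep over the KL-regularization parameter tau
NUM_STEPS = 10_000                          # all models are trained for 10000 steps


def train_with_kl_regularization(tau: float, num_steps: int) -> dict:
    """Placeholder for a single training run; returns a record of its configuration."""
    return {"tau": tau, "steps": num_steps}


if __name__ == "__main__":
    runs = [train_with_kl_regularization(tau, NUM_STEPS) for tau in TAU_VALUES]
    # tau = 0.05 is the value the authors report selecting for their pairwise comparison table.
    selected = next(run for run in runs if run["tau"] == 0.05)
    print(selected)
```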