Nash Learning from Human Feedback
Authors: Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J Mankowitz, Doina Precup, Bilal Piot
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. |
| Researcher Affiliation | Collaboration | 1Google DeepMind 2ENSAE Paris. Now at Cohere. Correspondence to: Remi Munos <remi.munos@inria.fr>, Michal Valko <michal.valko@inria.fr>, Daniele Calandriello <dcalandriello@google.com>, Bilal Piot <piot@google.com>. |
| Pseudocode | No | The paper describes algorithms mathematically and textually but does not include formal pseudocode blocks or sections labeled 'Algorithm'. |
| Open Source Code | No | The paper does not provide a statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In our experiments, we use the summarization dataset described in (Stiennon et al., 2020) that has been built from the TL;DR dataset (Völske et al., 2017). |
| Dataset Splits | Yes | We train our preference and reward models on the train set DTrain, that contains 92820 examples, and evaluate them on a test set of high confidence data DTest. |
| Hardware Specification | No | The paper mentions models like T5X-XL and PaLM 2 Large, but it does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions models and frameworks (e.g., T5X, PaLM 2), but it does not list specific software dependencies with their version numbers required for reproducibility. |
| Experiment Setup | Yes | We conducted a sweep across a set of values 0.01, 0.02, 0.05, 0.1, 0.2 for the parameter τ of the KL-regularization. The value τ = 0.05 has been selected for the pairwise comparison table below. ... All models are trained for 10000 steps. |
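
The "Experiment Setup" row reports a sweep over the KL-regularization strength τ. As a rough illustration of the trade-off that sweep explores, below is a minimal tabular sketch (not the paper's implementation): for each τ in the reported sweep it solves a KL-regularized preference objective against a fixed uniform reference policy. The response set, preference matrix, and closed-form softmax solution are illustrative assumptions; the paper trains large language models, not a tabular policy.

```python
import numpy as np

# Toy sketch only: a KL-regularized preference objective over a small
# discrete set of candidate responses, swept over the tau values quoted
# in the table above. The response set, preference matrix, and uniform
# reference policy are made-up assumptions for illustration.

rng = np.random.default_rng(0)
n_responses = 6

# P[i, j] = probability that response i is preferred to response j
# (a Bradley-Terry-style preference matrix built from random scores).
scores = rng.normal(size=n_responses)
P = 1.0 / (1.0 + np.exp(-(scores[:, None] - scores[None, :])))

rho = np.full(n_responses, 1.0 / n_responses)  # uniform reference policy

for tau in [0.01, 0.02, 0.05, 0.1, 0.2]:
    # Expected preference of each response against the reference policy.
    pref_vs_ref = P @ rho
    # Maximizing  E_pi[pref_vs_ref] - tau * KL(pi || rho)  has the
    # closed-form softmax solution below (linear objective + KL penalty).
    logits = np.log(rho) + pref_vs_ref / tau
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()

    kl = np.sum(pi * np.log(pi / rho))
    win_rate = pi @ P @ rho  # P(pi's sample beats the reference's sample)
    print(f"tau={tau:.2f}  win-rate vs reference={win_rate:.3f}  KL={kl:.3f}")
```

In this toy setting, smaller τ lets the optimized policy drift further from the reference (larger KL) in exchange for a higher preference win rate, which is the trade-off the reported sweep over τ is probing before fixing τ = 0.05.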