Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Nash Learning from Human Feedback
Authors: Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J Mankowitz, Doina Precup, Bilal Piot
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. |
| Researcher Affiliation | Collaboration | 1Google DeepMind 2ENSAE Paris. Now at Cohere. Correspondence to: Remi Munos <EMAIL>, Michal Valko <EMAIL>, Daniele Calandriello <EMAIL>, Bilal Piot <EMAIL>. |
| Pseudocode | No | The paper describes algorithms mathematically and textually but does not include formal pseudocode blocks or sections labeled 'Algorithm'. |
| Open Source Code | No | The paper does not provide a statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In our experiments, we use the summarization dataset described in (Stiennon et al., 2020) that has been built from the TL;DR dataset (Völske et al., 2017). |
| Dataset Splits | Yes | We train our preference and reward models on the train set DTrain, that contains 92820 examples, and evaluate them on a test set of high confidence data DTest. |
| Hardware Specification | No | The paper mentions models like T5X-XL and PaLM 2 Large, but it does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions models and frameworks (e.g., T5X, PaLM 2), but it does not list specific software dependencies with their version numbers required for reproducibility. |
| Experiment Setup | Yes | We conducted a sweep across a set of values 0.01, 0.02, 0.05, 0.1, 0.2 for the parameter τ of the KL-regularization. The value τ = 0.05 has been selected for the pairwise comparison table below. ... All models are trained for 10000 steps. |
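The role of the KL-regularization coefficient τ in the experiment setup above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the toy distributions, and the use of a single scalar reward are assumptions made purely for illustration of how a τ sweep trades preference reward against divergence from a reference policy.

```python
import math

def kl_regularized_score(reward, pi, mu, tau):
    """Hypothetical helper: reward penalized by tau * KL(pi || mu)
    for discrete distributions pi and mu over the same support."""
    kl = sum(p * math.log(p / q) for p, q in zip(pi, mu) if p > 0)
    return reward - tau * kl

# Toy sweep over the tau values reported in the table above.
pi = [0.7, 0.2, 0.1]       # hypothetical trained policy over 3 responses
mu = [1 / 3, 1 / 3, 1 / 3]  # uniform reference policy
for tau in [0.01, 0.02, 0.05, 0.1, 0.2]:
    print(f"tau={tau}: score={kl_regularized_score(1.0, pi, mu, tau):.4f}")
```

Larger τ values penalize deviation from the reference policy more strongly, so the regularized score decreases monotonically across the sweep for a fixed policy.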