Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Multi-turn Reinforcement Learning with Preference Human Feedback
Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Remi Munos
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal. |
| Researcher Affiliation | Collaboration | 1 Google Research. 2 Google DeepMind. 3 Tel Aviv University. |
| Pseudocode | Yes | Appendix E.2 Algorithm 1: mirror descent policy optimization... Appendix E.4 Algorithm 2: mixture mirror descent policy optimization |
| Open Source Code | No | The NeurIPS Paper Checklist in the supplementary material explicitly states: "Due to technical difficulties, we currently do not release code." |
| Open Datasets | Yes | For reproducibility, and to further advance the research of the multi-turn setting, we openly release the data of Education Dialogue. ... For this experiment, we utilize the LMRL-Gym [Abdulhai et al., 2023] Car Dealer environment, simulating a conversation where the agent (car dealer) aims to maximize the sale price. |
| Dataset Splits | No | The paper mentions using an "independent evaluation set" for testing and discusses training steps, but does not explicitly describe a separate "validation set" or "validation split" for hyperparameter tuning. |
| Hardware Specification | Yes | The agent and environment are modeled with T5 encoder-decoder models. Specifically, we use the T5-large (770M) and T5-XL (3B) models... For training, we use a configuration of 4×4 Tensor Processing Units (TPUs; Jouppi et al. [2023])... For prompted reward/preference models, we make use of the Flan-T5 XL (3B) [Chung et al., 2024]. |
| Software Dependencies | No | The paper mentions using T5 encoder-decoder models (T5-large, T5-XL, Flan-T5 XL) but does not provide specific version numbers for these models or any other software libraries (e.g., PyTorch, TensorFlow, scikit-learn). |
| Experiment Setup | Yes | A detailed list of hyperparameters is found in Appendix D. ... Table 4: Hyperparameters of all multi-turn algorithms. (includes specific values for KL regularization coefficient, mixing coefficient, batch size, GAE coefficient, policy learning delay, optimizer, optimizer decay, policy learning rate, value learning rate). |