Multi-turn Reinforcement Learning from Preference Human Feedback
Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Remi Munos
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal. |
| Researcher Affiliation | Collaboration | 1 Google Research. 2 Google Deep Mind. 3 Tel Aviv University. |
| Pseudocode | Yes | Appendix E.2 Algorithm 1: mirror descent policy optimization... Appendix E.4 Algorithm 2: mixture mirror descent policy optimization (an illustrative sketch of a mirror-descent-style update appears after this table) |
| Open Source Code | No | The NeurIPS Paper Checklist in the supplementary material explicitly states: "Due to technical difficulties, we currently do not release code." |
| Open Datasets | Yes | For reproducibility, and to further advance the research of the multi-turn setting, we openly release the data of Education Dialogue. ... For this experiment, we utilize the LMRL-Gym [Abdulhai et al., 2023] Car Dealer environment, simulating a conversation where the agent (car dealer) aims to maximize the sale price. |
| Dataset Splits | No | The paper mentions using an "independent evaluation set" for testing and discusses training steps, but does not explicitly describe a separate "validation set" or "validation split" for hyperparameter tuning. |
| Hardware Specification | Yes | The agent and environment are modeled with T5 encoder-decoder models. Specifically, we use the T5-large (770M) and T5-XL (3B) models... For training, we use a configuration of 4×4 Tensor Processing Units (TPUs; Jouppi et al. [2023])... For prompted reward/preference models, we make use of the Flan-T5 XL (3B) [Chung et al., 2024]. (a hedged loading sketch for the prompted preference model appears after this table) |
| Software Dependencies | No | The paper mentions using T5 encoder-decoder models (T5-large, T5-XL, Flan-T5 XL) but does not provide specific version numbers for these models or any other software libraries (e.g., PyTorch, TensorFlow, scikit-learn). |
| Experiment Setup | Yes | A detailed list of hyperparameters is found in Appendix D. ... Table 4: Hyperparameters of all multi-turn algorithms. (includes specific values for KL regularization coefficient, mixing coefficient, batch size, GAE coefficient, policy learning delay, optimizer, optimizer decay, policy learning rate, value learning rate; a configuration sketch listing these fields appears below the table) |
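
The Pseudocode row points to Appendix E.2 (Algorithm 1, mirror descent policy optimization) and Appendix E.4 (Algorithm 2, mixture mirror descent policy optimization). The exact updates live in those appendices and are not reproduced here; the following is only a minimal PyTorch sketch of the general idea behind a mirror-descent-style policy step, i.e. a policy-gradient term regularized by a KL divergence to the previous policy iterate. The function name, tensor shapes, and the `beta` coefficient are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mirror_descent_policy_loss(logits_new, logits_ref, actions, advantages, beta=0.1):
    """Illustrative mirror-descent-style policy loss (not the paper's Algorithm 1).

    logits_new: (batch, vocab) logits of the policy being optimized.
    logits_ref: (batch, vocab) logits of the previous policy iterate (reference).
    actions:    (batch,) indices of the sampled actions/tokens.
    advantages: (batch,) advantage estimates, e.g. from GAE.
    beta:       KL regularization coefficient (hypothetical value).
    """
    logp_new = F.log_softmax(logits_new, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)

    # Log-probability of the actions that were actually taken.
    logp_actions = logp_new.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Policy-gradient term: increase the probability of high-advantage actions.
    pg_loss = -(advantages * logp_actions).mean()

    # KL(pi_new || pi_ref): the mirror-descent term keeping the update close
    # to the previous policy iterate.
    kl = (logp_new.exp() * (logp_new - logp_ref)).sum(dim=-1).mean()

    return pg_loss + beta * kl

# Toy usage with random tensors, just to show the expected shapes.
batch, vocab = 8, 32
loss = mirror_descent_policy_loss(
    torch.randn(batch, vocab), torch.randn(batch, vocab),
    torch.randint(vocab, (batch,)), torch.randn(batch), beta=0.1,
)
```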
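
The Hardware Specification row names Flan-T5 XL (3B) as a prompted reward/preference model. The paper does not say which framework, checkpoint, or prompt was used; the sketch below only illustrates the idea of a prompted pairwise preference judge, assuming the publicly available Hugging Face `google/flan-t5-xl` checkpoint and a made-up prompt.

```python
# Minimal sketch of a prompted pairwise preference judge.
# Assumes the public Hugging Face checkpoint "google/flan-t5-xl"; the paper does
# not specify the framework, checkpoint, or prompt actually used.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

prompt = (
    "Two teacher-student dialogues on the same topic are shown.\n"
    "Dialogue A: <conversation A>\n"
    "Dialogue B: <conversation B>\n"
    "Which dialogue teaches the topic better? Answer with A or B."
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=4)
preference = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(preference)  # expected to be something like "A" or "B"
```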
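
The Experiment Setup row lists the hyperparameter names from the paper's Table 4 without their values (those are reported in Appendix D). Below is a minimal configuration sketch that mirrors those fields; the values are deliberately left unset rather than guessed, and the field names and comments are interpretations, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultiTurnRLHyperparams:
    """Fields mirroring the hyperparameter names in the paper's Table 4.

    Values are intentionally left as None: the actual numbers are reported in
    Appendix D of the paper and are not reproduced here.
    """
    kl_regularization_coefficient: Optional[float] = None
    mixing_coefficient: Optional[float] = None   # used by the mixture variant (Algorithm 2)
    batch_size: Optional[int] = None
    gae_coefficient: Optional[float] = None      # lambda of generalized advantage estimation
    policy_learning_delay: Optional[int] = None  # steps before policy updates begin (interpretation assumed)
    optimizer: Optional[str] = None
    optimizer_decay: Optional[float] = None
    policy_learning_rate: Optional[float] = None
    value_learning_rate: Optional[float] = None
```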