Multi-turn Reinforcement Learning from Preference Human Feedback

Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Remi Munos

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal."
Researcher Affiliation | Collaboration | "1 Google Research. 2 Google DeepMind. 3 Tel Aviv University."
Pseudocode | Yes | Appendix E.2, Algorithm 1: mirror descent policy optimization; Appendix E.4, Algorithm 2: mixture mirror descent policy optimization. (A generic sketch of a KL-regularized mirror-descent update appears after this table.)
Open Source Code | No | The NeurIPS Paper Checklist in the supplementary material explicitly states: "Due to technical difficulties, we currently do not release code."
Open Datasets | Yes | "For reproducibility, and to further advance the research of the multi-turn setting, we openly release the data of Education Dialogue." ... "For this experiment, we utilize the LMRL-Gym [Abdulhai et al., 2023] Car Dealer environment, simulating a conversation where the agent (car dealer) aims to maximize the sale price."
Dataset Splits | No | The paper mentions an "independent evaluation set" for testing and discusses training steps, but does not explicitly describe a separate validation set or validation split for hyperparameter tuning.
Hardware Specification | Yes | "The agent and environment are modeled with T5 encoder-decoder models. Specifically, we use the T5-large (770M) and T5-XL (3B) models... For training, we use a configuration of 4×4 Tensor Processing Units (TPUs; Jouppi et al. [2023])... For prompted reward/preference models, we make use of the Flan-T5 XL (3B) [Chung et al., 2024]."
Software Dependencies | No | The paper mentions using T5 encoder-decoder models (T5-large, T5-XL, Flan-T5 XL) but does not provide version numbers for these models or for any supporting software libraries (e.g., PyTorch, TensorFlow, scikit-learn).
Experiment Setup | Yes | "A detailed list of hyperparameters is found in Appendix D." Table 4 ("Hyperparameters of all multi-turn algorithms") lists specific values for the KL regularization coefficient, mixing coefficient, batch size, GAE coefficient, policy learning delay, optimizer, optimizer decay, policy learning rate, and value learning rate. (A hypothetical configuration skeleton with these field names appears after this table.)
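
For orientation only, here is a minimal, self-contained sketch of a KL-regularized mirror-descent policy update in the tabular softmax case, assuming a fixed reference (anchor) policy and externally supplied value estimates. It is not the paper's Algorithm 1 or 2, which train T5 sequence policies from multi-turn preference feedback; the function name `mirror_descent_step` and the parameters `eta` and `alpha` are illustrative placeholders.

```python
import numpy as np

def mirror_descent_step(logits, q_values, ref_logits, eta, alpha):
    """One KL-regularized mirror-descent update for a softmax policy at a single state.

    Solves   max_pi  E_pi[Q] - alpha * KL(pi || ref) - (1/eta) * KL(pi || pi_t)
    in closed form:
        pi_{t+1}(a)  proportional to
        exp( (Q(a) + alpha * log ref(a) + (1/eta) * log pi_t(a)) / (alpha + 1/eta) ).

    logits     : current policy logits (log-probabilities up to an additive constant)
    q_values   : action-value / advantage estimates for this state (placeholders here)
    ref_logits : logits of the fixed reference (anchor) policy
    eta        : mirror-descent step size
    alpha      : weight of the KL term toward the reference policy
    """
    denom = alpha + 1.0 / eta
    new_logits = (q_values + alpha * ref_logits + logits / eta) / denom
    return new_logits - new_logits.max()  # shift for numerical stability

# Toy usage: four actions, uniform reference policy, placeholder value estimates.
logits = np.array([0.2, -0.1, 0.0, 0.4])
ref_logits = np.zeros(4)
q_values = np.array([0.1, 0.5, -0.2, 0.3])
new_logits = mirror_descent_step(logits, q_values, ref_logits, eta=1.0, alpha=0.1)
policy = np.exp(new_logits) / np.exp(new_logits).sum()
```

The paper's deep-RL variant operates on sequence-level policies with learned value estimates (note the GAE coefficient and value learning rate in Table 4), so the tabular form above should be read only as an illustration of the regularization structure, not as the authors' implementation.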
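Likewise, the skeleton below only mirrors the hyperparameter names quoted in the Experiment Setup row. The field names come from the paper's Table 4, but every default here is a placeholder rather than the value reported in Appendix D, and the class name `MultiTurnRLConfig` is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultiTurnRLConfig:
    """Hypothetical container for the hyperparameters named in Table 4.

    All defaults are placeholders (None); the actual values are reported in
    Appendix D of the paper and are not reproduced here.
    """
    kl_regularization_coeff: Optional[float] = None  # KL regularization coefficient
    mixing_coeff: Optional[float] = None             # mixing coefficient
    batch_size: Optional[int] = None
    gae_coeff: Optional[float] = None                # GAE coefficient
    policy_learning_delay: Optional[int] = None      # steps before policy updates begin
    optimizer: Optional[str] = None                  # optimizer name
    optimizer_decay: Optional[float] = None
    policy_learning_rate: Optional[float] = None
    value_learning_rate: Optional[float] = None
```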