Multi-turn Reinforcement Learning from Preference Human Feedback

Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Remi Munos

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal."
Researcher Affiliation | Collaboration | "1 Google Research. 2 Google DeepMind. 3 Tel Aviv University."
Pseudocode | Yes | Appendix E.2, Algorithm 1: mirror descent policy optimization; Appendix E.4, Algorithm 2: mixture mirror descent policy optimization. (A generic sketch of a KL-regularized mirror-descent update appears after this table.)
Open Source Code | No | The NeurIPS Paper Checklist in the supplementary material explicitly states: "Due to technical difficulties, we currently do not release code."
Open Datasets | Yes | "For reproducibility, and to further advance the research of the multi-turn setting, we openly release the data of Education Dialogue." ... "For this experiment, we utilize the LMRL-Gym [Abdulhai et al., 2023] Car Dealer environment, simulating a conversation where the agent (car dealer) aims to maximize the sale price."
Dataset Splits | No | The paper mentions an "independent evaluation set" for testing and discusses training steps, but does not explicitly describe a separate validation set or validation split for hyperparameter tuning.
Hardware Specification | Yes | "The agent and environment are modeled with T5 encoder-decoder models. Specifically, we use the T5-large (770M) and T5-XL (3B) models... For training, we use a configuration of 4×4 Tensor Processing Units (TPUs; Jouppi et al. [2023])... For prompted reward/preference models, we make use of the Flan-T5 XL (3B) [Chung et al., 2024]."
Software Dependencies | No | The paper mentions using T5 encoder-decoder models (T5-large, T5-XL, Flan-T5 XL) but does not provide version numbers for these models or for any supporting software libraries (e.g., PyTorch, TensorFlow, scikit-learn).
Experiment Setup | Yes | "A detailed list of hyperparameters is found in Appendix D." Table 4 ("Hyperparameters of all multi-turn algorithms") lists specific values for the KL regularization coefficient, mixing coefficient, batch size, GAE coefficient, policy learning delay, optimizer, optimizer decay, policy learning rate, and value learning rate. (A hypothetical configuration skeleton with these field names appears after this table.)
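
For orientation only, here is a minimal, self-contained sketch of a KL-regularized mirror-descent policy update in the tabular softmax case, assuming a fixed reference (anchor) policy and externally supplied value estimates. It is not the paper's Algorithm 1 or 2, which train T5 sequence policies from multi-turn preference feedback; the function name `mirror_descent_step` and the parameters `eta` and `alpha` are illustrative placeholders.

```python
import numpy as np

def mirror_descent_step(logits, q_values, ref_logits, eta, alpha):
    """One KL-regularized mirror-descent update for a softmax policy at a single state.

    Solves   max_pi  E_pi[Q] - alpha * KL(pi || ref) - (1/eta) * KL(pi || pi_t)
    in closed form:
        pi_{t+1}(a)  proportional to
        exp( (Q(a) + alpha * log ref(a) + (1/eta) * log pi_t(a)) / (alpha + 1/eta) ).

    logits     : current policy logits (log-probabilities up to an additive constant)
    q_values   : action-value / advantage estimates for this state (placeholders here)
    ref_logits : logits of the fixed reference (anchor) policy
    eta        : mirror-descent step size
    alpha      : weight of the KL term toward the reference policy
    """
    denom = alpha + 1.0 / eta
    new_logits = (q_values + alpha * ref_logits + logits / eta) / denom
    return new_logits - new_logits.max()  # shift for numerical stability

# Toy usage: four actions, uniform reference policy, placeholder value estimates.
logits = np.array([0.2, -0.1, 0.0, 0.4])
ref_logits = np.zeros(4)
q_values = np.array([0.1, 0.5, -0.2, 0.3])
new_logits = mirror_descent_step(logits, q_values, ref_logits, eta=1.0, alpha=0.1)
policy = np.exp(new_logits) / np.exp(new_logits).sum()
```

The paper's deep-RL variant operates on sequence-level policies with learned value estimates (note the GAE coefficient and value learning rate in Table 4), so the tabular form above should be read only as an illustration of the regularization structure, not as the authors' implementation.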
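Likewise, the skeleton below only mirrors the hyperparameter names quoted in the Experiment Setup row. The field names come from the paper's Table 4, but every default here is a placeholder rather than the value reported in Appendix D, and the class name `MultiTurnRLConfig` is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultiTurnRLConfig:
    """Hypothetical container for the hyperparameters named in Table 4.

    All defaults are placeholders (None); the actual values are reported in
    Appendix D of the paper and are not reproduced here.
    """
    kl_regularization_coeff: Optional[float] = None  # KL regularization coefficient
    mixing_coeff: Optional[float] = None             # mixing coefficient
    batch_size: Optional[int] = None
    gae_coeff: Optional[float] = None                # GAE coefficient
    policy_learning_delay: Optional[int] = None      # steps before policy updates begin
    optimizer: Optional[str] = None                  # optimizer name
    optimizer_decay: Optional[float] = None
    policy_learning_rate: Optional[float] = None
    value_learning_rate: Optional[float] = None
```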