REBEL: Reinforcement Learning via Regressing Relative Rewards
Authors: Zhaolin Gao, Jonathan Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, Drew Bagnell, Jason D. Lee, Wen Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in Alpaca Eval 2.0, MTBench, and Open LLM Leaderboard. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. |
| Researcher Affiliation | Collaboration | Zhaolin Gao1, Jonathan D. Chang2 , Wenhao Zhan3, Owen Oertell1, Gokul Swamy4, Kianté Brantley5, Thorsten Joachims1, J. Andrew Bagnell4,6, Jason D. Lee3, Wen Sun1 1 Cornell University, 2 Databricks Mosaic Research, 3 Princeton University, 4 Carnegie Mellon University, 5 Harvard University, 6 Aurora Innovation |
| Pseudocode | Yes | Algorithm 1 REgression to RElative REward Based RL (REBEL); a minimal sketch of the core update appears after this table. |
| Open Source Code | Yes | Implementation of REBEL can be found at https://github.com/ZhaolinGao/REBEL, and models trained by REBEL can be found at https://huggingface.co/Cornell-AGI. |
| Open Datasets | Yes | "We use the TL;DR dataset (Stiennon et al., 2020)…"; "We adapt the setting from Zhu et al. (2023), using Open Chat-3.5 (Wang et al., 2024) as the base model, Starling-RM-7B-alpha (Zhu et al., 2023) as the reward model, and the Nectar dataset (Zhu et al., 2023)."; "We compare REBEL with DPO which is also trained for one epoch on the entire dataset with best-of-5 as y_w and worst-of-5 as y_l sampled from π_0. In other words, the training data used for the first iteration of REBEL is the same as the one we use for DPO. We follow the same evaluation methods as the previous section and include Arena Hard (AH) (Li et al., 2024) in our analysis." (Section 5.2.1); "Table 4: Dataset split, prompts, and maximum generation length for TL;DR summarization" |
| Dataset Splits | Yes | Table 4 (dataset split, prompts, and maximum generation length for TL;DR summarization) lists a Human Reference split of 117K / 6.45K / 6.55K (train / validation / test). |
| Hardware Specification | Yes | The 1.4B and 2.8B models are trained on 8 A6000 GPUs for one day and two days, respectively. The 6.9B model is trained on 8 H100 GPUs for two days. |
| Software Dependencies | No | The paper lists specific models and frameworks (e.g., Pythia, Llama-3-8B-Instruct, AdamW, LoRA) but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Appendix H.1.5 (Hyperparameter Details for TL;DR summarization) and Appendix H.2.4 (Hyperparameter Details for General Chat) and Appendix H.3.3 (Hyperparameter Details for Consistency Models) provide detailed information including batch size, learning rate, epochs, KL coefficient, η, LoRA configurations, and other specific parameters. |
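To give context for the pseudocode row above: Algorithm 1 reduces the policy update to a least-squares regression of relative log-probability ratios onto relative rewards. The snippet below is a minimal, hypothetical PyTorch sketch of that objective, not the authors' released implementation; the function and variable names (`rebel_loss`, `logp_new_a`, `eta`, etc.) are illustrative assumptions.

```python
# Minimal sketch of REBEL's core regression step (Algorithm 1), assuming
# per-sequence log-probabilities and scalar rewards are already computed.
# Names here are illustrative, not taken from the official repository.
import torch


def rebel_loss(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Regress relative policy log-ratios onto relative rewards.

    For a prompt x with two sampled completions (a, b), REBEL fits
        (1/eta) * [log pi_theta(a|x)/pi_t(a|x) - log pi_theta(b|x)/pi_t(b|x)]
    to the reward difference r(x, a) - r(x, b) via squared error.
    """
    pred = (1.0 / eta) * ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b))
    target = reward_a - reward_b
    return torch.mean((pred - target) ** 2)


# Usage sketch: logp_new_* are summed token log-probs under the current policy
# (with gradients), logp_old_* are from the previous-iteration policy
# (detached), and rewards come from a fixed reward model such as the ones
# used for TL;DR summarization or the Nectar-based general-chat setup.
```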