REBEL: Reinforcement Learning via Regressing Relative Rewards
Authors: Zhaolin Gao, Jonathan Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, Drew Bagnell, Jason D. Lee, Wen Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in Alpaca Eval 2.0, MTBench, and Open LLM Leaderboard. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. |
| Researcher Affiliation | Collaboration | Zhaolin Gao1, Jonathan D. Chang2 , Wenhao Zhan3, Owen Oertell1, Gokul Swamy4, Kianté Brantley5, Thorsten Joachims1, J. Andrew Bagnell4,6, Jason D. Lee3, Wen Sun1 1 Cornell University, 2 Databricks Mosaic Research, 3 Princeton University, 4 Carnegie Mellon University, 5 Harvard University, 6 Aurora Innovation |
| Pseudocode | Yes | Algorithm 1 REgression to RElative REward Based RL (REBEL); a minimal sketch of the core update appears after this table. |
| Open Source Code | Yes | Implementation of REBEL can be found at https://github.com/ZhaolinGao/REBEL, and models trained by REBEL can be found at https://huggingface.co/Cornell-AGI. |
| Open Datasets | Yes | "We use the TL;DR dataset (Stiennon et al., 2020)…"; "We adapt the setting from Zhu et al. (2023), using Open Chat-3.5 (Wang et al., 2024) as the base model, Starling-RM-7B-alpha (Zhu et al., 2023) as the reward model, and the Nectar dataset (Zhu et al., 2023)."; "We compare REBEL with DPO which is also trained for one epoch on the entire dataset with best-of-5 as y_w and worst-of-5 as y_l sampled from π_0. In other words, the training data used for the first iteration of REBEL is the same as the one we use for DPO. We follow the same evaluation methods as the previous section and include Arena Hard (AH) (Li et al., 2024) in our analysis." (Section 5.2.1); "Table 4: Dataset split, prompts, and maximum generation length for TL;DR summarization" |
| Dataset Splits | Yes | Table 4 (dataset split, prompts, and maximum generation length for TL;DR summarization) lists a Human Reference split of 117K / 6.45K / 6.55K (train / validation / test). |
| Hardware Specification | Yes | The 1.4B and 2.8B models are trained on 8 A6000 GPUs for one day and two days, respectively. The 6.9B model is trained on 8 H100 GPUs for two days. |
| Software Dependencies | No | The paper lists specific models and frameworks (e.g., Pythia, Llama-3-8B-Instruct, AdamW, LoRA) but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Appendix H.1.5 (Hyperparameter Details for TL;DR summarization) and Appendix H.2.4 (Hyperparameter Details for General Chat) and Appendix H.3.3 (Hyperparameter Details for Consistency Models) provide detailed information including batch size, learning rate, epochs, KL coefficient, η, LoRA configurations, and other specific parameters. |
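To give context for the pseudocode row above: Algorithm 1 reduces the policy update to a least-squares regression of relative log-probability ratios onto relative rewards. The snippet below is a minimal, hypothetical PyTorch sketch of that objective, not the authors' released implementation; the function and variable names (`rebel_loss`, `logp_new_a`, `eta`, etc.) are illustrative assumptions.

```python
# Minimal sketch of REBEL's core regression step (Algorithm 1), assuming
# per-sequence log-probabilities and scalar rewards are already computed.
# Names here are illustrative, not taken from the official repository.
import torch


def rebel_loss(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Regress relative policy log-ratios onto relative rewards.

    For a prompt x with two sampled completions (a, b), REBEL fits
        (1/eta) * [log pi_theta(a|x)/pi_t(a|x) - log pi_theta(b|x)/pi_t(b|x)]
    to the reward difference r(x, a) - r(x, b) via squared error.
    """
    pred = (1.0 / eta) * ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b))
    target = reward_a - reward_b
    return torch.mean((pred - target) ** 2)


# Usage sketch: logp_new_* are summed token log-probs under the current policy
# (with gradients), logp_old_* are from the previous-iteration policy
# (detached), and rewards come from a fixed reward model such as the ones
# used for TL;DR summarization or the Nectar-based general-chat setup.
```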