ODIN: Disentangled Reward Mitigates Hacking in RLHF
Authors: Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in RL on mitigating length bias. Experiments demonstrate that our approach eliminates the reward correlation with length, and improves the obtained policy by a significant margin. (See the length-decorrelation sketch after the table.) |
| Researcher Affiliation | Collaboration | (1) University of Maryland, College Park; (2) Meta, work done while at Nvidia; (3) Nvidia. |
| Pseudocode | Yes | Algorithm 1 Proximal Policy Optimization for RLHF (see the PPO sketch after the table). |
| Open Source Code | No | The paper does not provide an explicit statement or link confirming the release of their source code for the described methodology. |
| Open Datasets | Yes | We use the Open Assistant dataset (Köpf et al., 2023) |
| Dataset Splits | Yes | We tried different learning rates from {1e-5, 3e-5, 5e-5} with batch size 128 for tuning both the baseline RM and ODIN on 22k preference data for 3 epochs, and picked the one with the highest validation accuracy for both. |
| Hardware Specification | Yes | All experiments are implemented with DeepSpeed-Chat (Yao et al., 2023) and Huggingface Transformers (Wolf et al., 2020), running on 8 NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like DeepSpeed-Chat and Huggingface Transformers but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We search η ∈ {5e-7, 1e-6, 2e-6}, ϵ ∈ {0.1, 0.2, 0.4}, β ∈ {2.5e-3, 5e-3, 1e-2, 2e-2}, c ∈ {inf, 2, 4}, and N ∈ {32, 64, 256}. Note we did not finish all experiments with β = 2.5e-3, but we have included the partial results in the plots when β = 2.5e-3 is not explicitly excluded. (The grid is enumerated in the sketch after the table.) |
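
The Research Type row quotes the paper's claim that the trained reward no longer correlates with response length. The snippet below is a minimal sketch of one way such a decorrelation term can be written as a batch-level penalty; the function name, the squared Pearson-correlation form, and the idea of applying it to a single "quality" head are illustrative assumptions, not the paper's exact loss.

```python
import torch

def length_decorrelation_loss(quality_rewards: torch.Tensor,
                              response_lengths: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """Squared Pearson correlation between a reward head's outputs and the
    response lengths in a batch. Minimizing it pushes that head to score
    responses independently of their length. Both inputs are 1-D tensors
    of the same size."""
    q = quality_rewards - quality_rewards.mean()
    l = response_lengths.float() - response_lengths.float().mean()
    corr = (q * l).sum() / (q.norm() * l.norm() + eps)
    return corr ** 2
```

In practice a term like this would be added to the usual pairwise ranking loss when training the reward model; the paper's actual formulation (and how its disentangled heads interact) should be taken from the paper itself.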
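
The Pseudocode row refers to Algorithm 1 (Proximal Policy Optimization for RLHF). For orientation only, here is a minimal, self-contained sketch of the two quantities such an algorithm typically optimizes: the clipped surrogate objective and the KL-penalized reward. The function names, and the mapping of ϵ to the clip range and β to the KL coefficient, are assumptions; Algorithm 1 in the paper is the authoritative procedure.

```python
import torch

def ppo_clipped_objective(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate loss over one minibatch of samples."""
    ratio = torch.exp(logp_new - logp_old)            # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # negate: optimizers minimize

def kl_penalized_reward(rm_score: torch.Tensor,
                        logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        beta: float = 5e-3) -> torch.Tensor:
    """Reward signal commonly used in PPO-based RLHF: the reward-model score
    minus a KL-style penalty keeping the policy near the reference model."""
    return rm_score - beta * (logp_policy - logp_ref)
```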
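
The Experiment Setup row quotes a hyperparameter grid. The sketch below simply enumerates that grid into concrete configurations; the key names are hypothetical labels for the paper's symbols (eta: learning rate, eps: PPO clip range, beta: KL coefficient, c: reward clipping, N: rollout batch size), while the values themselves are quoted from the paper.

```python
from itertools import product

# Search grid quoted in the Experiment Setup row; key names are guesses
# for what each symbol controls.
grid = {
    "eta":  [5e-7, 1e-6, 2e-6],
    "eps":  [0.1, 0.2, 0.4],
    "beta": [2.5e-3, 5e-3, 1e-2, 2e-2],
    "c":    [float("inf"), 2, 4],
    "N":    [32, 64, 256],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 3 * 4 * 3 * 3 = 324 candidate settings
```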