ODIN: Disentangled Reward Mitigates Hacking in RLHF

Authors: Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in RL on mitigating length bias. Experiments demonstrate that our approach eliminates the reward correlation with length, and improves the obtained policy by a significant margin. |
| Researcher Affiliation | Collaboration | 1. University of Maryland, College Park; 2. Meta (work done while at Nvidia); 3. Nvidia. |
| Pseudocode | Yes | Algorithm 1: Proximal Policy Optimization for RLHF |
| Open Source Code | No | The paper does not provide an explicit statement or link confirming the release of their source code for the described methodology. |
| Open Datasets | Yes | We use the Open Assistant dataset (Köpf et al., 2023). |
| Dataset Splits | Yes | We tried different learning rates from {1e-5, 3e-5, 5e-5} with batch size 128 for tuning both the baseline RM and ODIN on 22k preference data for 3 epochs, and picked the one with the highest validation accuracy for both. |
| Hardware Specification | Yes | All experiments are implemented with DeepSpeed-Chat (Yao et al., 2023) and Huggingface Transformers (Wolf et al., 2020), running on 8 NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like DeepSpeed-Chat and Huggingface Transformers but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We search η ∈ {5e-7, 1e-6, 2e-6}, ϵ ∈ {0.1, 0.2, 0.4}, β ∈ {2.5e-3, 5e-3, 1e-2, 2e-2}, c ∈ {inf, 2, 4}, and N ∈ {32, 64, 256}. Note we did not finish all experiments with β = 2.5e-3, but we have included the partial results in the plots when β = 2.5e-3 is not explicitly excluded. |