ODIN: Disentangled Reward Mitigates Hacking in RLHF

Authors: Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in RL on mitigating length bias. Experiments demonstrate that our approach eliminates the reward correlation with length, and improves the obtained policy by a significant margin. |
| Researcher Affiliation | Collaboration | 1. University of Maryland, College Park; 2. Meta (work done while at Nvidia); 3. Nvidia. |
| Pseudocode | Yes | Algorithm 1: Proximal Policy Optimization for RLHF |
| Open Source Code | No | The paper does not provide an explicit statement or link confirming the release of their source code for the described methodology. |
| Open Datasets | Yes | We use the Open Assistant dataset (Köpf et al., 2023). |
| Dataset Splits | Yes | We tried different learning rates from {1e-5, 3e-5, 5e-5} with batch size 128 for tuning both the baseline RM and ODIN on 22k preference data for 3 epochs, and picked the one with the highest validation accuracy for both. |
| Hardware Specification | Yes | All experiments are implemented with DeepSpeed-Chat (Yao et al., 2023) and Huggingface Transformers (Wolf et al., 2020), running on 8 NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like DeepSpeed-Chat and Huggingface Transformers but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We search η ∈ {5e-7, 1e-6, 2e-6}, ϵ ∈ {0.1, 0.2, 0.4}, β ∈ {2.5e-3, 5e-3, 1e-2, 2e-2}, c ∈ {inf, 2, 4}, and N ∈ {32, 64, 256}. Note we did not finish all experiments with β = 2.5e-3, but we have included the partial results in the plots when β = 2.5e-3 is not explicitly excluded. |