Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Authors: Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate that our approach significantly enhances training stability and model generalization.
Researcher Affiliation | Collaboration | 1 Fudan University NLP Group; 2 ByteDance Inc.
Pseudocode | Yes | Appendix A, Algorithm 1: Pseudocode for Policy Invariant Learning
Open Source Code | No | The paper does not provide a specific link to its own source code repository or explicitly state that its code is being released as open source.
Open Datasets | Yes | The SFT dataset includes 52k user-shared conversations from various domains such as mathematics, knowledge querying, and coding, collected from ShareGPT.com (https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered). Human preference data: the Anthropic-RLHF-HH dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf) is used, which is a large-scale collection of human feedback on AI assistant responses, including both helpful and harmless data (Bai et al., 2022b).
Dataset Splits | No | The paper mentions a 'validation set' in Figure 4, but does not provide specific details on how this validation split was created or its size relative to the full dataset, hindering reproducibility of the exact data partitioning.
Hardware Specification | Yes | Fine-tuning of the pre-trained models was conducted on a single node equipped with 8 A100-SXM 80GB GPUs.
Software Dependencies | No | The paper mentions leveraging the DeepSpeed ZeRO framework (Rajbhandari et al., 2020) but does not provide version numbers for it or for any other software dependencies, which would be needed to reproduce the software environment.
Experiment Setup | Yes | During training, a learning rate of 5e-6 was used, along with 2 epochs for the SFT phase and a global batch size of 32. For reward modeling, we employed a learning rate of 5e-6, a global batch size of 64, and trained the model on human preference datasets for only 1 epoch to prevent overoptimization issues. Regarding the PPO training, we utilized a learning rate of 5e-7 for the actor model and 9e-6 for the critic model. The number of epochs was set to 1, with a global batch size of 64. For each query, we collected 8 roll-out samples using nucleus sampling (Holtzman et al., 2020) for each GPU. The sampling temperature was set to 0.8, top-p was set to 0.9, the repetition penalty was set to 1.1, and the maximum output token length was set to 512. The critic model was initialized with the weights of the reward model. A token-level KL penalty coefficient of 0.05 was applied, and the Generalized Advantage Estimation (Schulman et al., 2018) parameter λ was set to 0.95. The RL discount factor γ was set to 1. Additionally, reward score normalization and clipping were performed with a clip value of 5.0. The clipped surrogate objective was employed for both actor and critic optimization, with a clip value of 0.2. In the proposed method, β_critic is set to 1 and β_policy is set to 0.01.
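
Both datasets cited in the Open Datasets row are hosted on the Hugging Face Hub, so they can be pulled programmatically. The sketch below is illustrative only: it assumes the `datasets` library is installed, and the ShareGPT file name passed to `data_files` is an assumption about that repository's layout, not something stated in the paper.

```python
from datasets import load_dataset

# Anthropic HH-RLHF: human preference comparisons with "chosen"/"rejected" responses.
hh = load_dataset("Anthropic/hh-rlhf")
print(hh["train"][0]["chosen"][:200])

# ShareGPT conversations used for the SFT stage (~52k user-shared dialogues).
# The repo holds raw JSON files, so one is selected explicitly; the file name
# below is an assumption about the repo layout, not taken from the paper.
sharegpt = load_dataset(
    "anon8231489123/ShareGPT_Vicuna_unfiltered",
    data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
)
print(len(sharegpt["train"]))
```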
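The Software Dependencies row notes that DeepSpeed ZeRO is used without version or configuration details. The dictionary below is a minimal, hypothetical ZeRO configuration that is merely consistent with the reported 8-GPU node and global batch size of 64; the ZeRO stage, precision, and micro-batch split are assumptions, not values from the paper.

```python
# Hypothetical DeepSpeed configuration sketch. Only the global batch size (64)
# and GPU count (8) come from the paper; everything else is assumed.
ds_config = {
    "train_batch_size": 64,                # reported global batch size for RM/PPO
    "train_micro_batch_size_per_gpu": 8,   # 64 / 8 GPUs, assuming no gradient accumulation
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 2},     # ZeRO stage not stated in the paper
    "bf16": {"enabled": True},             # precision not stated in the paper
}

# Typical use (model/optimizer construction omitted):
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```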
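To make the Experiment Setup row easier to scan, the values it reports are collected below into plain Python dictionaries. The numbers are taken directly from the quoted text; the dictionary layout and key names are illustrative and not part of the paper.

```python
# SFT stage hyperparameters as reported.
sft_config = {"learning_rate": 5e-6, "epochs": 2, "global_batch_size": 32}

# Reward-model training hyperparameters as reported (1 epoch to avoid overoptimization).
reward_model_config = {"learning_rate": 5e-6, "epochs": 1, "global_batch_size": 64}

# PPO / RL stage hyperparameters as reported.
ppo_config = {
    "actor_learning_rate": 5e-7,
    "critic_learning_rate": 9e-6,
    "epochs": 1,
    "global_batch_size": 64,
    "rollouts_per_query_per_gpu": 8,
    "kl_penalty_coef": 0.05,        # token-level KL penalty
    "gae_lambda": 0.95,             # Generalized Advantage Estimation lambda
    "gamma": 1.0,                   # RL discount factor
    "reward_clip": 5.0,             # reward score normalization and clipping
    "ppo_clip_range": 0.2,          # clipped surrogate objective (actor and critic)
    "beta_critic": 1.0,             # proposed method's critic coefficient
    "beta_policy": 0.01,            # proposed method's policy coefficient
}

# Rollout generation settings (nucleus sampling).
generation_config = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,
}
```

Grouping the values this way also makes the asymmetry in the reported setup visible: the critic learning rate (9e-6) is well over ten times the actor's (5e-7), and β_critic is 100 times β_policy.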