Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Authors: Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate that our approach significantly enhances training stability and model generalization.
Researcher Affiliation | Collaboration | 1 Fudan University NLP Group; 2 ByteDance Inc.
Pseudocode | Yes | Appendix A, Algorithm 1: Pseudocode for Policy Invariant Learning
Open Source Code | No | The paper does not provide a specific link to its own source code repository or explicitly state that its code is being released as open source.
Open Datasets | Yes | The SFT dataset includes 52k user-shared conversations from various domains such as mathematics, knowledge querying, and coding, collected from ShareGPT.com (https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered). Human preference data: the Anthropic-RLHF-HH dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf) is used, which is a large-scale collection of human feedback on AI assistant responses, including both helpful and harmless data (Bai et al., 2022b).
Dataset Splits | No | The paper mentions a 'validation set' in Figure 4, but does not provide specific details on how this validation split was created or its size relative to the full dataset, hindering reproducibility of the exact data partitioning.
Hardware Specification | Yes | Fine-tuning of the pre-trained models was conducted on a single node equipped with 8 A100-SXM 80GB GPUs.
Software Dependencies | No | The paper mentions leveraging the DeepSpeed ZeRO framework (Rajbhandari et al., 2020) but does not provide version numbers for it or for any other software dependencies, which would be needed to reproduce the software environment.
Experiment Setup | Yes | During training, a learning rate of 5e-6 was used, along with 2 epochs for the SFT phase and a global batch size of 32. For reward modeling, we employed a learning rate of 5e-6, a global batch size of 64, and trained the model on human preference datasets for only 1 epoch to prevent overoptimization issues. Regarding the PPO training, we utilized a learning rate of 5e-7 for the actor model and 9e-6 for the critic model. The number of epochs was set to 1, with a global batch size of 64. For each query, we collected 8 roll-out samples using nucleus sampling (Holtzman et al., 2020) for each GPU. The sampling temperature was set to 0.8, top-p was set to 0.9, the repetition penalty was set to 1.1, and the maximum output token length was set to 512. The critic model was initialized with the weights of the reward model. A token-level KL penalty coefficient of 0.05 was applied, and the Generalized Advantage Estimation (Schulman et al., 2018) parameter λ was set to 0.95. The RL discount factor γ was set to 1. Additionally, reward score normalization and clipping were performed with a clip value of 5.0. The clipped surrogate objective was employed for both actor and critic optimization, with a clip value of 0.2. In the proposed method, β_critic is set to 1 and β_policy is set to 0.01.
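
Both datasets cited in the Open Datasets row are hosted on the Hugging Face Hub, so they can be pulled programmatically. The sketch below is illustrative only: it assumes the `datasets` library is installed, and the ShareGPT file name passed to `data_files` is an assumption about that repository's layout, not something stated in the paper.

```python
from datasets import load_dataset

# Anthropic HH-RLHF: human preference comparisons with "chosen"/"rejected" responses.
hh = load_dataset("Anthropic/hh-rlhf")
print(hh["train"][0]["chosen"][:200])

# ShareGPT conversations used for the SFT stage (~52k user-shared dialogues).
# The repo holds raw JSON files, so one is selected explicitly; the file name
# below is an assumption about the repo layout, not taken from the paper.
sharegpt = load_dataset(
    "anon8231489123/ShareGPT_Vicuna_unfiltered",
    data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
)
print(len(sharegpt["train"]))
```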
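The Software Dependencies row notes that DeepSpeed ZeRO is used without version or configuration details. The dictionary below is a minimal, hypothetical ZeRO configuration that is merely consistent with the reported 8-GPU node and global batch size of 64; the ZeRO stage, precision, and micro-batch split are assumptions, not values from the paper.

```python
# Hypothetical DeepSpeed configuration sketch. Only the global batch size (64)
# and GPU count (8) come from the paper; everything else is assumed.
ds_config = {
    "train_batch_size": 64,                # reported global batch size for RM/PPO
    "train_micro_batch_size_per_gpu": 8,   # 64 / 8 GPUs, assuming no gradient accumulation
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 2},     # ZeRO stage not stated in the paper
    "bf16": {"enabled": True},             # precision not stated in the paper
}

# Typical use (model/optimizer construction omitted):
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```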
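To make the Experiment Setup row easier to scan, the values it reports are collected below into plain Python dictionaries. The numbers are taken directly from the quoted text; the dictionary layout and key names are illustrative and not part of the paper.

```python
# SFT stage hyperparameters as reported.
sft_config = {"learning_rate": 5e-6, "epochs": 2, "global_batch_size": 32}

# Reward-model training hyperparameters as reported (1 epoch to avoid overoptimization).
reward_model_config = {"learning_rate": 5e-6, "epochs": 1, "global_batch_size": 64}

# PPO / RL stage hyperparameters as reported.
ppo_config = {
    "actor_learning_rate": 5e-7,
    "critic_learning_rate": 9e-6,
    "epochs": 1,
    "global_batch_size": 64,
    "rollouts_per_query_per_gpu": 8,
    "kl_penalty_coef": 0.05,        # token-level KL penalty
    "gae_lambda": 0.95,             # Generalized Advantage Estimation lambda
    "gamma": 1.0,                   # RL discount factor
    "reward_clip": 5.0,             # reward score normalization and clipping
    "ppo_clip_range": 0.2,          # clipped surrogate objective (actor and critic)
    "beta_critic": 1.0,             # proposed method's critic coefficient
    "beta_policy": 0.01,            # proposed method's policy coefficient
}

# Rollout generation settings (nucleus sampling).
generation_config = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,
}
```

Grouping the values this way also makes the asymmetry in the reported setup visible: the critic learning rate (9e-6) is well over ten times the actor's (5e-7), and β_critic is 100 times β_policy.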