Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
Authors: Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that our approach significantly enhances training stability and model generalization. |
| Researcher Affiliation | Collaboration | Fudan University NLP Group; ByteDance Inc. |
| Pseudocode | Yes | Appendix A (Algorithm): Algorithm 1, Pseudocode for Policy Invariant Learning |
| Open Source Code | No | The paper does not provide a specific link to its own source code repository or explicitly state that its code is being released as open source. |
| Open Datasets | Yes | The SFT dataset includes 52k user-shared conversations from various domains such as mathematics, knowledge querying, and coding, collected from ShareGPT.com (https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered). Human preference data comes from the Anthropic-RLHF-HH dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf), a large-scale collection of human feedback on AI assistant responses, including both helpful and harmless data (Bai et al., 2022b). A hedged loading sketch is given after the table. |
| Dataset Splits | No | The paper mentions a 'validation set' in Figure 4, but does not provide specific details on how this validation split was created or its size relative to the full dataset, hindering reproducibility of the exact data partitioning. |
| Hardware Specification | Yes | Fine-tuning of the pre-trained models was conducted on a single node equipped with 8 A100-SXM-80GB GPUs. |
| Software Dependencies | No | The paper mentions leveraging the 'DeepSpeed ZeRO framework (Rajbhandari et al., 2020)' but does not provide specific version numbers for this or any other software dependencies, which are necessary for a reproducible description. |
| Experiment Setup | Yes | During training, a learning rate of 5e-6 was used, along with 2 epochs for the SFT phase and a global batch size of 32. For reward modeling, we employed a learning rate of 5e-6, a global batch size of 64, and trained the model on human preference datasets for only 1 epoch to prevent over-optimization issues. Regarding the PPO training, we utilized a learning rate of 5e-7 for the actor model and 9e-6 for the critic model. The number of epochs was set to 1, with a global batch size of 64. For each query, we collected 8 roll-out samples using nucleus sampling (Holtzman et al., 2020) for each GPU. The sampling temperature was set to 0.8, top-p was set to 0.9, repetition penalty was set to 1.1, and the maximum output token length was set to 512. The critic model was initialized with the weights of the reward model. A token-level KL penalty coefficient of 0.05 was applied, and the Generalized Advantage Estimation (Schulman et al., 2018) parameter λ was set to 0.95. The RL discount factor γ was set to 1. Additionally, reward score normalization and clipping were performed with a clip value of 5.0. The clipped surrogate objective was employed for both actor and critic optimization, with a clip value of 0.2. In the proposed method, β_critic is set to 1 and β_policy is set to 0.01. These values are consolidated in the configuration sketch after the table. |
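The two public datasets named in the Open Datasets row can be pulled directly from the Hugging Face Hub. The sketch below is a minimal illustration using the `datasets` library; the specific ShareGPT JSON filename passed via `data_files` is an assumption made for illustration, since that repository hosts raw JSON files rather than a packaged dataset, and it is not code released by the authors.

```python
# Minimal sketch: loading the paper's two public datasets with the Hugging Face
# `datasets` library. The ShareGPT filename below is an assumption.
from datasets import load_dataset

# Human preference data: helpful + harmless comparisons (Bai et al., 2022b).
# The dataset exposes "chosen" and "rejected" response columns.
hh_rlhf = load_dataset("Anthropic/hh-rlhf")  # splits: "train", "test"
print(hh_rlhf["train"][0]["chosen"][:200])

# SFT data: ~52k user-shared ShareGPT conversations. The repo contains raw JSON,
# so a concrete file is selected via data_files (filename assumed, not verified).
sharegpt = load_dataset(
    "anon8231489123/ShareGPT_Vicuna_unfiltered",
    data_files="ShareGPT_V3_unfiltered_cleaned_split.json",
)
print(sharegpt["train"][0].keys())
```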
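The hyperparameters quoted in the Experiment Setup row can be consolidated in one place. The sketch below is a plain-Python summary of the reported values; the dictionary names and the Hugging Face-style generation keys are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch: the SFT, reward-modeling, and PPO settings reported in the
# paper, collected into plain dictionaries (names are assumptions).

sft_config = {
    "learning_rate": 5e-6,
    "epochs": 2,
    "global_batch_size": 32,
}

reward_model_config = {
    "learning_rate": 5e-6,
    "global_batch_size": 64,
    "epochs": 1,  # single epoch to avoid over-optimization on preference data
}

ppo_config = {
    "actor_learning_rate": 5e-7,
    "critic_learning_rate": 9e-6,
    "epochs": 1,
    "global_batch_size": 64,
    "rollouts_per_query_per_gpu": 8,
    "kl_penalty_coef": 0.05,   # token-level KL penalty
    "gae_lambda": 0.95,        # Generalized Advantage Estimation lambda
    "gamma": 1.0,              # RL discount factor
    "reward_clip": 5.0,        # reward score normalization and clipping
    "ppo_clip_range": 0.2,     # clipped surrogate objective (actor and critic)
    "beta_critic": 1.0,        # proposed method
    "beta_policy": 0.01,       # proposed method
}

# Nucleus-sampling roll-out settings (Holtzman et al., 2020)
generation_kwargs = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,
}
```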