Safe RLHF: Safe Reinforcement Learning from Human Feedback
Authors: Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing performance compared to existing algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations. |
| Researcher Affiliation | Academia | (1) Center for AI Safety and Governance, Institute for AI, Peking University; (2) School of Computer Science, Peking University |
| Pseudocode | No | The paper describes the methods using equations and textual explanations, but it does not contain any structured pseudocode or algorithm blocks. (A hedged sketch of the core Lagrangian update appears after this table.) |
| Open Source Code | Yes | Code is available at https://github.com/PKU-Alignment/safe-rlhf. |
| Open Datasets | Yes | In the first iteration, our prompts were derived from open-source safety-related datasets referenced in Ganguli et al. (2022) and Sun et al. (2023a). |
| Dataset Splits | No | For both the reward model and cost model, the model selection primarily aims to achieve higher prediction accuracy. For different parameter training outcomes, we evaluate their predictive accuracy on a reserved test set and select the one with the highest accuracy for the next step. (A hedged sketch of this selection rule appears after this table.) |
| Hardware Specification | Yes | All experiments in this paper utilized a large language model with 7 billion parameters. The server's CPU was an Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz with 64 cores, and the graphics cards were NVIDIA A800-SXM4-80GB × 8, with NVLink support and the graphics driver version being 525.125.06. |
| Software Dependencies | Yes | The graphics driver version was 525.125.06. |
| Experiment Setup | Yes | The hyper-parameters utilized during the Safe RLHF training process are enumerated in Tables 2, 3, and 4. |
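
The Pseudocode row notes that the paper presents Safe RLHF only through equations and prose. The sketch below illustrates the core idea of that formulation, the Lagrangian relaxation of the constrained RLHF objective, in which a multiplier λ trades off the reward-model score against the cost-model score and is increased whenever the expected cost exceeds its threshold. This is a minimal illustration, not the authors' released implementation: the tensor names, the learning rate, and the 1/(1 + λ) rescaling of the shaped reward are assumptions made here; consult https://github.com/PKU-Alignment/safe-rlhf for the actual training loop.

```python
import torch

# Minimal sketch (not the released implementation) of the Lagrangian trade-off
# behind Safe RLHF: PPO maximizes a shaped reward mixing the reward-model
# score R and the cost-model score C, while the multiplier lam is pushed up
# whenever the mean cost of sampled responses exceeds the threshold d.

def shaped_reward(reward, cost, lam):
    # The 1 / (1 + lam) rescaling is an assumption used here to keep the
    # shaped signal on a stable scale as lam grows.
    return (reward - lam * cost) / (1.0 + lam)

def update_log_lambda(log_lam, cost, d=0.0, lr=0.05):
    # Ascent step on lam (parameterized in log-space so it stays positive):
    # the gradient of lam * (E[C] - d) w.r.t. log(lam) is lam * (E[C] - d).
    lam = log_lam.exp()
    return log_lam + lr * lam * (cost.mean() - d)

# Toy usage with random stand-ins for reward/cost model outputs.
log_lam = torch.tensor(0.0)
reward = torch.randn(8)        # reward-model scores for 8 sampled responses
cost = torch.randn(8) + 0.5    # cost-model scores (positive means harmful)
ppo_signal = shaped_reward(reward, cost, log_lam.exp())
log_lam = update_log_lambda(log_lam, cost)
print(ppo_signal, log_lam.exp())
```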
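
The Dataset Splits row quotes the model-selection rule for the reward and cost models: among candidate training runs, keep the checkpoint with the highest preference-prediction accuracy on a reserved test set. The snippet below is a hedged sketch of that rule only; `evaluate_accuracy`, the callable checkpoints, and the pairwise test format are hypothetical placeholders, since the paper does not report the split sizes or interface.

```python
# Hedged sketch of the checkpoint-selection rule quoted in the Dataset Splits
# row: keep the model that best predicts which response humans preferred.
# All names below are hypothetical; the paper does not specify this interface.

def evaluate_accuracy(score_fn, test_pairs):
    """Fraction of held-out pairs where the preferred response scores higher."""
    correct = sum(score_fn(chosen) > score_fn(rejected)
                  for chosen, rejected in test_pairs)
    return correct / max(len(test_pairs), 1)

def select_best_checkpoint(checkpoints, test_pairs):
    """Among candidate score functions, return the one with top test accuracy."""
    return max(checkpoints, key=lambda fn: evaluate_accuracy(fn, test_pairs))

# Toy usage with string-length "models" standing in for real reward models.
pairs = [("a good long answer", "bad"), ("helpful reply", "no")]
candidates = [len, lambda text: -len(text)]
best = select_best_checkpoint(candidates, pairs)
```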