RRHF: Rank Responses to Align Language Models with Human Feedback

Authors: Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality, which suggests RRHF is a best-of-n learner.
Researcher Affiliation | Collaboration | Hongyi Yuan (1,2), Zheng Yuan (1), Chuanqi Tan (1), Wei Wang (1), Songfang Huang (1), Fei Huang (1); (1) Alibaba DAMO Academy, (2) Tsinghua University
Pseudocode | No | The paper describes the mathematical formulation of its method but does not provide any pseudocode or a clearly labeled algorithm block (a hedged sketch of the ranking objective is given after the table).
Open Source Code | Yes | Codes are released at https://github.com/GanjinZero/RRHF.
Open Datasets | Yes | We use Anthropic's Helpful and Harmless (HH) dataset as our experiment dataset [3] (footnote 5: https://huggingface.co/datasets/Dahoas/rm-static). [3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Dataset Splits | No | The paper states it uses Anthropic's Helpful and Harmless (HH) dataset [3] but does not explicitly provide the training, validation, and test splits (e.g., percentages or sample counts) for its main experiments. It mentions '25k training samples and each 5k sample set for validation and testing' for the IMDB dataset in Appendix C, but not for the HH dataset.
Hardware Specification | Yes | Sampling using vanilla beam search/diverse beam search/top-p sampling costs 4-6 hours on 8 80GB Nvidia A100 GPUs. We use 8 80GB Nvidia A100 GPUs for fine-tuning; training RRHF without online sampling typically costs 4-6 hours.
Software Dependencies | Yes | Ouyang et al. [22] and Ramamurthy et al. [25] use supervised fine-tuned models as the initial models when applying PPO, so we also have fine-tuned Alpaca-7B on our used dataset with chosen responses (i.e. human-preferred responses) following trlX [34] and name it as Alpaca-sft. [34] Leandro von Werra, et al. 2023. CarperAI/trlx: v0.6.0: LLaMa (Alpaca), Benchmark Util, T5 ILQL, Tests.
Experiment Setup | Yes | We fine-tune RRHF with 3 epochs without early stopping. We first warm up the learning rate to 2e-5 and decay to 0 linearly. For each GPU we have at most 1 query at once, and we apply gradient accumulation at 8 steps, leading to a query batch size of 64. The query and responses are truncated to 192 tokens.
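
Since the Pseudocode row notes that the paper gives only a mathematical formulation, the following PyTorch sketch illustrates the RRHF objective as the paper describes it: each sampled response is scored by its length-normalized conditional log-probability, a pairwise zero-margin ranking loss penalizes scoring a lower-reward response above a higher-reward one, and a cross-entropy term on the highest-reward response is added. The helper name `sequence_logprob`, the tensor layout, and the assumption that logits are already shifted to align with labels are illustrative choices, not taken from the released code.

```python
import torch

def sequence_logprob(logits, labels, mask):
    """Length-normalized log-probability p_i of each candidate response.

    logits: (k, T, V) float - model outputs, assumed already shifted to align with labels
    labels: (k, T)    long  - target token ids (prompt + response)
    mask:   (k, T)    float - 1.0 on response tokens, 0.0 on prompt/padding tokens
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (k, T)
    return (token_logp * mask).sum(-1) / mask.sum(-1)                # (k,)

def rrhf_loss(logits, labels, mask, rewards):
    """Ranking loss plus cross-entropy on the highest-reward response (a sketch)."""
    scores = sequence_logprob(logits, labels, mask)                  # p_i
    # Pairwise ranking: whenever r_i < r_j, penalize max(0, p_i - p_j).
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)                 # diff[i, j] = p_i - p_j
    lower = rewards.unsqueeze(1) < rewards.unsqueeze(0)              # lower[i, j] = (r_i < r_j)
    l_rank = torch.clamp(diff, min=0.0)[lower].sum()
    # Cross-entropy fine-tuning term on the response with the highest reward.
    best = rewards.argmax()
    best_token_logp = torch.log_softmax(logits[best], dim=-1).gather(
        -1, labels[best].unsqueeze(-1)).squeeze(-1)                  # (T,)
    l_ft = -(best_token_logp * mask[best]).sum()
    return l_rank + l_ft
```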
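
The HH data referenced in the Open Datasets row (footnote 5 of the paper) is hosted on the Hugging Face Hub, so loading it reduces to a single call with the `datasets` library. The field names in the comments are the ones commonly listed for this dataset and should be verified against the dataset card.

```python
from datasets import load_dataset

# Loads the Helpful and Harmless data referenced in the paper's footnote 5.
hh = load_dataset("Dahoas/rm-static")
print(hh)  # shows the available splits and their sizes

example = hh["train"][0]
print(example["prompt"])    # dialogue context ("Human: ... Assistant:")
print(example["chosen"])    # human-preferred continuation
print(example["rejected"])  # dispreferred continuation
```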
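
The Hardware Specification row mentions three candidate-generation strategies (vanilla beam search, diverse beam search, top-p sampling). A minimal sketch using the `transformers` generation API is below; the model path, prompt, and generation hyperparameters are placeholders, not the paper's exact settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/alpaca-7b"  # placeholder; the paper samples from Alpaca/LLaMA-based policies
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Human: How can I politely decline a meeting invitation?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Vanilla beam search: return the highest-scoring beams as candidate responses.
beams = model.generate(**inputs, num_beams=4, num_return_sequences=4, max_new_tokens=128)

# Diverse beam search: beam groups with a diversity penalty to reduce near-duplicates.
diverse = model.generate(**inputs, num_beams=4, num_beam_groups=4,
                         diversity_penalty=1.0, num_return_sequences=4, max_new_tokens=128)

# Top-p (nucleus) sampling: stochastic candidates drawn from the top-p probability mass.
sampled = model.generate(**inputs, do_sample=True, top_p=0.9,
                         num_return_sequences=4, max_new_tokens=128)

candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in sampled]
```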
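
The hyperparameters in the Experiment Setup row map naturally onto a Hugging Face `TrainingArguments` object. The sketch below mirrors the reported values; the output directory, warmup length, and bf16 flag are assumptions not stated in the excerpt, and the 192-token truncation would be applied during tokenization rather than here.

```python
from transformers import TrainingArguments

# A sketch of the reported hyperparameters; the released code may or may not use the Trainer API.
training_args = TrainingArguments(
    output_dir="rrhf-alpaca-7b",          # illustrative path
    num_train_epochs=3,                   # "3 epochs without early stopping"
    learning_rate=2e-5,                   # peak LR after warmup
    lr_scheduler_type="linear",           # linear decay to 0
    warmup_ratio=0.03,                    # warmup length is not stated in the excerpt; assumed
    per_device_train_batch_size=1,        # "at most 1 query at once" per GPU
    gradient_accumulation_steps=8,        # 8 GPUs x 1 query x 8 steps = query batch size of 64
    bf16=True,                            # assumed for A100 training; not stated in the excerpt
    logging_steps=10,
)
```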