RRHF: Rank Responses to Align Language Models with Human Feedback

Authors: Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality, which suggests RRHF is a best-of-n learner.
Researcher Affiliation | Collaboration | Hongyi Yuan (1,2), Zheng Yuan (1), Chuanqi Tan (1), Wei Wang (1), Songfang Huang (1), Fei Huang (1); (1) Alibaba DAMO Academy, (2) Tsinghua University
Pseudocode | No | The paper describes the mathematical formulation of its method but does not provide any pseudocode or a clearly labeled algorithm block (a hedged sketch of the ranking objective is given after the table).
Open Source Code | Yes | Codes are released at https://github.com/GanjinZero/RRHF.
Open Datasets | Yes | We use Anthropic's Helpful and Harmless (HH) dataset as our experiment dataset [3] (footnote 5: https://huggingface.co/datasets/Dahoas/rm-static). [3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Dataset Splits | No | The paper states it uses Anthropic's Helpful and Harmless (HH) dataset [3] but does not explicitly provide the training, validation, and test splits (e.g., percentages or sample counts) for its main experiments. It mentions '25k training samples and each 5k sample set for validation and testing' for the IMDB dataset in Appendix C, but not for the HH dataset.
Hardware Specification | Yes | Sampling using vanilla beam search/diverse beam search/top-p sampling costs 4-6 hours on 8 80GB Nvidia A100 GPUs. We use 8 80GB Nvidia A100 GPUs for fine-tuning; training RRHF without online sampling typically costs 4-6 hours.
Software Dependencies | Yes | Ouyang et al. [22] and Ramamurthy et al. [25] use supervised fine-tuned models as the initial models when applying PPO, so we also have fine-tuned Alpaca-7B on our used dataset with chosen responses (i.e. human-preferred responses) following trlX [34] and name it as Alpaca-sft. [34] Leandro von Werra, et al. 2023. CarperAI/trlx: v0.6.0: LLaMa (Alpaca), Benchmark Util, T5 ILQL, Tests.
Experiment Setup | Yes | We fine-tune RRHF with 3 epochs without early stopping. We first warm up the learning rate to 2e-5 and decay to 0 linearly. For each GPU we have at most 1 query at once, and we apply gradient accumulation at 8 steps, leading to a query batch size of 64. The query and responses are truncated to 192 tokens.
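
Since the Pseudocode row notes that the paper gives only a mathematical formulation, the following PyTorch sketch illustrates the RRHF objective as the paper describes it: each sampled response is scored by its length-normalized conditional log-probability, a pairwise zero-margin ranking loss penalizes scoring a lower-reward response above a higher-reward one, and a cross-entropy term on the highest-reward response is added. The helper name `sequence_logprob`, the tensor layout, and the assumption that logits are already shifted to align with labels are illustrative choices, not taken from the released code.

```python
import torch

def sequence_logprob(logits, labels, mask):
    """Length-normalized log-probability p_i of each candidate response.

    logits: (k, T, V) float - model outputs, assumed already shifted to align with labels
    labels: (k, T)    long  - target token ids (prompt + response)
    mask:   (k, T)    float - 1.0 on response tokens, 0.0 on prompt/padding tokens
    """
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (k, T)
    return (token_logp * mask).sum(-1) / mask.sum(-1)                # (k,)

def rrhf_loss(logits, labels, mask, rewards):
    """Ranking loss plus cross-entropy on the highest-reward response (a sketch)."""
    scores = sequence_logprob(logits, labels, mask)                  # p_i
    # Pairwise ranking: whenever r_i < r_j, penalize max(0, p_i - p_j).
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)                 # diff[i, j] = p_i - p_j
    lower = rewards.unsqueeze(1) < rewards.unsqueeze(0)              # lower[i, j] = (r_i < r_j)
    l_rank = torch.clamp(diff, min=0.0)[lower].sum()
    # Cross-entropy fine-tuning term on the response with the highest reward.
    best = rewards.argmax()
    best_token_logp = torch.log_softmax(logits[best], dim=-1).gather(
        -1, labels[best].unsqueeze(-1)).squeeze(-1)                  # (T,)
    l_ft = -(best_token_logp * mask[best]).sum()
    return l_rank + l_ft
```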
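
The HH data referenced in the Open Datasets row (footnote 5 of the paper) is hosted on the Hugging Face Hub, so loading it reduces to a single call with the `datasets` library. The field names in the comments are the ones commonly listed for this dataset and should be verified against the dataset card.

```python
from datasets import load_dataset

# Loads the Helpful and Harmless data referenced in the paper's footnote 5.
hh = load_dataset("Dahoas/rm-static")
print(hh)  # shows the available splits and their sizes

example = hh["train"][0]
print(example["prompt"])    # dialogue context ("Human: ... Assistant:")
print(example["chosen"])    # human-preferred continuation
print(example["rejected"])  # dispreferred continuation
```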
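
The Hardware Specification row mentions three candidate-generation strategies (vanilla beam search, diverse beam search, top-p sampling). A minimal sketch using the `transformers` generation API is below; the model path, prompt, and generation hyperparameters are placeholders, not the paper's exact settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/alpaca-7b"  # placeholder; the paper samples from Alpaca/LLaMA-based policies
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Human: How can I politely decline a meeting invitation?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Vanilla beam search: return the highest-scoring beams as candidate responses.
beams = model.generate(**inputs, num_beams=4, num_return_sequences=4, max_new_tokens=128)

# Diverse beam search: beam groups with a diversity penalty to reduce near-duplicates.
diverse = model.generate(**inputs, num_beams=4, num_beam_groups=4,
                         diversity_penalty=1.0, num_return_sequences=4, max_new_tokens=128)

# Top-p (nucleus) sampling: stochastic candidates drawn from the top-p probability mass.
sampled = model.generate(**inputs, do_sample=True, top_p=0.9,
                         num_return_sequences=4, max_new_tokens=128)

candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in sampled]
```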
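
The hyperparameters in the Experiment Setup row map naturally onto a Hugging Face `TrainingArguments` object. The sketch below mirrors the reported values; the output directory, warmup length, and bf16 flag are assumptions not stated in the excerpt, and the 192-token truncation would be applied during tokenization rather than here.

```python
from transformers import TrainingArguments

# A sketch of the reported hyperparameters; the released code may or may not use the Trainer API.
training_args = TrainingArguments(
    output_dir="rrhf-alpaca-7b",          # illustrative path
    num_train_epochs=3,                   # "3 epochs without early stopping"
    learning_rate=2e-5,                   # peak LR after warmup
    lr_scheduler_type="linear",           # linear decay to 0
    warmup_ratio=0.03,                    # warmup length is not stated in the excerpt; assumed
    per_device_train_batch_size=1,        # "at most 1 query at once" per GPU
    gradient_accumulation_steps=8,        # 8 GPUs x 1 query x 8 steps = query batch size of 64
    bf16=True,                            # assumed for A100 training; not stated in the excerpt
    logging_steps=10,
)
```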