Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Authors: Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation."
Researcher Affiliation | Collaboration | Zeqiu Wu¹, Yushi Hu¹, Weijia Shi¹, Nouha Dziri², Alane Suhr³, Prithviraj Ammanabrolu⁴,⁵, Noah A. Smith¹,², Mari Ostendorf¹, Hannaneh Hajishirzi¹,²; ¹University of Washington, ²Allen Institute for Artificial Intelligence, ³University of California, Berkeley, ⁴University of California, San Diego, ⁵MosaicML
Pseudocode | Yes | Algorithm 1: Fine-Grained Reinforcement Learning from Human Feedback (FINE-GRAINED RLHF), reproduced below (a code sketch of the updates in steps 8 and 9 is given after the table):
    Input: initial policy model P_θ_init; initial value model V_ψ_init; K reward models R_φk trained from human feedback; task prompts D; hyperparameters γ, λ, ε, β
    1: policy model P_θ ← P_θ_init, value model V_ψ ← V_ψ_init
    2: for step = 1, ..., M do
    3:     Sample a batch D_b from D
    4:     Sample an output sequence y^n ~ P_θ(· | x^n) for each prompt x^n ∈ D_b
    5:     Compute rewards {r_t^n} for t = 1, ..., |y^n| for each sampled output y^n by running the reward models R_φk (Eq. 1)
    6:     Compute advantages {A_t} and value targets {V^targ(s_t)} for t = 1, ..., |y^n| for each y^n with V_ψ
    7:     for PPO iteration = 1, ..., μ do
    8:         Update the policy model by maximizing the PPO clipped surrogate objective:
                   θ ← argmax_θ (1 / |D_b|) Σ_n Σ_{t=1..|y^n|} min( v_t · A_t, clip(v_t, 1 − ε, 1 + ε) · A_t ),
                   where v_t = P_θ(a_t | s_t) / P_θ_old(a_t | s_t)
    9:         Update the value model by minimizing a squared-error objective:
                   ψ ← argmin_ψ (1 / |D_b|) Σ_n Σ_{t=1..|y^n|} ( V_ψ(s_t) − V^targ(s_t) )²
Open Source Code | Yes | "We release all data, collected human feedback, and codes at https://FineGrainedRLHF.github.io."
Open Datasets | Yes | "With experiments on long-form QA, we aim to examine training models with fine-grained rewards at the two granularity dimensions (error category and density), for which we construct a long-form QA dataset, QA-FEEDBACK, along with our collected human feedback. [...] Overall, we have 3,853 training, 500 development, and 948 test examples (details in Appendix C). [...] QA-FEEDBACK is based on ASQA [39], a dataset that focuses on answering ambiguous factoid questions [26] in an open-domain setting." (An illustrative annotation schema for this feedback is sketched after the table.)
Dataset Splits | Yes | "Overall, we have 3,853 training, 500 development, and 948 test examples (details in Appendix C)."
Hardware Specification | Yes | "Regarding computation time, we use 2 80G NVIDIA A100 GPU for training, and the run time is about 22 hours."
Software Dependencies | No | No explicit version numbers are provided for software dependencies such as programming languages or libraries (e.g., PyTorch, TensorFlow, scikit-learn). The paper mentions tools such as the Adam optimizer, nucleus sampling decoding, the PERSPECTIVE API [1], and spaCy [15], but without specific versions.
Experiment Setup | Yes | "For training, we run 200K episodes. The batch size (number of episodes per card during training) is 64. We use Adam optimizer with a linear learning rate scheduler and 10 warmup steps. We perform a hyper-parameter grid-search for peak learning rate ∈ {5e-6, 1e-5, 2e-5}, KL coefficient β ∈ {0.1, 0.2, 0.3}, discounting factor λ ∈ {0.95, 0.97, 0.99}, and the frequency of exploration (number of sampled outputs) ∈ {2, 4, 8}." (The grid is enumerated in a short sketch after the table.)
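
The policy and value updates in steps 8 and 9 of Algorithm 1 can be illustrated with a short, self-contained sketch. The snippet below assumes per-token log-probabilities, advantages, value estimates, and value targets have already been computed; the tensor names, shapes, and the ppo_losses helper are hypothetical and this is not the authors' released implementation.

    import torch

    def ppo_losses(logprobs_new, logprobs_old, advantages, values, value_targets,
                   mask, eps=0.2):
        """Clipped PPO policy loss and squared-error value loss (Algorithm 1, steps 8-9).

        All tensors have shape (batch, seq_len); mask is 1.0 on generated tokens.
        """
        # Probability ratio v_t = P_theta(a_t | s_t) / P_theta_old(a_t | s_t)
        ratio = torch.exp(logprobs_new - logprobs_old)
        # Clipped surrogate objective: keep the smaller of the unclipped and clipped
        # terms, then negate because optimizers minimize while Algorithm 1 maximizes.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        policy_loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()
        # Squared-error value objective against the value targets V^targ(s_t)
        value_loss = (((values - value_targets) ** 2) * mask).sum() / mask.sum()
        return policy_loss, value_loss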
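
To make the two granularity dimensions mentioned under Open Datasets (error category and density) concrete, the following is a hypothetical Python schema for a single fine-grained feedback annotation. The field names, the example category label, and the density values are illustrative assumptions and do not reproduce the released QA-FEEDBACK format.

    from dataclasses import dataclass

    @dataclass
    class FineGrainedFeedback:
        """One span-level human feedback annotation (illustrative schema only)."""
        question_id: str     # which QA-FEEDBACK example the answer belongs to
        span_start: int      # character offset where the marked span begins
        span_end: int        # character offset where the marked span ends
        error_category: str  # error type assigned by the annotator
        density: str         # granularity at which the reward applies,
                             # e.g. "sub-sentence", "sentence", or "full sequence"

    # Example: an annotator flags characters 42-87 of a generated answer
    example = FineGrainedFeedback(
        question_id="asqa-0001",         # hypothetical identifier
        span_start=42,
        span_end=87,
        error_category="factual-error",  # hypothetical label
        density="sentence",
    )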
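
The hyper-parameter grid search quoted under Experiment Setup spans 3 × 3 × 3 × 3 = 81 candidate configurations. The sketch below simply enumerates that search space; the train_and_evaluate call is a hypothetical placeholder, not the authors' training script.

    from itertools import product

    # Values copied from the Experiment Setup quote above
    peak_learning_rates = [5e-6, 1e-5, 2e-5]
    kl_coefficients_beta = [0.1, 0.2, 0.3]
    discount_factors_lambda = [0.95, 0.97, 0.99]
    num_sampled_outputs = [2, 4, 8]  # "frequency of exploration"

    # Enumerate all 81 combinations of the grid
    for lr, beta, lam, n_samples in product(peak_learning_rates,
                                            kl_coefficients_beta,
                                            discount_factors_lambda,
                                            num_sampled_outputs):
        config = {"lr": lr, "kl_beta": beta, "lambda": lam, "n_samples": n_samples}
        # train_and_evaluate(config)  # hypothetical training entry point
        print(config)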