Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Authors: Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with robotic control and natural language generation with large language models (LLMs) show that our method not only improves policy performance but also aligns reward function selection more closely with policy optimization, simplifying the hyperparameter tuning process.
Researcher Affiliation | Collaboration | Ilgee Hong (Georgia Institute of Technology, ihong39@gatech.edu); Zichong Li (Georgia Institute of Technology, zli911@gatech.edu); Alexander Bukharin (Georgia Institute of Technology, abukharin3@gatech.edu); Yixiao Li (Georgia Institute of Technology, yixiaoli@gatech.edu); Haoming Jiang (Amazon, jhaoming@amazon.com); Tianbao Yang (Texas A&M University, tianbao-yang@tamu.edu); Tuo Zhao (Georgia Institute of Technology, tourzhao@gatech.edu)
Pseudocode | Yes | Algorithm 1 (reward learning with adaptive preference scaling): 1: Input: τ0, τmax, ρ, ηϕ; 2: for m = 0, 1, 2, ..., M − 1 do; 3: Sample a pair of trajectory segments from Dpref; 4: Set τi^(0) = 1; 5: for k = 0, 1, 2, ..., K − 1 do; 6: Compute the Newton step using (9) and update τi^(k) using (8); 7: end for; 8: Update ϕ^(m) using (10) or an Adam-style step; 9: end for. (A runnable sketch of this loop appears after the table.)
Open Source Code | No | Justification: We will release the code after the submission deadline.
Open Datasets | Yes | We apply our proposed reward learning method on 3 robotic control tasks from the PyBullet [13] environments: Half Cheetah, Ant, and Hopper. These environments are similar to those available in OpenAI Gym [9] but they are known to be much harder to solve [34]. We use the filtered TL;DR summarization dataset [41] for instruction tuning, which contains more than 117K Reddit posts, each with a human-written summary. We utilize the Anthropic Helpful and Harmless dialogue preferences dataset [3] for both instruction tuning and preference optimization.
Dataset Splits | Yes | We also split a subset from each preference optimization dataset to validate the preference prediction accuracy.
Hardware Specification | Yes | We conducted our experiments using eight A100 GPUs, each with 40GB of memory. Training a single model took approximately two hours.
Software Dependencies | No | The paper mentions software such as Stable-Baselines3, RL Zoo, transformers, and the trl training framework, but does not provide version numbers for these components.
Experiment Setup | Yes | For both Ada-Pref and Pref, we set the segment length to 1... We set the batch size to 64 for the Half Cheetah and Ant tasks and 4 for the Hopper task. We tune the number of epochs in {1, 3, 5}. We use the Adam optimizer [22] and tune the learning rate in {5e-3, 1e-3, 5e-4, 1e-4} for the Ant and Half Cheetah, and set the learning rate to 1e-2 for the Hopper. For Ada-Pref, we tune τmax in {1.0, 3.0} and ρ0 in {0.1, 0.3, 0.5}. We fix τ0 = 0.1 and the number of Newton iterations to 3 for all experiments. Details of the chosen hyperparameters for reward learning for all three tasks are summarized in Tables 3 and 4. (These quoted grids and fixed choices are collected in the configuration sketch after the table.)
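
The Pseudocode row above gives Algorithm 1 only at a high level. Below is a minimal PyTorch sketch of that loop structure, written for this report rather than taken from the authors' release: the inner objective used for the Newton updates of τi is an illustrative stand-in for equations (8)-(9), the final scaled loss stands in for (10), and the function name and default values are assumptions.

```python
# Minimal sketch of the Algorithm 1 loop structure; NOT the authors' implementation.
# The inner objective for tau_i is a placeholder for equations (8)-(9), and the
# final scaled loss is a placeholder for (10).
import torch
import torch.nn.functional as F


def adaptive_scaled_loss(delta, tau0=0.1, tau_max=3.0, rho=0.3, newton_steps=3):
    """Per-pair adaptively scaled preference loss.

    delta: reward margins r(chosen) - r(rejected), shape (batch,).
    Each pair i gets its own scale tau_i, initialized to 1 (Algorithm 1, line 4),
    refined by a few Newton steps on a regularized inner objective (lines 5-7),
    and clipped to [tau0, tau_max].
    """
    tau = torch.ones_like(delta)                   # line 4: tau_i^(0) = 1
    d = delta.detach()                             # tau is fit with the reward model frozen
    for _ in range(newton_steps):                  # lines 5-7: Newton iterations
        tau = tau.detach().requires_grad_(True)
        inner = tau * F.softplus(-d / tau) + rho * tau   # placeholder inner objective
        (g,) = torch.autograd.grad(inner.sum(), tau, create_graph=True)
        (h,) = torch.autograd.grad(g.sum(), tau)         # per-pair second derivative
        tau = (tau - g / (h + 1e-8)).clamp(tau0, tau_max).detach()
    # line 8: loss whose gradient drives the Adam update of the reward model phi
    return (tau * F.softplus(-delta / tau)).mean()


# Usage with any reward model producing scalar rewards:
#   delta = reward_model(chosen) - reward_model(rejected)
#   loss = adaptive_scaled_loss(delta)
#   loss.backward(); optimizer.step()
```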
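
For reference, the reward-learning hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. The dictionary and key names are our own; the values are the grids and fixed choices reported above (a list indicates a value that was tuned over, a scalar one that was fixed).

```python
# Illustrative gathering of the reward-learning hyperparameters quoted above.
REWARD_LEARNING_SEARCH_SPACE = {
    "segment_length": 1,
    "batch_size": {"HalfCheetah": 64, "Ant": 64, "Hopper": 4},
    "num_epochs": [1, 3, 5],
    "optimizer": "Adam",
    "learning_rate": {
        "HalfCheetah": [5e-3, 1e-3, 5e-4, 1e-4],
        "Ant": [5e-3, 1e-3, 5e-4, 1e-4],
        "Hopper": 1e-2,                 # fixed, not tuned
    },
    # Ada-Pref-specific settings
    "tau_max": [1.0, 3.0],
    "rho0": [0.1, 0.3, 0.5],
    "tau0": 0.1,                        # fixed for all experiments
    "newton_iterations": 3,             # fixed for all experiments
}
```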