Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Authors: Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies verify the effectiveness of the proposed framework.
Researcher Affiliation | Collaboration | Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang; University of Illinois Urbana-Champaign; Salesforce AI Research.
Pseudocode | Yes | Algorithm 1: Pessimistic Equilibrium Learning from Human Feedback; Algorithm 2: Optimistic Equilibrium Learning from Human Feedback with Enhancer.
Open Source Code | Yes | We use the open-source project TRL to implement IPO and DPO. We have uploaded our codes.
Open Datasets | Yes | We use UltraFeedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). We train the preference model on a diverse set of open-source preference datasets including HH-RLHF [4], the Stanford Human Preferences Dataset (SHP) [23], UltraFeedback [16], HelpSteer [66], distilabel-capybara, distilabel-orca, and UltraInteract.
Dataset Splits | Yes | We use UltraFeedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). (Data-split sketch after the table.)
Hardware Specification | Yes | All experiments are conducted on 8 A100-40G GPUs with DeepSpeed ZeRO-3. (ZeRO-3 config sketch after the table.)
Software Dependencies | No | The paper mentions the 'axolotl' and 'TRL' packages but does not specify version numbers for them or for any other software dependencies.
Experiment Setup | Yes | We train the model for one epoch and use a batch size of 256, a learning rate of lr = 1e-5, and a cosine learning rate schedule with a warm-up ratio of 0.03. We search the hyper-parameter in {0.1, 0.5, 1.0} for IPO. Table 5: Hyper-parameters for reward modeling and preference model construction. (Training-setup sketch after the table.)
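
The dataset-splits row above describes a 60K/1K/3K partition of the UltraFeedback prompts but no split script is quoted, so the following is a minimal sketch of how such a split could be reproduced with the Hugging Face datasets library. The dataset identifier "openbmb/UltraFeedback", the seed, and the use of train_test_split are assumptions, not the authors' code.

```python
# Minimal sketch: splitting the UltraFeedback prompts into train (~60K),
# validation (~1K), and test (3K) sets, as described in the table above.
# The dataset id "openbmb/UltraFeedback" and seed 42 are assumptions.
from datasets import load_dataset

prompts = load_dataset("openbmb/UltraFeedback", split="train")

# Hold out roughly 4K prompts, then carve that into ~1K validation + 3K test.
first = prompts.train_test_split(test_size=4_000, seed=42)
second = first["test"].train_test_split(test_size=3_000, seed=42)

train_prompts = first["train"]    # ~60K prompts
valid_prompts = second["train"]   # ~1K prompts
test_prompts = second["test"]     # 3K prompts
```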
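The hardware row reports 8 A100-40G GPUs with DeepSpeed ZeRO-3 but no configuration file, so the snippet below writes a minimal ZeRO-3 config of the kind typically passed to the Hugging Face Trainer. The file name "ds_zero3.json" and every setting are assumptions rather than the authors' values.

```python
# Minimal sketch of a DeepSpeed ZeRO-3 config usable with the Hugging Face
# Trainer; all values are assumptions, not the authors' settings.
import json

ds_zero3 = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets the Trainer fill these fields from its own arguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_zero3, f, indent=2)
```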
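The experiment-setup row (one epoch, global batch size 256, learning rate 1e-5, cosine schedule, warm-up ratio 0.03) together with the TRL-based IPO/DPO implementation suggests a run of the following shape. This is a sketch only: the model path, the paired-preference dataset variable, the per-device batch split across the 8 GPUs, and the choice of beta as the searched hyper-parameter are assumptions, and depending on the TRL version the beta/loss_type arguments may live on DPOConfig or directly on DPOTrainer.

```python
# Minimal sketch of the quoted training setup using TRL's DPOTrainer, which
# also implements the IPO loss. Model path and dataset variable are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_path = "path/to/sft-checkpoint"  # placeholder, not from the paper
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

args = DPOConfig(
    output_dir="ipo-run",
    num_train_epochs=1,                # one epoch, as quoted
    per_device_train_batch_size=4,     # 8 GPUs x 4 x 8 accumulation = 256 global batch (assumed split)
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    loss_type="ipo",                   # use "sigmoid" for standard DPO
    beta=0.1,                          # the paper searches {0.1, 0.5, 1.0}; mapping it to beta is an assumption
    bf16=True,
    deepspeed="ds_zero3.json",         # the ZeRO-3 config sketched above
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=paired_preference_data,  # placeholder; columns "prompt", "chosen", "rejected" (assumed)
    tokenizer=tokenizer,                   # called `processing_class` in newer TRL releases
)
trainer.train()
```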