reproducibilityindex.ai

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Authors: Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical studies verify the effectiveness of the proposed framework.
Researcher Affiliation	Collaboration	Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang. University of Illinois Urbana-Champaign; Salesforce AI Research.
Pseudocode	Yes	Algorithm 1 Pessimistic Equilibrium Learning from Human Feedback, Algorithm 2 Optimistic Equilibrium Learning from Human Feedback with Enhancer
Open Source Code	Yes	We use the open-source project TRL to implement IPO and DPO. We have uploaded our code.
Open Datasets	Yes	We use the Ultra-feedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). We train the preference model on a diverse set of open-source preference datasets including HH-RLHF [4], Stanford Human Preferences Dataset (SHP) [23], Ultra-feedback [16], HelpSteer [66], distilabel-capybara, distilabel-orca, and UltraInteract.
Dataset Splits	Yes	We use the Ultra-feedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K).
Hardware Specification	Yes	All experiments are conducted on 8 A100-40G with DeepSpeed ZeRO-3.
Software Dependencies	No	The paper mentions using 'axolotl' and 'TRL' packages but does not specify their version numbers or other software dependencies with version numbers.
Experiment Setup	Yes	We train the model for one epoch and use a batch size of 256, a learning rate of lr = 1e-5, and a cosine learning rate schedule with a warm-up ratio of 0.03. We search the hyper-parameter in {0.1, 0.5, 1.0} for IPO. Table 5: Hyper-parameters for reward modeling and preference model construction.
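
The learning-rate settings quoted in the Experiment Setup row (lr = 1e-5, cosine schedule, warm-up ratio 0.03) can be illustrated with a minimal sketch. It assumes a linear warm-up and a cosine decay to zero, which is a common convention but is not confirmed by the quoted text. The function name `lr_at_step` and the step count are hypothetical: 235 steps is simply 60K examples divided by a batch size of 256, and the source does not say how many steps were actually run or which training stage these settings belong to.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.03):
    """Learning rate at a given optimizer step.

    Assumed shape (not stated in the source): linear warm-up from 0 to
    base_lr over the first warmup_ratio of training, then cosine decay to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical step count: one epoch over 60K examples at batch size 256.
total_steps = math.ceil(60_000 / 256)
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
```

Under these assumptions the rate starts at zero, reaches its 1e-5 peak once the warm-up ends, and decays smoothly back to zero by the final step.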