Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
Authors: Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies verify the effectiveness of the proposed framework. |
| Researcher Affiliation | Collaboration | Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang. University of Illinois Urbana-Champaign; Salesforce AI Research. |
| Pseudocode | Yes | Algorithm 1 Pessimistic Equilibrium Learning from Human Feedback, Algorithm 2 Optimistic Equilibrium Learning from Human Feedback with Enhancer |
| Open Source Code | Yes | We use the open-source project TRL to implement IPO and DPO. We have uploaded our codes. |
| Open Datasets | Yes | We use the Ultra-feedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). We train the preference model on a diverse set of open-source preference datasets including HH-RLHF [4], Stanford Human Preferences Dataset (SHP) [23], Ultra-feedback [16], HelpSteer [66], distilabel-capybara, distilabel-orca, and UltraInteract. |
| Dataset Splits | Yes | We use the Ultra-feedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). |
| Hardware Specification | Yes | All experiments are conducted on 8 A100-40G with DeepSpeed ZeRO-3. |
| Software Dependencies | No | The paper mentions using 'axolotl' and 'TRL' packages but does not specify their version numbers or other software dependencies with version numbers. |
| Experiment Setup | Yes | We train the model for one epoch and use a batch size of 256, a learning rate of lr = 1e-5, and a cosine learning rate schedule with a warm-up ratio of 0.03. We search the hyper-parameter in {0.1, 0.5, 1.0} for IPO. Table 5: Hyper-parameters for reward modeling and preference model construction. |
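
The Experiment Setup row quotes a learning rate of 1e-5 with a cosine schedule and a warm-up ratio of 0.03. As a rough illustration of what such a schedule looks like, the sketch below computes the learning rate per optimizer step. It is not the authors' code: the linear warm-up shape, the decay to zero, the function name `lr_at_step`, and the step count of 1000 are all assumptions made for the example.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.03):
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the source): warm-up is linear from 0,
    and the cosine phase decays to 0 at total_steps.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at_step(s, total))
```

With these assumptions, the rate climbs linearly over the first 3% of steps, peaks at the base value, then follows a half cosine down to zero. The actual number of steps depends on the dataset size and the batch size of 256, which the table does not tie together.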