Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
Authors: Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies verify the effectiveness of the proposed framework. |
| Researcher Affiliation | Collaboration | Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang; University of Illinois Urbana-Champaign and Salesforce AI Research. |
| Pseudocode | Yes | Algorithm 1: Pessimistic Equilibrium Learning from Human Feedback; Algorithm 2: Optimistic Equilibrium Learning from Human Feedback with Enhancer |
| Open Source Code | Yes | We use the open-source project TRL to implement IPO and DPO. We have uploaded our codes. |
| Open Datasets | Yes | We use the Ultra-feedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). We train the preference model on a diverse set of open-source preference datasets including HH-RLHF [4], Stanford Human Preferences Dataset (SHP) [23], Ultra-feedback [16], HelpSteer [66], distilabel-capybara, distilabel-orca, and UltraInteract. |
| Dataset Splits | Yes | We use the Ultra-feedback [16] as our prompt set. We divide the prompt set into the train set (60K), validation set (1K), and test set (3K). |
| Hardware Specification | Yes | All experiments are conducted on 8 A100-40G GPUs with DeepSpeed ZeRO-3. |
| Software Dependencies | No | The paper mentions using the 'axolotl' and 'TRL' packages but does not specify their version numbers or list other software dependencies with versions. |
| Experiment Setup | Yes | We train the model for one epoch and use a batch size of 256, a learning rate of 1e-5, and a cosine learning rate schedule with a warm-up ratio of 0.03. We search the hyper-parameter over {0.1, 0.5, 1.0} for IPO. Table 5: Hyper-parameters for reward modeling and preference model construction. |
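For illustration, the 60K/1K/3K prompt split reported in the Dataset Splits row can be reproduced along the lines of the sketch below. This is a minimal sketch, not the authors' code: the Hugging Face dataset identifier, the shuffle seed, and the ordering of the split are assumptions.

```python
# Minimal sketch of the prompt-set split described in the paper:
# 60K train / 1K validation / 3K test prompts from Ultra-feedback.
# The dataset id "openbmb/UltraFeedback" and the shuffle seed are assumptions.
from datasets import load_dataset

prompts = load_dataset("openbmb/UltraFeedback", split="train")
prompts = prompts.shuffle(seed=42)  # seed not reported in the paper

train_set = prompts.select(range(60_000))           # 60K training prompts
valid_set = prompts.select(range(60_000, 61_000))   # 1K validation prompts
test_set = prompts.select(range(61_000, 64_000))    # 3K test prompts

print(len(train_set), len(valid_set), len(test_set))  # 60000 1000 3000
```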
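Similarly, the Experiment Setup row maps naturally onto a TRL training configuration. The sketch below assumes a recent TRL release in which DPOConfig carries the beta and loss_type fields; the output path, per-device batch size, and the gradient-accumulation factorization of the 256 effective batch size are assumptions, and exact argument names may differ across TRL versions.

```python
from trl import DPOConfig, DPOTrainer  # the paper does not pin a TRL version

# Hyper-parameters taken from the Experiment Setup row: one epoch,
# effective batch size 256, learning rate 1e-5, cosine schedule,
# warm-up ratio 0.03. The factorization 8 GPUs x 8 x 4 = 256 is assumed.
training_args = DPOConfig(
    output_dir="ipo-ultrafeedback",   # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    loss_type="ipo",                  # use "sigmoid" for standard DPO
    beta=0.5,                         # searched over {0.1, 0.5, 1.0} for IPO
)

# trainer = DPOTrainer(model=model, args=training_args,
#                      train_dataset=preference_data, processing_class=tokenizer)
# trainer.train()
```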