$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$
Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical evaluation, we demonstrate that our dynamic β adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. |
| Researcher Affiliation | Collaboration | 1University of Science and Technology of China 2Alibaba Group |
| Pseudocode | Yes | Algorithm 1 β-Direct Preference Optimization |
| Open Source Code | Yes | The code is available at https://github.com/junkangwu/beta-DPO. |
| Open Datasets | Yes | Datasets. We utilize the Anthropic HH dataset [3] for our experimental analysis... |
| Dataset Splits | No | The paper mentions using training and test datasets but does not explicitly provide specific percentages, counts, or file names for training, validation, or test splits, nor does it define a separate validation split. |
| Hardware Specification | Yes | We carried out all computational tasks on a suite of four 80GB A100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | Unless noted otherwise, we use β = 0.1, batch size of 64, m = 0.9 to ensure the stability of the global Mi estimation, ρ = 0.8 to filter 20% uninformative samples, and the Adam optimizer with a learning rate of 5e-7 by default. |
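
The experiment-setup cell above lists the reported defaults (β = 0.1, momentum m = 0.9 for the global Mᵢ estimate, ρ = 0.8 so that 20% of samples are filtered out, Adam at 5e-7). As a minimal PyTorch sketch of how a batch-level dynamic β and margin-based filtering could be wired together: the names `DynamicBeta`, `alpha`, the distance-to-baseline filtering rule, and the exact batch-level β formula are illustrative assumptions, not the authors' implementation; the released repository (https://github.com/junkangwu/beta-DPO) contains the actual code.

```python
import torch
import torch.nn.functional as F


def dpo_implicit_margins(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l):
    """Per-sample implicit reward discrepancy M_i used by DPO:
    M_i = [log pi(y_w|x) - log pi_ref(y_w|x)] - [log pi(y_l|x) - log pi_ref(y_l|x)]."""
    return (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)


class DynamicBeta:
    """Batch-level beta calibration with a momentum-tracked margin baseline.

    beta0=0.1, momentum=0.9, and rho=0.8 follow the reported defaults; the
    scaling factor `alpha` and the keep-closest-to-baseline filtering rule
    are assumptions made for this sketch.
    """

    def __init__(self, beta0=0.1, momentum=0.9, rho=0.8, alpha=0.6):
        self.beta0 = beta0
        self.m = momentum
        self.rho = rho
        self.alpha = alpha
        self.running_mean = None  # moving estimate of the global margin baseline

    @torch.no_grad()
    def update_and_filter(self, margins):
        # Update the global baseline with momentum m for stability.
        batch_mean = margins.mean()
        if self.running_mean is None:
            self.running_mean = batch_mean
        else:
            self.running_mean = self.m * self.running_mean + (1 - self.m) * batch_mean

        # Keep the rho fraction of samples whose margins lie closest to the
        # running baseline (assumed proxy for "informative" pairs).
        distances = (margins - self.running_mean).abs()
        k = max(1, int(self.rho * margins.numel()))
        keep_idx = torch.topk(-distances, k).indices

        # Batch-level beta: scale beta0 with the filtered mean margin
        # relative to the running baseline (assumed linear form).
        beta = self.beta0 * (1 + self.alpha * (margins[keep_idx].mean() - self.running_mean))
        return beta.clamp(min=1e-3), keep_idx


def beta_dpo_loss(policy_w, policy_l, ref_w, ref_l, dyn_beta):
    """DPO loss with a dynamically calibrated beta and margin-based filtering."""
    margins = dpo_implicit_margins(policy_w, policy_l, ref_w, ref_l)
    beta, keep_idx = dyn_beta.update_and_filter(margins.detach())
    return -F.logsigmoid(beta * margins[keep_idx]).mean()
```

In this sketch, the filtering and the β update operate on detached margins, so they act purely as a per-batch schedule on top of the standard DPO sigmoid loss; with `rho=1.0` and `alpha=0.0` the loss reduces to vanilla DPO at β = 0.1.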