β-DPO: Direct Preference Optimization with Dynamic β

Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical evaluation, we demonstrate that our dynamic β adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback.
Researcher Affiliation | Collaboration | (1) University of Science and Technology of China, (2) Alibaba Group
Pseudocode | Yes | Algorithm 1: β-Direct Preference Optimization
Open Source Code | Yes | The code is available at https://github.com/junkangwu/beta-DPO.
Open Datasets | Yes | Datasets. We utilize the Anthropic HH dataset [3] for our experimental analysis...
Dataset Splits | No | The paper mentions training and test datasets but does not provide explicit percentages, counts, or file names for train/validation/test splits, and it does not define a separate validation split.
Hardware Specification | Yes | We carried out all computational tasks on a suite of four 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | Unless noted otherwise, we use β = 0.1, a batch size of 64, m = 0.9 to ensure the stability of the global M_i estimation, ρ = 0.8 to filter out 20% of uninformative samples, and the Adam optimizer with a learning rate of 5e-7 by default.
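To make the quoted setup concrete, the sketch below shows one plausible way a batch-level dynamic β could be wired into the standard DPO logistic loss: β is rescaled around the default β = 0.1 according to how the current batch's mean reward discrepancy deviates from a running estimate kept with momentum m = 0.9. This is a minimal illustration, not the authors' released implementation (see the linked repository for that); the function name beta_dpo_loss, the tensor arguments, and the scaling coefficient alpha are assumptions introduced here, and the ρ = 0.8 sample-filtering step from the setup is omitted.

```python
import torch.nn.functional as F


def beta_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps,
                  running_m, beta0=0.1, alpha=0.6, momentum=0.9):
    """Illustrative batch-level dynamic-beta DPO loss (sketch, not the paper's code).

    `running_m` is an exponential moving average of the batch-mean reward
    discrepancy, kept with momentum m = 0.9 as in the quoted setup.
    `alpha` is a hypothetical scaling coefficient for the beta adjustment.
    """
    # Implicit reward margins (policy vs. reference log-ratio gaps), as in standard DPO.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    margins = chosen_rewards - rejected_rewards          # shape: (batch,)

    # Current batch's mean reward discrepancy (no gradient through it).
    batch_m = margins.detach().mean()

    # Calibrate beta around beta0 = 0.1 by how far this batch deviates
    # from the running discrepancy estimate.
    beta = beta0 * (1.0 + alpha * (batch_m - running_m))

    # Update the running estimate for the next batch (momentum m = 0.9).
    running_m = momentum * running_m + (1.0 - momentum) * batch_m

    # Standard DPO logistic loss with the calibrated beta.
    # (The rho = 0.8 beta-guided sample filtering is omitted in this sketch.)
    loss = -F.logsigmoid(beta * margins).mean()
    return loss, running_m
```

In use, a training loop would thread running_m across batches (e.g. initialize it at 0.0 and feed the returned value back in), with the per-sample log-probabilities obtained by summing token log-probs of the chosen and rejected responses under the policy and the frozen reference model.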