β-DPO: Direct Preference Optimization with Dynamic β

Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical evaluation, we demonstrate that our dynamic β adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback.
Researcher Affiliation | Collaboration | (1) University of Science and Technology of China, (2) Alibaba Group
Pseudocode | Yes | Algorithm 1: β-Direct Preference Optimization
Open Source Code | Yes | The code is available at https://github.com/junkangwu/beta-DPO.
Open Datasets | Yes | Datasets. We utilize the Anthropic HH dataset [3] for our experimental analysis...
Dataset Splits | No | The paper mentions training and test datasets but does not provide explicit percentages, counts, or file names for train/validation/test splits, and it does not define a separate validation split.
Hardware Specification | Yes | We carried out all computational tasks on a suite of four 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | Unless noted otherwise, we use β = 0.1, a batch size of 64, m = 0.9 to ensure the stability of the global M_i estimation, ρ = 0.8 to filter out 20% of uninformative samples, and the Adam optimizer with a learning rate of 5e-7 by default.
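To make the quoted setup concrete, the sketch below shows one plausible way a batch-level dynamic β could be wired into the standard DPO logistic loss: β is rescaled around the default β = 0.1 according to how the current batch's mean reward discrepancy deviates from a running estimate kept with momentum m = 0.9. This is a minimal illustration, not the authors' released implementation (see the linked repository for that); the function name beta_dpo_loss, the tensor arguments, and the scaling coefficient alpha are assumptions introduced here, and the ρ = 0.8 sample-filtering step from the setup is omitted.

```python
import torch.nn.functional as F


def beta_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps,
                  running_m, beta0=0.1, alpha=0.6, momentum=0.9):
    """Illustrative batch-level dynamic-beta DPO loss (sketch, not the paper's code).

    `running_m` is an exponential moving average of the batch-mean reward
    discrepancy, kept with momentum m = 0.9 as in the quoted setup.
    `alpha` is a hypothetical scaling coefficient for the beta adjustment.
    """
    # Implicit reward margins (policy vs. reference log-ratio gaps), as in standard DPO.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    margins = chosen_rewards - rejected_rewards          # shape: (batch,)

    # Current batch's mean reward discrepancy (no gradient through it).
    batch_m = margins.detach().mean()

    # Calibrate beta around beta0 = 0.1 by how far this batch deviates
    # from the running discrepancy estimate.
    beta = beta0 * (1.0 + alpha * (batch_m - running_m))

    # Update the running estimate for the next batch (momentum m = 0.9).
    running_m = momentum * running_m + (1.0 - momentum) * batch_m

    # Standard DPO logistic loss with the calibrated beta.
    # (The rho = 0.8 beta-guided sample filtering is omitted in this sketch.)
    loss = -F.logsigmoid(beta * margins).mean()
    return loss, running_m
```

In use, a training loop would thread running_m across batches (e.g. initialize it at 0.0 and feed the returned value back in), with the per-sample log-probabilities obtained by summing token log-probs of the chosen and rejected responses under the policy and the frozen reference model.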