Token-level Direct Preference Optimization

Authors: Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Venue: ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we demonstrate the superior performance of our algorithm in three different open-sourced datasets: the IMDb sentiment dataset (Maas et al., 2011), the Anthropic HH dataset (Bai et al., 2022), and MT-bench (Zheng et al., 2023).
Researcher Affiliation | Collaboration | (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Microsoft Research AI4Science; (4) University College London.
Pseudocode | Yes | Algorithm 1: Token-level Direct Preference Optimization (TDPO).
Open Source Code | Yes | Our code is open-sourced at https://github.com/Vance0124/Tokenlevel-Direct-Preference-Optimization.
Open Datasets | Yes | In this section, we demonstrate the superior performance of our algorithm in three different open-sourced datasets: the IMDb sentiment dataset (Maas et al., 2011), the Anthropic HH dataset (Bai et al., 2022), and MT-bench (Zheng et al., 2023).
Dataset Splits | No | The paper does not explicitly provide training/validation/test splits (percentages or sample counts) for the datasets used in the experiments.
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | Appendix B provides PyTorch code snippets using 'import torch' and 'torch.nn.functional', but does not give version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | Unless specified otherwise, we use α = 0.5, β = 0.1, a batch size of 64, and the RMSprop optimizer with a learning rate of 5e-6. We linearly warm up the learning rate from 0 to 5e-6 over 150 steps.
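
The Experiment Setup row pins down concrete hyperparameters (α = 0.5, β = 0.1, batch size 64, RMSprop at 5e-6 with a 150-step linear warmup). The sketch below shows one way those values could be wired into a token-level DPO-style training step in PyTorch. It is illustrative only: the function name tdpo_style_loss, the stand-in linear "policy", the placeholder per-sequence statistics, and the placement of α (weighting a sequential-KL gap with a stop-gradient on the chosen side) are assumptions made here, not the authors' Algorithm 1 or the Appendix B snippets.

```python
# Minimal sketch, assuming precomputed per-sequence statistics; not the authors' exact code.
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.5, 0.1                       # hyperparameters quoted in the Experiment Setup row
BATCH_SIZE, LR, WARMUP_STEPS = 64, 5e-6, 150

def tdpo_style_loss(chosen_logratio, rejected_logratio,
                    chosen_seq_kl, rejected_seq_kl,
                    alpha=ALPHA, beta=BETA):
    """Token-level-DPO-style preference loss (illustrative reading).

    chosen_logratio / rejected_logratio: per-sequence sums of log pi_theta - log pi_ref.
    chosen_seq_kl / rejected_seq_kl: per-sequence, token-summed KL estimates.
    """
    margin = chosen_logratio - rejected_logratio
    # Assumption: alpha weights the sequential-KL gap, with the chosen-side term detached.
    kl_gap = rejected_seq_kl - chosen_seq_kl.detach()
    return -F.logsigmoid(beta * (margin - alpha * kl_gap)).mean()

policy = torch.nn.Linear(16, 4)              # stand-in for the policy language model
optimizer = torch.optim.RMSprop(policy.parameters(), lr=LR)
# Linear warmup from ~0 to LR over the first 150 steps, constant afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

# One toy optimization step with placeholder statistics, to show the update shape only.
x_chosen = torch.randn(BATCH_SIZE, 16)
x_rejected = torch.randn(BATCH_SIZE, 16)
chosen_logratio = policy(x_chosen).sum(-1)   # real code: per-sequence log-prob ratios from
rejected_logratio = policy(x_rejected).sum(-1)  # the policy and a frozen reference model
chosen_seq_kl = torch.rand(BATCH_SIZE)       # real code: token-summed KL(pi_ref || pi_theta)
rejected_seq_kl = torch.rand(BATCH_SIZE)

loss = tdpo_style_loss(chosen_logratio, rejected_logratio, chosen_seq_kl, rejected_seq_kl)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

The LambdaLR step-count ramp approximates the quoted "linearly warm up the learning rate from 0 to 5e-6 over 150 steps"; the schedule used after warmup is not stated in the excerpt and is assumed constant here.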