Token-level Direct Preference Optimization

Authors: Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Venue: ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we demonstrate the superior performance of our algorithm in three different open-sourced datasets: the IMDb sentiment dataset (Maas et al., 2011), the Anthropic HH dataset (Bai et al., 2022), and MT-bench (Zheng et al., 2023).
Researcher Affiliation | Collaboration | (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Microsoft Research AI4Science; (4) University College London.
Pseudocode | Yes | Algorithm 1: Token-level Direct Preference Optimization (TDPO).
Open Source Code | Yes | Our code is open-sourced at https://github.com/Vance0124/Tokenlevel-Direct-Preference-Optimization.
Open Datasets | Yes | In this section, we demonstrate the superior performance of our algorithm in three different open-sourced datasets: the IMDb sentiment dataset (Maas et al., 2011), the Anthropic HH dataset (Bai et al., 2022), and MT-bench (Zheng et al., 2023).
Dataset Splits | No | The paper does not explicitly provide training/validation/test splits (percentages or sample counts) for the datasets used in the experiments.
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | Appendix B provides PyTorch code snippets using 'import torch' and 'torch.nn.functional', but does not give version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | Unless specified otherwise, we use α = 0.5, β = 0.1, a batch size of 64, and the RMSprop optimizer with a learning rate of 5e-6. We linearly warm up the learning rate from 0 to 5e-6 over 150 steps.
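
The Experiment Setup row pins down concrete hyperparameters (α = 0.5, β = 0.1, batch size 64, RMSprop at 5e-6 with a 150-step linear warmup). The sketch below shows one way those values could be wired into a token-level DPO-style training step in PyTorch. It is illustrative only: the function name tdpo_style_loss, the stand-in linear "policy", the placeholder per-sequence statistics, and the placement of α (weighting a sequential-KL gap with a stop-gradient on the chosen side) are assumptions made here, not the authors' Algorithm 1 or the Appendix B snippets.

```python
# Minimal sketch, assuming precomputed per-sequence statistics; not the authors' exact code.
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.5, 0.1                       # hyperparameters quoted in the Experiment Setup row
BATCH_SIZE, LR, WARMUP_STEPS = 64, 5e-6, 150

def tdpo_style_loss(chosen_logratio, rejected_logratio,
                    chosen_seq_kl, rejected_seq_kl,
                    alpha=ALPHA, beta=BETA):
    """Token-level-DPO-style preference loss (illustrative reading).

    chosen_logratio / rejected_logratio: per-sequence sums of log pi_theta - log pi_ref.
    chosen_seq_kl / rejected_seq_kl: per-sequence, token-summed KL estimates.
    """
    margin = chosen_logratio - rejected_logratio
    # Assumption: alpha weights the sequential-KL gap, with the chosen-side term detached.
    kl_gap = rejected_seq_kl - chosen_seq_kl.detach()
    return -F.logsigmoid(beta * (margin - alpha * kl_gap)).mean()

policy = torch.nn.Linear(16, 4)              # stand-in for the policy language model
optimizer = torch.optim.RMSprop(policy.parameters(), lr=LR)
# Linear warmup from ~0 to LR over the first 150 steps, constant afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

# One toy optimization step with placeholder statistics, to show the update shape only.
x_chosen = torch.randn(BATCH_SIZE, 16)
x_rejected = torch.randn(BATCH_SIZE, 16)
chosen_logratio = policy(x_chosen).sum(-1)   # real code: per-sequence log-prob ratios from
rejected_logratio = policy(x_rejected).sum(-1)  # the policy and a frozen reference model
chosen_seq_kl = torch.rand(BATCH_SIZE)       # real code: token-summed KL(pi_ref || pi_theta)
rejected_seq_kl = torch.rand(BATCH_SIZE)

loss = tdpo_style_loss(chosen_logratio, rejected_logratio, chosen_seq_kl, rejected_seq_kl)
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

The LambdaLR step-count ramp approximates the quoted "linearly warm up the learning rate from 0 to 5e-6 over 150 steps"; the schedule used after warmup is not stated in the excerpt and is assumed constant here.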