Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment
Authors: Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant Honavar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods. |
| Researcher Affiliation | Collaboration | Teng Xiao (1), Yige Yuan (2), Huaisheng Zhu (1), Mingxiao Li (3), Vasant G. Honavar (1). Affiliations: (1) Artificial Intelligence Research Laboratory, Pennsylvania State University; (2) University of Chinese Academy of Sciences; (3) Tencent AI Lab. Emails: {tengxiao,hvz5312,vhonavar}@psu.edu, yuanyige923@gmail.com, mingxiaoli@tencent.com |
| Pseudocode | Yes | Algorithm 1: A Pytorch-style Pseudocode of Cal-DPO |
| Open Source Code | Yes | Code is available at https://github.com/tengxiao1/Cal-DPO. |
| Open Datasets | Yes | We evaluate Cal-DPO on four widely used datasets for preference fine-tuning: the UltraFeedback Binarized dataset [53, 54], Reddit TL;DR summarization dataset [14], Anthropic-HH dataset [1], and the IMDb sentiment dataset [13]. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized, https://huggingface.co/datasets/Anthropic/hh-rlhf, https://huggingface.co/datasets/openai/summarize_from_feedback, https://huggingface.co/datasets/stanfordnlp/imdb |
| Dataset Splits | Yes | IMDb Sentiment [13]: This dataset contains movie reviews from IMDb with positive and negative sentiment; it has 25k training samples and 5k samples each for validation and test. |
| Hardware Specification | Yes | The experiments are run on 4 Nvidia A100 GPUs with BF16 precision. |
| Software Dependencies | No | The paper mentions 'Pytorch-style Pseudocode' but does not specify version numbers for PyTorch or any other software libraries or dependencies. |
| Experiment Setup | Yes | The β of Cal-DPO is searched from [1e-3, 2e-3, 3e-3, 1e-2, 1e-1], the batch size for all methods is 128, and we use the RMSprop optimizer with a learning rate of 5e-6. We linearly warm up the learning rate from 0 to 5e-6 in 150 steps. The sampling temperature is set to 1 for all experiments. |
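To make the Pseudocode row concrete, below is an illustrative PyTorch sketch of the standard DPO loss that Cal-DPO calibrates. It is not the paper's Algorithm 1: the actual Cal-DPO objective additionally calibrates these implicit rewards, and the exact code is available in the linked repository. The function name, argument names, and the default `beta` are placeholders.

```python
# Illustrative sketch only: the standard DPO contrastive loss that Cal-DPO
# builds on, written in PyTorch style. The paper's Algorithm 1 additionally
# calibrates the implicit rewards; see the linked repository for the exact code.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=1e-2):
    # Implicit rewards: beta-scaled log-ratios of the policy vs. a frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```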
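The Hub identifiers quoted in the Open Datasets row can be pulled with the `datasets` library; a minimal loading sketch follows. Assumptions: the library is installed, and the split/config names follow the public dataset cards (e.g. the `comparisons` config for the TL;DR preference data).

```python
# Minimal sketch: loading the four datasets quoted in the Open Datasets row
# from the Hugging Face Hub. Assumes the `datasets` library is installed;
# the "comparisons" config for TL;DR is an assumption based on the dataset card.
from datasets import load_dataset

ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
hh_rlhf = load_dataset("Anthropic/hh-rlhf")
tldr = load_dataset("openai/summarize_from_feedback", "comparisons")
imdb = load_dataset("stanfordnlp/imdb")

for name, ds in [("ultrafeedback", ultrafeedback), ("hh-rlhf", hh_rlhf),
                 ("tldr", tldr), ("imdb", imdb)]:
    print(name, list(ds.keys()))  # available splits per dataset
```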
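The Experiment Setup row fixes the optimizer and schedule (RMSprop at 5e-6, linear warm-up over 150 steps, batch size 128, β searched over a small grid). A minimal PyTorch sketch of that configuration is below; `model` is a stand-in for the policy being fine-tuned, not part of the paper.

```python
# Minimal sketch of the training configuration quoted in the Experiment Setup
# row: RMSprop at lr 5e-6 with a linear warm-up from 0 over the first 150 steps.
# `model` is a placeholder; the real policy is a pretrained language model.
import torch

model = torch.nn.Linear(8, 8)  # placeholder module

beta_grid = [1e-3, 2e-3, 3e-3, 1e-2, 1e-1]  # search grid for Cal-DPO's beta
batch_size = 128                            # batch size used for all methods
warmup_steps = 150

optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-6)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
```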