Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
Authors: Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, f-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE). |
| Researcher Affiliation | Academia | Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen; Department of Computer Science, University of Chicago. Correspondence to chaoqi@uchicago.edu |
| Pseudocode | Yes | Algorithm 1 Direct Preference Optimization with f-divergences (DPO-f); see the loss-function sketch after the table. |
| Open Source Code | No | The paper refers to the original DPO implementation at 'https://github.com/eric-mitchell/direct-preference-optimization' but does not link to, or announce a release of, its own f-DPO source code. |
| Open Datasets | Yes | For the experiments, we adopt three datasets, including IMDB-sentiment dataset (Maas et al., 2011), Anthropic HH dataset (Bai et al., 2022a) and MT-bench (Zheng et al., 2023) for evaluation. |
| Dataset Splits | No | The paper mentions using test sets and refers to training configurations from prior work, but does not provide the split information (percentages, sample counts, or instructions for constructing the splits) for the training, validation, and test sets needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as the 'trlx library', 'GPT-2 large', the 'SiEBERT model', and 'RoBERTa-large', but does not provide version numbers for these components or libraries, which are necessary for reproducibility. |
| Experiment Setup | Yes | For PPO, we explored the divergence coefficient in {0.01, 0.03, 0.1, 0.3} for both PPO variants, each using ground-truth rewards. Our PPO implementation is based on the trlx library. Additionally, we adapted the official implementation of DPO with f-divergences from Rafailov et al. (2023), setting β at 0.1. ... To measure diversity, we generated 25 responses using nucleus sampling (Holtzman et al., 2020) with p = 0.95 for each prompt in the test set of the Anthropic HH dataset using temperatures of 0.6, 1.0, 1.4... In accordance with the official MT-Bench implementation (Zheng et al., 2023), we sampled responses with a temperature setting of 0.7 and limited the maximum number of newly generated tokens to 1024. See the generation sketch after the table. |
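For context on the Pseudocode row, here is a minimal sketch of a pairwise preference loss under an f-divergence constraint, following the general recipe the paper describes: the implicit reward is proportional to f' applied to the policy/reference density ratio. The divergence derivatives listed, the β = 0.1 default, and all function and variable names are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import torch
import torch.nn.functional as F

# Derivatives f'(u) of common f-divergence generators, where u = pi_theta(y|x) / pi_ref(y|x)
# and log_u is its log. Reverse KL (f(u) = u * log u) recovers the original DPO objective;
# the other entries follow standard f-divergence definitions and may differ from the
# paper's exact parameterization by additive constants (which cancel in the pairwise margin).
LOG2 = torch.log(torch.tensor(2.0))
F_PRIME = {
    "reverse_kl": lambda log_u: log_u,                             # f'(u) = log u + 1; the +1 cancels
    "forward_kl": lambda log_u: -torch.exp(-log_u),                # f(u) = -log u  ->  f'(u) = -1/u
    "jsd":        lambda log_u: log_u - F.softplus(log_u) + LOG2,  # f'(u) = log(2u / (u + 1))
}

def f_dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l,
               beta=0.1, divergence="reverse_kl"):
    """Pairwise preference loss under an f-divergence constraint (illustrative sketch).

    Each *_logps_* tensor holds the summed token log-probabilities of the chosen (w)
    or rejected (l) response under the policy or the frozen reference model.
    """
    f_prime = F_PRIME[divergence]
    log_ratio_w = policy_logps_w - ref_logps_w   # log u for the chosen response
    log_ratio_l = policy_logps_l - ref_logps_l   # log u for the rejected response
    # Implicit rewards are beta * f'(u); terms independent of the response cancel.
    margin = beta * (f_prime(log_ratio_w) - f_prime(log_ratio_l))
    return -F.logsigmoid(margin).mean()
```

With `divergence="reverse_kl"` the margin reduces to the familiar DPO log-ratio difference; swapping in the other entries changes only f', which is the generalization the pseudocode row refers to.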
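Similarly, the diversity-measurement settings quoted in the Experiment Setup row (25 responses per prompt via nucleus sampling with p = 0.95 at temperatures 0.6, 1.0, and 1.4) map onto standard Hugging Face generation arguments. The model name, prompt, and `max_new_tokens` below are placeholders rather than values taken from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper fine-tunes its own checkpoints on the Anthropic HH data.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

prompt = "Human: How do I start learning to cook?\n\nAssistant:"  # illustrative HH-style prompt
inputs = tokenizer(prompt, return_tensors="pt")

samples = {}
for temperature in (0.6, 1.0, 1.4):
    outputs = model.generate(
        **inputs,
        do_sample=True,           # stochastic decoding
        top_p=0.95,               # nucleus sampling threshold from the quoted setup
        temperature=temperature,
        num_return_sequences=25,  # 25 responses per prompt, as in the quoted setup
        max_new_tokens=128,       # illustrative cap; not specified for the diversity study
        pad_token_id=tokenizer.eos_token_id,
    )
    completions = outputs[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
    samples[temperature] = tokenizer.batch_decode(completions, skip_special_tokens=True)
```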