Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
Authors: Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, f-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE). |
| Researcher Affiliation | Academia | Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen; Department of Computer Science, University of Chicago. Correspondence to chaoqi@uchicago.edu |
| Pseudocode | Yes | Algorithm 1 Direct Preference Optimization with f-divergences (DPO-f); see the loss-function sketch after the table. |
| Open Source Code | No | The paper refers to the original DPO implementation at 'https://github.com/eric-mitchell/direct-preference-optimization' but does not link to, or announce a release of, its own f-DPO source code. |
| Open Datasets | Yes | For the experiments, we adopt three datasets, including IMDB-sentiment dataset (Maas et al., 2011), Anthropic HH dataset (Bai et al., 2022a) and MT-bench (Zheng et al., 2023) for evaluation. |
| Dataset Splits | No | The paper mentions using test sets and refers to training configurations from prior work, but does not provide the split information (percentages, sample counts, or instructions for constructing the splits) for the training, validation, and test sets needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as the 'trlx library', 'GPT-2 large', the 'SiEBERT model', and 'RoBERTa-large', but does not provide version numbers for these components or libraries, which are necessary for reproducibility. |
| Experiment Setup | Yes | For PPO, we explored the divergence coefficient in {0.01, 0.03, 0.1, 0.3} for both PPO variants, each using ground-truth rewards. Our PPO implementation is based on the trlx library. Additionally, we adapted the official implementation of DPO with f-divergences from Rafailov et al. (2023), setting β at 0.1. ... To measure diversity, we generated 25 responses using nucleus sampling (Holtzman et al., 2020) with p = 0.95 for each prompt in the test set of the Anthropic HH dataset using temperatures of 0.6, 1.0, 1.4... In accordance with the official MT-Bench implementation (Zheng et al., 2023), we sampled responses with a temperature setting of 0.7 and limited the maximum number of newly generated tokens to 1024. See the generation sketch after the table. |
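For context on the Pseudocode row, here is a minimal sketch of a pairwise preference loss under an f-divergence constraint, following the general recipe the paper describes: the implicit reward is proportional to f' applied to the policy/reference density ratio. The divergence derivatives listed, the β = 0.1 default, and all function and variable names are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import torch
import torch.nn.functional as F

# Derivatives f'(u) of common f-divergence generators, where u = pi_theta(y|x) / pi_ref(y|x)
# and log_u is its log. Reverse KL (f(u) = u * log u) recovers the original DPO objective;
# the other entries follow standard f-divergence definitions and may differ from the
# paper's exact parameterization by additive constants (which cancel in the pairwise margin).
LOG2 = torch.log(torch.tensor(2.0))
F_PRIME = {
    "reverse_kl": lambda log_u: log_u,                             # f'(u) = log u + 1; the +1 cancels
    "forward_kl": lambda log_u: -torch.exp(-log_u),                # f(u) = -log u  ->  f'(u) = -1/u
    "jsd":        lambda log_u: log_u - F.softplus(log_u) + LOG2,  # f'(u) = log(2u / (u + 1))
}

def f_dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l,
               beta=0.1, divergence="reverse_kl"):
    """Pairwise preference loss under an f-divergence constraint (illustrative sketch).

    Each *_logps_* tensor holds the summed token log-probabilities of the chosen (w)
    or rejected (l) response under the policy or the frozen reference model.
    """
    f_prime = F_PRIME[divergence]
    log_ratio_w = policy_logps_w - ref_logps_w   # log u for the chosen response
    log_ratio_l = policy_logps_l - ref_logps_l   # log u for the rejected response
    # Implicit rewards are beta * f'(u); terms independent of the response cancel.
    margin = beta * (f_prime(log_ratio_w) - f_prime(log_ratio_l))
    return -F.logsigmoid(margin).mean()
```

With `divergence="reverse_kl"` the margin reduces to the familiar DPO log-ratio difference; swapping in the other entries changes only f', which is the generalization the pseudocode row refers to.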
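Similarly, the diversity-measurement settings quoted in the Experiment Setup row (25 responses per prompt via nucleus sampling with p = 0.95 at temperatures 0.6, 1.0, and 1.4) map onto standard Hugging Face generation arguments. The model name, prompt, and `max_new_tokens` below are placeholders rather than values taken from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper fine-tunes its own checkpoints on the Anthropic HH data.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

prompt = "Human: How do I start learning to cook?\n\nAssistant:"  # illustrative HH-style prompt
inputs = tokenizer(prompt, return_tensors="pt")

samples = {}
for temperature in (0.6, 1.0, 1.4):
    outputs = model.generate(
        **inputs,
        do_sample=True,           # stochastic decoding
        top_p=0.95,               # nucleus sampling threshold from the quoted setup
        temperature=temperature,
        num_return_sequences=25,  # 25 responses per prompt, as in the quoted setup
        max_new_tokens=128,       # illustrative cap; not specified for the diversity study
        pad_token_id=tokenizer.eos_token_id,
    )
    completions = outputs[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
    samples[temperature] = tokenizer.batch_decode(completions, skip_special_tokens=True)
```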