Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, Chelsea Finn

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
Researcher Affiliation | Academia | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn (Stanford University; CZ Biohub) {rafailov,architsh,eric.mitchell}@cs.stanford.edu
Pseudocode | No | The paper describes the DPO pipeline in Section 4, but it presents it as descriptive text rather than a structured pseudocode block or algorithm figure. (A minimal sketch of the DPO objective is included after this table.)
Open Source Code | No | The paper links to other projects' models and frameworks (e.g., 'https://huggingface.co/CarperAI/openai_summarize_tldr_sft', 'https://huggingface.co/reciprocate/ppo_hh_pythia-6B', 'https://github.com/CarperAI/trlx/tree/main/examples/hh'), which were used for comparison or as starting points. However, it does not provide a link to, or an explicit statement about releasing, the authors' own implementation of the DPO method.
Open Datasets | Yes | In controlled sentiment generation, x is a prefix of a movie review from the IMDb dataset [22]... In summarization, x is a forum post from Reddit... we use the Reddit TL;DR summarization dataset [41]... in single-turn dialogue... we use the Anthropic Helpful and Harmless dialogue dataset [1]... Table 1: GPT-4 win rates vs. ground truth summaries for out-of-distribution CNN/Daily Mail input articles. (An illustrative loading snippet for these public datasets follows the table.)
Dataset Splits | No | The paper mentions using the 'train split of the IMDB dataset' and a 'test split' for other datasets, but the main text does not consistently report training, validation, and test splits with percentages, sample counts, or citations to predefined splits for all experiments.
Hardware Specification | No | The paper acknowledges that 'The Stanford Center for Research on Foundation Models (CRFM) provided part of the compute resources used for the experiments in this work' (Acknowledgements), but it does not specify exact GPU/CPU models, processor types, or memory details.
Software Dependencies | No | The paper mentions models such as 'GPT-J [43]' and 'Pythia-2.8B [3]' and frameworks such as 'TRLX [42]', but it does not provide version numbers for the ancillary software used in the implementation or experiments.
Experiment Setup | Yes | We execute multiple training runs for each algorithm, using a different hyperparameter for policy conservativeness in each run (target KL ∈ {3, 6, 9, 12} for PPO, β ∈ {0.05, 0.1, 1, 5} for DPO, α ∈ {0.05, 0.1, 0.5, 1} for unlikelihood, random seeds for preferred-FT). (An illustrative sweep grid follows the table.)
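The Pseudocode row above notes that the DPO pipeline in Section 4 is given only as descriptive text. As a stand-in, here is a minimal sketch of the DPO objective as it is commonly implemented; the function name, argument names, and the default beta are illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a batch of summed per-token log-probabilities of the preferred
    ("chosen") or dispreferred ("rejected") completion under the trainable policy or
    the frozen reference (SFT) model. beta trades off fitting the preference data
    against staying close to the reference policy.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```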
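The Open Datasets row lists the three public datasets used in the experiments. As one illustration of their availability, they can be pulled with the Hugging Face `datasets` library; the hub IDs below are assumptions of mine, since the paper cites the datasets but does not give hub identifiers.

```python
from datasets import load_dataset

# Hub IDs are assumed, not taken from the paper.
imdb = load_dataset("imdb", split="train")             # movie-review prefixes for controlled sentiment generation
tldr = load_dataset("CarperAI/openai_summarize_tldr")  # Reddit TL;DR summarization posts
hh = load_dataset("Anthropic/hh-rlhf")                 # Anthropic Helpful and Harmless dialogue preferences
```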
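The Experiment Setup row quotes a per-algorithm sweep over conservativeness hyperparameters. Written out as a configuration grid it looks roughly like the sketch below; the dictionary layout and the seed values are illustrative assumptions, and only the swept hyperparameter values come from the quoted sentence.

```python
# Illustrative sweep grid reconstructed from the quoted setup; structure and
# seed values are assumptions, the swept values are those reported in the paper.
sweeps = {
    "ppo": {"target_kl": [3, 6, 9, 12]},
    "dpo": {"beta": [0.05, 0.1, 1, 5]},
    "unlikelihood": {"alpha": [0.05, 0.1, 0.5, 1]},
    "preferred_ft": {"seed": [0, 1, 2, 3]},  # paper only says "random seeds"; these values are placeholders
}

for method, grid in sweeps.items():
    for name, values in grid.items():
        for value in values:
            print(f"launch {method} run with {name}={value}")
```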