Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, Chelsea Finn
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. |
| Researcher Affiliation | Academia | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn; Stanford University, CZ Biohub; {rafailov,architsh,eric.mitchell}@cs.stanford.edu |
| Pseudocode | No | The paper describes the DPO pipeline in Section 4, but as descriptive text rather than a structured pseudocode block or algorithm figure; a hedged sketch of the objective is given below the table. |
| Open Source Code | No | The paper links to third-party models and frameworks used for comparison or as starting points (e.g., 'https://huggingface.co/CarperAI/openai_summarize_tldr_sft', 'https://huggingface.co/reciprocate/ppo_hh_pythia-6B', 'https://github.com/CarperAI/trlx/tree/main/examples/hh'), but it does not provide a link to, or an explicit statement about releasing, the authors' own implementation of the DPO method. |
| Open Datasets | Yes | In controlled sentiment generation, x is a prefix of a movie review from the IMDb dataset [22]... In summarization, x is a forum post from Reddit... we use the Reddit TL;DR summarization dataset [41]... in single-turn dialogue... we use the Anthropic Helpful and Harmless dialogue dataset [1]... Table 1: GPT-4 win rates vs. ground truth summaries for out-of-distribution CNN/Daily Mail input articles. |
| Dataset Splits | No | The paper mentions using the 'train split of the IMDB dataset' and a 'test split' for other datasets, but it does not consistently give explicit train/validation/test split details (percentages, sample counts, or citations to predefined splits) across all experiments in the main text. |
| Hardware Specification | No | The paper mentions that 'The Stanford Center for Research on Foundation Models (CRFM) provided part of the compute resources used for the experiments in this work' (Acknowledgements), but it does not specify any exact GPU/CPU models, processor types, or memory details. |
| Software Dependencies | No | The paper mentions models such as 'GPT-J [43]' and 'Pythia-2.8B [3]' and frameworks such as 'TRLX [42]', but it does not provide specific version numbers for the software used in its implementation or experiments. |
| Experiment Setup | Yes | We execute multiple training runs for each algorithm, using a different hyperparameter for policy conservativeness in each run (target KL ∈ {3, 6, 9, 12} for PPO, β ∈ {0.05, 0.1, 1, 5} for DPO, α ∈ {0.05, 0.1, 0.5, 1} for unlikelihood, random seeds for preferred-FT). |
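
Since the paper gives no pseudocode block (see the Pseudocode row above), the following minimal sketch of the DPO objective from Section 4 may help orient a reproduction attempt. It assumes PyTorch and summed per-token log-probabilities as inputs; the function and argument names are illustrative and do not come from any code released by the authors.

```python
# Minimal sketch of the DPO objective, assuming PyTorch. Names are
# illustrative; this is not the authors' released implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of summed per-token log-probabilities log pi(y|x)
    for the preferred (chosen) or dispreferred (rejected) completion, under
    either the trained policy or the frozen reference model. `beta` is the
    policy-conservativeness coefficient swept over {0.05, 0.1, 1, 5} in the
    paper's experiments."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize log sigmoid of the implicit reward margin between the
    # preferred and dispreferred completions.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

In a training loop, the log-probabilities would be computed for each (prompt, chosen, rejected) triple under both the current policy and the frozen SFT reference model, with gradients flowing only through the policy.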