Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
What Matters in Data for DPO?
Authors: Yu Pan, Zhongze Cai, Huaiyang Zhong, Guanting Chen, Chonghuan Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. ... Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data. |
| Researcher Affiliation | Academia | 1 University of Sydney, 2 Imperial College London, 3 University of North Carolina at Chapel Hill, 4 Virginia Tech, 5 University of Texas at Dallas |
| Pseudocode | Yes | Algorithm 1 On-Policy Data Generation Process |
| Open Source Code | Yes | Our code is publicly available on GitHub (What Matter4DPOData) and the data is hosted on Hugging Face. |
| Open Datasets | Yes | According to the data recipe of our base model, we select two public datasets, LAIONAI's Open Assistant 2 (Köpf et al., 2023) and OpenBMB's UltraFeedback (Cui et al., 2023), as the prompt datasets for our DPO training. |
| Dataset Splits | Yes | After filtering, the Open Assistant 2 and UltraFeedback datasets contain 4,603 and 41,633 prompts, respectively. Besides the mentioned sources of responses, to ensure the abundance of the dataset for comparison, we also leverage the responses generated by the Mistral series model (Meng et al., 2024; Jiang et al., 2023). For each completion pair, i.e., a prompt and one of its responses, we use the Skywork-Reward-Gemma-2-27B-v0.2 (Liu et al., 2024) model as an oracle to assign quality scores. ... Concretely, each DPO pair is synthesized under two guiding principles: Fixed Chosen, Varied Rejected... Fixed Rejected, Varied Chosen. |
| Hardware Specification | Yes | All experiments were conducted on a server with 128 CPU cores, 1024 GB memory, 96 TB SSD storage and 8 NVIDIA H20 GPUs. |
| Software Dependencies | No | For all DPO experiments, we adopt the standard DPO training pipeline using the Hugging Face framework with the following hyperparameters: Optimizer: AdamW (β1 = 0.9, β2 = 0.99) with no weight decay |
| Experiment Setup | Yes | For all DPO experiments, we adopt the standard DPO training pipeline using the Hugging Face framework with the following hyperparameters: Optimizer: AdamW (β1 = 0.9, β2 = 0.99) with no weight decay; Learning Rate: linear warmup with ratio = 0.1 to a peak of 5 × 10⁻⁷, followed by cosine decay; Batch Size: a global size of 32 via gradient accumulation over 4 steps; Duration: 2 epochs; DPO Beta: 0.1; Sequence Length: 2048; Precision: bfloat16 |
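
The schedule and batch-size figures quoted in the Experiment Setup row can be sanity-checked with a few lines of plain Python. This is a minimal sketch, not the authors' training code. The function names are hypothetical, and three things are assumptions not stated in the table: that all 8 GPUs from the Hardware Specification row take part in training, that the cosine decay ends at zero, and that one DPO pair is built per prompt.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 5e-7
WARMUP_RATIO = 0.1
GLOBAL_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 4
EPOCHS = 2

# Assumption: all 8 GPUs listed under Hardware Specification are used.
NUM_GPUS = 8


def per_device_batch_size(global_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Per-GPU micro-batch size implied by the global batch size."""
    denom = grad_accum * num_gpus
    if global_batch % denom != 0:
        raise ValueError("global batch is not divisible by grad_accum * num_gpus")
    return global_batch // denom


def total_optimizer_steps(num_pairs: int, global_batch: int, epochs: int) -> int:
    """Optimizer steps, assuming the partial final batch of each epoch is kept."""
    return math.ceil(num_pairs / global_batch) * epochs


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to peak_lr, then cosine decay (to zero, by assumption)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("per-device batch:", per_device_batch_size(GLOBAL_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_GPUS))
    # Assumption: one DPO pair per filtered UltraFeedback prompt (41,633).
    steps = total_optimizer_steps(41_633, GLOBAL_BATCH_SIZE, EPOCHS)
    print("optimizer steps:", steps)
    for s in (0, steps // 10, steps // 2, steps):
        print(s, lr_at_step(s, steps))
```

Under these assumptions the global batch of 32 corresponds to a micro-batch of one sequence per GPU, which would be consistent with a 2048-token sequence length on a single 8-GPU server; the paper excerpt itself does not confirm the per-device setting.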