Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

What Matters in Data for DPO?

Authors: Yu Pan, Zhongze Cai, Huaiyang Zhong, Guanting Chen, Chonghuan Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. ... Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data.
Researcher Affiliation | Academia | 1 University of Sydney; 2 Imperial College London; 3 University of North Carolina at Chapel Hill; 4 Virginia Tech; 5 University of Texas at Dallas
Pseudocode | Yes | Algorithm 1: On-Policy Data Generation Process
Open Source Code | Yes | Our code is publicly available on GitHub (WhatMatter4DPOData) and the data is hosted on Hugging Face.
Open Datasets | Yes | According to the data recipe of our base model, we select two public datasets, LAION-AI's OpenAssistant 2 (Köpf et al., 2023) and OpenBMB's UltraFeedback (Cui et al., 2023), as the prompt datasets for our DPO training.
Dataset Splits | Yes | After filtering, the OpenAssistant 2 and UltraFeedback datasets contain 4,603 and 41,633 prompts, respectively. Besides the mentioned sources of responses, to ensure the abundance of the dataset for comparison, we also leverage the responses generated by the Mistral series model (Meng et al., 2024; Jiang et al., 2023). For each completion pair, i.e., a prompt and one of its responses, we use the Skywork-Reward-Gemma-2-27B-v0.2 (Liu et al., 2024) model as an oracle to assign quality scores. ... Concretely, each DPO pair is synthesized under two guiding principles: Fixed Chosen, Varied Rejected ... Fixed Rejected, Varied Chosen.
Hardware Specification | Yes | All experiments were conducted on a server with 128 CPU cores, 1024 GB memory, 96 TB SSD storage and 8 NVIDIA H20 GPUs.
Software Dependencies | No | For all DPO experiments, we adopt the standard DPO training pipeline using the Huggingface framework with the following hyperparameters: Optimizer: AdamW (β1 = 0.9, β2 = 0.99) with no weight decay
Experiment Setup | Yes | For all DPO experiments, we adopt the standard DPO training pipeline using the Huggingface framework with the following hyperparameters: Optimizer: AdamW (β1 = 0.9, β2 = 0.99) with no weight decay; Learning Rate: Linear warmup with ratio = 0.1 to a peak of 5 × 10⁻⁷, followed by cosine decay; Batch Size: A global size of 32 via gradient accumulation over 4 steps; Duration: 2 epochs; DPO Beta: 0.1; Sequence Length: 2048; Precision: bfloat16
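
The quoted Experiment Setup can be restated as a small configuration sketch, which may help when checking a reimplementation against the reported values. This is an illustration, not the authors' code. The names `DPOSetup`, `lr_at`, and `per_device_batch` are hypothetical. The sketch assumes the cosine decay ends at zero and that all 8 GPUs from the Hardware Specification row take part in training; the summary states neither.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOSetup:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    peak_lr: float = 5e-7            # peak learning rate after warmup
    warmup_ratio: float = 0.1        # fraction of steps used for linear warmup
    adam_betas: tuple = (0.9, 0.99)  # AdamW beta1, beta2
    weight_decay: float = 0.0        # "no weight decay"
    global_batch_size: int = 32
    grad_accum_steps: int = 4
    epochs: int = 2
    dpo_beta: float = 0.1
    max_seq_len: int = 2048
    precision: str = "bfloat16"


def lr_at(step: int, total_steps: int, cfg: DPOSetup = DPOSetup()) -> float:
    """Linear warmup to peak_lr, then cosine decay.

    Assumption: the decay reaches zero at total_steps; the source only
    says "followed by cosine decay".
    """
    warmup_steps = max(1, int(cfg.warmup_ratio * total_steps))
    if step < warmup_steps:
        return cfg.peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cfg.peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def per_device_batch(cfg: DPOSetup, num_gpus: int) -> int:
    """Per-GPU micro-batch size implied by the global batch size."""
    denom = cfg.grad_accum_steps * num_gpus
    if cfg.global_batch_size % denom != 0:
        raise ValueError("global batch size is not divisible by accum steps * GPUs")
    return cfg.global_batch_size // denom


if __name__ == "__main__":
    cfg = DPOSetup()
    total = 1000  # illustrative step count, not taken from the paper
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s, total, cfg))
    # Assumption: all 8 GPUs are used for data-parallel training.
    print("per-device batch:", per_device_batch(cfg, num_gpus=8))
```

Under these assumptions, a global batch of 32 with 4 accumulation steps on 8 GPUs implies a per-device micro-batch of 1. A different GPU count would change that figure but not the global batch size.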