Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

What Matters in Data for DPO?

Authors: Yu Pan, Zhongze Cai, Huaiyang Zhong, Guanting Chen, Chonghuan Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. ... Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data.
Researcher Affiliation	Academia	1 University of Sydney; 2 Imperial College London; 3 University of North Carolina at Chapel Hill; 4 Virginia Tech; 5 University of Texas at Dallas
Pseudocode	Yes	Algorithm 1 On-Policy Data Generation Process
Open Source Code	Yes	Our code is publicly available on GitHub (What Matter4DPOData) and the data is hosted on Hugging Face.
Open Datasets	Yes	According to the data recipe of our base model, we select two public datasets, LAION-AI's OpenAssistant 2 (Köpf et al., 2023) and OpenBMB's UltraFeedback (Cui et al., 2023), as the prompt datasets for our DPO training.
Dataset Splits	Yes	After filtering, the OpenAssistant 2 and UltraFeedback datasets contain 4,603 and 41,633 prompts, respectively. Besides the mentioned sources of responses, to ensure the abundance of the dataset for comparison, we also leverage the responses generated by the Mistral series model (Meng et al., 2024; Jiang et al., 2023). For each completion pair, i.e., a prompt and one of its responses, we use the Skywork-Reward-Gemma-2-27B-v0.2 (Liu et al., 2024) model as an oracle to assign quality scores. ... Concretely, each DPO pair is synthesized under two guiding principles: Fixed Chosen, Varied Rejected... Fixed Rejected, Varied Chosen.
Hardware Specification	Yes	All experiments were conducted on a server with 128 CPU cores, 1024 GB memory, 96 TB SSD storage and 8 NVIDIA H20 GPUs.
Software Dependencies	No	For all DPO experiments, we adopt the standard DPO training pipeline using the Huggingface framework with the following hyperparameters: Optimizer: AdamW (β1 = 0.9, β2 = 0.99) with no weight decay
Experiment Setup	Yes	For all DPO experiments, we adopt the standard DPO training pipeline using the Huggingface framework with the following hyperparameters: Optimizer: AdamW (β1 = 0.9, β2 = 0.99) with no weight decay; Learning Rate: Linear warmup with ratio = 0.1 to a peak of 5 × 10⁻⁷, followed by cosine decay; Batch Size: A global size of 32 via gradient accumulation over 4 steps; Duration: 2 epochs; DPO Beta: 0.1; Sequence Length: 2048; Precision: bfloat16
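
The learning-rate schedule and batch-size arithmetic quoted in the Experiment Setup row can be sketched in plain Python. This is an illustration only, not the authors' training code: the function names are hypothetical, the decay is assumed to end at a learning rate of zero, and the warmup length is assumed to be the warmup ratio times the total number of optimizer steps.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 5e-7
WARMUP_RATIO = 0.1
GLOBAL_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 4
NUM_EPOCHS = 2


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then cosine decay.

    Assumption: the cosine phase decays all the way to zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def samples_per_forward_pass(global_batch=GLOBAL_BATCH_SIZE, accum=GRAD_ACCUM_STEPS):
    """Preference pairs processed (summed over all devices) per accumulation step."""
    return global_batch // accum


def total_optimizer_steps(num_pairs, num_epochs=NUM_EPOCHS, global_batch=GLOBAL_BATCH_SIZE):
    """Optimizer updates for a dataset of num_pairs preference pairs."""
    return math.ceil(num_pairs / global_batch) * num_epochs


if __name__ == "__main__":
    # Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
    steps = total_optimizer_steps(num_pairs=3200)
    print(steps, samples_per_forward_pass(), lr_at_step(steps // 10, steps))
```

With a global batch of 32 accumulated over 4 steps, each forward/backward pass covers 8 pairs in total; how those 8 are divided across the 8 GPUs listed under Hardware Specification is not stated in the quoted text.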