Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Crucial Role of Samplers in Online Direct Preference Optimization
Authors: Ruizhe Shi, Runlong Zhou, Simon Du
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze the convergence rates of DPO with various samplers under tabular softmax parametrization, and demonstrate theoretical advantages brought by specific samplers. Specifically, we show a separation that our proposed samplers, DPO-Mix-R and DPO-Mix-P, achieve quadratic convergence rates, while the commonly used one, DPO-Unif, can only achieve linear convergence rates. Numerical simulations support our results. See Section 4. Practical improvements. We design a new sampler for practical DPO. Specifically, we employ logit mixing to align sampling distribution to our theory. LM alignment experiments show that under the same computation budget, our method demonstrates significant advantages over baselines. On Safe-RLHF dataset, our method exhibits an over 7.4% improvement over vanilla DPO. On Iterative-Prompt dataset, our method shows a 5.4% improvement over vanilla DPO. See Section 5. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs. |
| Researcher Affiliation | Academia | Ruizhe Shi (IIIS, Tsinghua University; part of the work was done while visiting the University of Washington), Runlong Zhou (University of Washington), and Simon S. Du (University of Washington). Equal contribution noted. |
| Pseudocode | No | The paper defines mathematical equations for DPO and its update rules (e.g., Definition 1 Exact DPO with Equation 5) and describes sampler designs in prose, but it does not include any explicitly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code released at this link. |
| Open Datasets | Yes | We conduct experiments on two datasets, Safe-RLHF (Ji et al., 2023a) and Iterative-Prompt (Xiong et al., 2024; Dong et al., 2024). For Safe-RLHF, we adopt a 10k subset of Ji et al. (2023a) (https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) for training, and a 2k subset as test set; for Iterative-Prompt, we adopt a 10k subset of Xiong et al. (2024); Dong et al. (2024) (RLHFlow/iterative-prompt-v1-iter1-20K) for training, and a 2k subset as test set. |
| Dataset Splits | Yes | For Safe-RLHF, we adopt a 10k subset of Ji et al. (2023a)... for training, and a 2k subset as test set; For Iterative-Prompt, we adopt a 10k subset of Xiong et al. (2024); Dong et al. (2024)... for training, and a 2k subset as test set. |
| Hardware Specification | No | The paper mentions 'Due to restricted resources, we have not evaluated on open-benchmarks (Zheng et al., 2023; Dubois et al., 2024).' but does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper refers to other codebases used ('Our codebase is mainly based on the pipeline of Xiong et al. (2024); Dong et al. (2024) (https://github.com/RLHFlow/Online-RLHF), and has referred to Shi et al. (2024) (https://github.com/srzer/MOD) for the implementation of logit mixing.'), but it does not specify any particular software dependencies with version numbers for its own implementation (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | Implementation of mixed samplers and reward margin. Given the mixing ratio set as 1 − α : α, for each prompt we add a generated pair from ① with probability 1 − α, and from ② with probability α. As stated in Eq. (8), α can be approximated as (exp(r) + exp(−r)) / (exp(r) + exp(−r) + 2). As for the reward margin r_max, unlike the common practice of Xiong et al. (2024); Dong et al. (2024), which sets r_max = 8, we set r_max = 4 for Safe-RLHF and r_max = 1 for Iterative-Prompt, to better align with the assumed BT-model setting. Therefore, we use α = 0.7 for the former and α = 1 for the latter. We did not extensively tune these hyperparameters, as our focus has been on validation of theoretical claims. Hyperparameters. The hyperparameters are borrowed from Dong et al. (2024) with minimal modifications. We train 3 iterations, with 2 epochs per iteration, GRADIENT_ACCUMULATION_STEPS = 2, and LEARNING_RATE = 5e-7. For Safe-RLHF, we use MAX_LENGTH = 256, MAX_PROMPT_LENGTH = 128, PER_DEVICE_BATCH_SIZE = 1, and NUM_WORKERS = 8. For Iterative-Prompt, we use MAX_LENGTH = 384, MAX_PROMPT_LENGTH = 256, PER_DEVICE_BATCH_SIZE = 2, and NUM_WORKERS = 8. During generation for training we set temperature τ = 0.7, while during evaluation we set τ = 0.1. |
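The mixed-sampler procedure quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `sample_pair_uniform` and `sample_pair_reweighted` are hypothetical stand-ins for the two component samplers ① and ②, and `alpha` is the dataset-specific mixing ratio reported in the paper (0.7 for Safe-RLHF, 1 for Iterative-Prompt).

```python
import random

def sample_pair_uniform(prompt):
    # Hypothetical stand-in for sampler ①: a response pair drawn
    # independently from the current policy.
    return (f"{prompt}::resp_a", f"{prompt}::resp_b")

def sample_pair_reweighted(prompt):
    # Hypothetical stand-in for sampler ②: a response pair drawn
    # from the reward-reweighted (logit-mixed) distribution.
    return (f"{prompt}::resp_c", f"{prompt}::resp_d")

def build_batch(prompts, alpha, rng=random):
    """For each prompt, add a pair from ② with probability alpha,
    otherwise from ① (i.e., a mixing ratio of 1 - alpha : alpha)."""
    batch = []
    for p in prompts:
        sampler = sample_pair_reweighted if rng.random() < alpha else sample_pair_uniform
        batch.append((p, *sampler(p)))
    return batch

# alpha = 0.7 for Safe-RLHF, 1 for Iterative-Prompt (per the paper);
# alpha = 1 degenerates to always sampling from ②.
batch = build_batch(["q1", "q2", "q3"], alpha=0.7)
```

Note that because `random.random()` returns values in [0.0, 1.0), setting `alpha=1` always selects sampler ② and `alpha=0` always selects sampler ①, matching the paper's per-dataset choices as edge cases.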