Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Crucial Role of Samplers in Online Direct Preference Optimization
Authors: Ruizhe Shi, Runlong Zhou, Simon Du
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze the convergence rates of DPO with various samplers under tabular softmax parametrization, and demonstrate theoretical advantages brought by specific samplers. Specifically, we show a separation that our proposed samplers, DPO-Mix-R and DPO-Mix-P, achieve quadratic convergence rates, while the commonly used one, DPO-Unif, can only achieve linear convergence rates. Numerical simulations support our results. See Section 4. Practical improvements. We design a new sampler for practical DPO. Specifically, we employ logit mixing to align sampling distribution to our theory. LM alignment experiments show that under the same computation budget, our method demonstrates significant advantages over baselines. On Safe-RLHF dataset, our method exhibits an over 7.4% improvement over vanilla DPO. On Iterative-Prompt dataset, our method shows a 5.4% improvement over vanilla DPO. See Section 5. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs. |
| Researcher Affiliation | Academia | Ruizhe Shi (IIIS, Tsinghua University; part of the work was done while visiting the University of Washington), Runlong Zhou (University of Washington), and Simon S. Du (University of Washington). Equal contribution noted. |
| Pseudocode | No | The paper defines mathematical equations for DPO and its update rules (e.g., Definition 1 Exact DPO with Equation 5) and describes sampler designs in prose, but it does not include any explicitly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code released at this link. |
| Open Datasets | Yes | We conduct experiments on two datasets, Safe-RLHF (Ji et al., 2023a) and Iterative-Prompt (Xiong et al., 2024; Dong et al., 2024). For Safe-RLHF, we adopt a 10k subset of Ji et al. (2023a) (https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) for training, and a 2k subset as test set; for Iterative-Prompt, we adopt a 10k subset of Xiong et al. (2024); Dong et al. (2024) (RLHFlow/iterative-prompt-v1-iter1-20K) for training, and a 2k subset as test set. |
| Dataset Splits | Yes | For Safe-RLHF, we adopt a 10k subset of Ji et al. (2023a)... for training, and a 2k subset as test set; For Iterative-Prompt, we adopt a 10k subset of Xiong et al. (2024); Dong et al. (2024)... for training, and a 2k subset as test set. |
| Hardware Specification | No | The paper mentions 'Due to restricted resources, we have not evaluated on open-benchmarks (Zheng et al., 2023; Dubois et al., 2024).' but does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper refers to other codebases used ('Our codebase is mainly based on the pipeline of Xiong et al. (2024); Dong et al. (2024) (https://github.com/RLHFlow/Online-RLHF), and has referred to Shi et al. (2024) (https://github.com/srzer/MOD) for the implementation of logit mixing.'), but it does not specify any particular software dependencies with version numbers for its own implementation (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | Implementation of mixed samplers and reward margin. Given the mixing ratio set as 1 − α : α, for each prompt we add a generated pair from ① with probability 1 − α, and from ② with probability α. As stated in Eq. (8), α can be approximated as (exp(r) + exp(−r)) / (exp(r) + exp(−r) + 2). As for the reward margin r_max, unlike the common practice of Xiong et al. (2024); Dong et al. (2024), which sets r_max = 8, we set r_max = 4 for Safe-RLHF and r_max = 1 for Iterative-Prompt, to better align with the assumed BT-model setting. Therefore, we use α = 0.7 for the former and α = 1 for the latter. We did not extensively tune these hyperparameters, as our focus has been on validation of theoretical claims. Hyperparameters. The hyperparameters are borrowed from Dong et al. (2024) with minimal modifications. We train 3 iterations, with 2 epochs per iteration, GRADIENT_ACCUMULATION_STEPS = 2, and LEARNING_RATE = 5e-7. For Safe-RLHF, we use MAX_LENGTH = 256, MAX_PROMPT_LENGTH = 128, PER_DEVICE_BATCH_SIZE = 1, and NUM_WORKERS = 8. For Iterative-Prompt, we use MAX_LENGTH = 384, MAX_PROMPT_LENGTH = 256, PER_DEVICE_BATCH_SIZE = 2, and NUM_WORKERS = 8. During generation for training we set temperature τ = 0.7, while during evaluation we set τ = 0.1. |
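The mixed-sampler procedure quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `sample_pair_uniform` and `sample_pair_reweighted` are hypothetical stand-ins for the two component samplers ① and ②, and `alpha` is the dataset-specific mixing ratio reported in the paper (0.7 for Safe-RLHF, 1 for Iterative-Prompt).

```python
import random

def sample_pair_uniform(prompt):
    # Hypothetical stand-in for sampler ①: a response pair drawn
    # independently from the current policy.
    return (f"{prompt}::resp_a", f"{prompt}::resp_b")

def sample_pair_reweighted(prompt):
    # Hypothetical stand-in for sampler ②: a response pair drawn
    # from the reward-reweighted (logit-mixed) distribution.
    return (f"{prompt}::resp_c", f"{prompt}::resp_d")

def build_batch(prompts, alpha, rng=random):
    """For each prompt, add a pair from ② with probability alpha,
    otherwise from ① (i.e., a mixing ratio of 1 - alpha : alpha)."""
    batch = []
    for p in prompts:
        sampler = sample_pair_reweighted if rng.random() < alpha else sample_pair_uniform
        batch.append((p, *sampler(p)))
    return batch

# alpha = 0.7 for Safe-RLHF, 1 for Iterative-Prompt (per the paper);
# alpha = 1 degenerates to always sampling from ②.
batch = build_batch(["q1", "q2", "q3"], alpha=0.7)
```

Note that because `random.random()` returns values in [0.0, 1.0), setting `alpha=1` always selects sampler ② and `alpha=0` always selects sampler ①, matching the paper's per-dataset choices as edge cases.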