Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On Extending Direct Preference Optimization to Accommodate Ties
Authors: Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, Chenxu Lyu, Bill Byrne
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. We provide a theoretical explanation for this regularization effect using ideal DPO policy theory. We further show performance improvements over DPO in translation and mathematical reasoning using our DPO variants. |
| Researcher Affiliation | Academia | Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, Chenxu Lyu, Bill Byrne Department of Engineering University of Cambridge Cambridge, United Kingdom CB2 1PZ EMAIL |
| Pseudocode | No | The paper describes the methodology in prose and mathematical derivations (e.g., Section 2 Methodology, Appendix B Mathematical Derivations) without presenting structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/EriChen0615/DPO-RKD. |
| Open Datasets | Yes | For Neural Machine Translation (NMT) on WMT-21 ZH-EN [Akhbardeh et al., 2021] and IWSLT-17 FR-EN [Cettolo et al., 2017], we rank responses by BLEURT, a widely-used reference-based quality metric [Sellam et al., 2020, Freitag et al., 2023]. For Summarization on TL;DR [Stiennon et al., 2020], we rank responses using the implicit reward function learned by DPO itself, without an external reward model. Appendix E gives experiment details. Studies of these selection strategies can be found in Appendix F.7 and F.8. |
| Dataset Splits | Yes | For each of the three tasks we form two training sets: CP, which contains the Clear Preference Pairs; and CP+TP, which contains both the Clear Preference Pairs and the Tied Pairs. We refer to DPO training on these sets as DPO(CP) and DPO(CP+TP). ... For each source sentence, the translations are ranked by their BLEURT scores computed with respect to the reference translations. The highest and lowest scoring translations form the Clear Preference Pairs; for each source sentence, these are the two translations with the greatest difference in BLEURT score. By contrast, we take the Tied Pairs as the two non-identical translations with the minimum absolute BLEURT difference; the translation with higher BLEURT is labeled as the winner of each Tied Pair. This yields ca. 16K CPs and TPs for use in DPO. The same procedure is applied to the IWSLT17 validation set, yielding ca. 800 CPs and TPs for use as DPO preference datasets. |
| Hardware Specification | Yes | All NMT experiments are run on two Nvidia A100-80G GPUs with an effective batch size of 4. ... All summarization experiments are run on two Nvidia A100-40G GPUs with an effective batch size of 64. |
| Software Dependencies | No | The paper mentions software components implicitly (e.g., using RMSProp optimizer, which is part of ML frameworks like PyTorch), but it does not specify versions for any programming languages or libraries. |
| Experiment Setup | Yes | We use the RMSProp optimizer with the learning rate set to 5e-7 and the number of warm-up steps set to 150. All NMT experiments are run on two Nvidia A100-80G GPUs with an effective batch size of 4. We used FP32 for training the policy. The log-probabilities from the reference model are pre-computed with FP32 precision. |
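
The pair-construction procedure quoted under Dataset Splits can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function name `select_pairs` and the input layout (a list of `(translation, score)` tuples for one source sentence, with BLEURT scores already computed) are assumptions made for illustration only.

```python
def select_pairs(candidates):
    """Pick one Clear Preference Pair and one Tied Pair for a source sentence.

    `candidates` is a list of (translation, score) tuples, where the score
    is assumed to be a precomputed BLEURT value (hypothetical input format).
    Each returned pair is (winner, loser); returns (None, None) when fewer
    than two distinct translations are available.
    """
    # Drop verbatim duplicates so that a Tied Pair is never two identical strings.
    unique = {}
    for text, score in candidates:
        unique.setdefault(text, score)

    # Rank distinct translations from highest to lowest score.
    ranked = sorted(unique.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return None, None

    # Clear Preference Pair: greatest score difference, i.e. best vs. worst.
    clear_pair = (ranked[0][0], ranked[-1][0])

    # Tied Pair: smallest absolute score difference. In a sorted list the
    # closest two scores are always adjacent, so only neighbours are compared.
    # The higher-scoring translation of the pair is labeled the winner.
    upper, lower = min(zip(ranked, ranked[1:]), key=lambda p: p[0][1] - p[1][1])
    tied_pair = (upper[0], lower[0])

    return clear_pair, tied_pair
```

Under these assumptions, the union of all Clear Preference Pairs would correspond to the CP training set, and adding the Tied Pairs would give CP+TP. When two distinct translations have exactly equal scores, this sketch picks the winner by sort order, which the quoted description does not specify.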