Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Preference Optimization by Estimating the Ratio of the Data Distribution

Authors: Yeongmin Kim, HeeSun Bae, Byeonghu Na, Il-chul Moon

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 4 Experiments. This section presents the empirical performance of the proposed BPO compared to prior preference optimization methods. Section 4.1 compares instances of BPO with other probabilistic loss extensions. Section 4.2 compares BPO to the state-of-the-art DPO loss variants on popular LLM benchmarks.
Researcher Affiliation Collaboration Yeongmin Kim (1), Heesun Bae (1), Byeonghu Na (1), Il-Chul Moon (1, 2); (1) Korea Advanced Institute of Science and Technology (KAIST), (2) summary.ai
Pseudocode Yes Algorithm 1 describes the detailed algorithm for implementing the general BPO objective with any valid function h. The main difference from the original DPO lies in Line 5, where the loss is computed. Code 1 provides a PyTorch implementation of Lines 4 and 5.
Open Source Code Yes Project page: https://github.com/aailab-kaist/BPO. Answer: [Yes] Justification: We provide the code with instructions in the supplemental material.
Open Datasets Yes Task & experimental setup: The experiments are conducted for single-turn dialogue generation using the Anthropic helpful and harmless (HH) dataset [6], and summarization using the Reddit TL;DR dataset [59]. We conduct experiments using Mistral-7B-Base [28], Llama-3-8B-Base, and Llama-3-8B-Instruct [18] backbone models, based on the UltraFeedback dataset [15]. AlpacaEval 2 [36] measures the win rate on 805 examples, using GPT-4 Turbo as both the judge and the opponent.
Dataset Splits Yes For dialogue generation, we use Pythia-2.8B [8] as the pre-trained LLM and perform SFT on the preferred subset of the HH dataset. For summarization, we use a publicly available SFT model [11] based on GPT-J [61]. All comparisons are conducted on the same SFT model with identical training hyperparameters (β, learning rate, batch size). See Appendix C for details. Model performance is evaluated on a held-out test set, using 100 randomly sampled examples. For the TL;DR summarization task, ... We perform preference optimization for one epoch on the 93k training subset of the comparison version of the TL;DR summarization dataset.
Hardware Specification Yes We used four 46GB NVIDIA L40S GPUs for the Pythia-2.8B experiments and four 80GB NVIDIA A100 GPUs for the other experiments, and all experiments were completed in less than 10 hours. ... We used a single NVIDIA L40S GPU for inference.
Software Dependencies Yes Code 1 contains import torch, and Appendix C.2 lists alpaca-eval==0.6.2 and vllm==0.5.4.
Experiment Setup Yes All experiments reported in Figure 1, Table 3, Table 4, Figure 4, and Figure 5 share the same default training configuration: β = 0.1, a batch size of 64, and the RMSprop optimizer with a learning rate of 5e-7. For the TL;DR summarization task, ... we set β to 0.5 for all methods... For BPO, we use SBA loss with λ = 0.5 and an Rθ clipping value of 0.025. For Mistral-7B-Base [28], ... For the DPO model ratio, we use a batch size of 64 and a learning rate of 8e-7. SBA loss is applied with λ = 0.5, and Rθ is clipped to be no smaller than 0.003.
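
The Pseudocode and Experiment Setup rows above say that BPO differs from DPO only in the line where the loss is computed, and quote a default training configuration. The sketch below is illustrative only: it records those quoted defaults in a small config object and computes the standard per-example DPO loss as the baseline that BPO modifies. It is not the authors' Code 1. The BPO function h, the SBA loss, and the Rθ clipping are not reproduced because this page does not define them. The names TrainConfig and dpo_loss, and all field and argument names, are hypothetical, and plain Python scalars are used instead of PyTorch tensors to keep the example self-contained.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Default values quoted in the Experiment Setup row (field names are hypothetical)."""
    beta: float = 0.1             # β, strength of the implicit KL regularization
    batch_size: int = 64
    optimizer: str = "RMSprop"
    learning_rate: float = 5e-7


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. This is the DPO baseline only;
    BPO replaces this final loss computation.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


cfg = TrainConfig()
example_loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, cfg.beta)
```

With a zero margin the loss equals log 2, and it decreases as the policy assigns relatively more probability to the preferred response than the reference model does.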