Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

KL Penalty Control via Perturbation for Direct Preference Optimization

Authors: Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that the instance-level adaptive criterion of ε-DPO remarkably improves DPO, better than β-DPO and TR-DPO, to outperform most direct alignment algorithms that modify the DPO objective function [43, 45, 3, 42, 12, 18, 31, 29]. In this section, we conduct experiments to validate the ε-DPO. We mainly check the feasibility of ε-DPO for general chatbot alignment using UltraFeedback [9], compared to the direct alignment algorithms [32, 43, 45, 3, 42, 12, 18, 31, 29].
Researcher Affiliation Collaboration Sangkyu Lee¹, Janghoon Han², Hosung Song², Stanley Jungkyu Choi², Honglak Lee²,³, Youngjae Yu⁴. ¹Yonsei University, ²LG AI Research, ³University of Michigan, Ann Arbor, ⁴Seoul National University
Pseudocode Yes Algorithm 1: ε-Direct Preference Optimization. Require: policy πθ, reference policy πref, initial KL penalty coefficient β, and perturbation size ε. 1: while not converged do 2: Sample training batch of preference triplet (x, yw, yl) ∼ D. 3: Estimate the policies under the perturbation π̂θ(β_ε^−) and π̂θ(β_ε^+) according to 3 and 4. 4: Determine instance-level KL penalty coefficients β(x, yw, yl; θ) according to 5. 5: Update πθ by L_DPO with β(x, yw, yl; θ) and then β ← E_{x,yw,yl}[β(x, yw, yl; θ)]. 6: end while 7: return aligned policy πθ.
Open Source Code Yes The code is available at github.com/oddqueue/e-dpo.
Open Datasets Yes UltraFeedback [9] is an AI feedback dataset where GPT-4 [1] rates responses obtained from four different language models. Anthropic-HH [4] is a human preference dialogue dataset containing two subsets based on the helpfulness and harmlessness principle. We use corresponding datasets publicly released by SimPO, each denoted as mistral-instruct-ultrafeedback and llama3-ultrafeedback. (Footnote 5: huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback, MIT License; Footnote 6: huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback, MIT License)
Dataset Splits Yes Here, we use helpful-base and harmless-base splits to validate the criterion using logit monotonicity for instance-level β control used in ε-DPO and the efficiency in terms of trade-off between performance and KL divergence [33]. We regard PairRM [21] as an external evaluator for checking performance by win rate, comparing their responses and chosen responses in the test splits.
Hardware Specification Yes Every experiment is conducted using 16 NVIDIA A100-SXM4-40GB GPUs within 2 hours. Every experiment is conducted using 4 NVIDIA A100-SXM4-40GB GPUs within 7 hours.
Software Dependencies No The implementation of ε-DPO and experiments are all based on the TRL library. (Footnote 2: github.com/huggingface/trl, Apache 2.0 License) This mentions the library but gives no specific version number for TRL or other core dependencies.
Experiment Setup Yes Table 6: Training configurations for Mistral-Instruct and Llama-3-Instruct using Ultrafeedback [9]. The underline indicates the best value selected through hyperparameter search.
Configuration | Mistral-Instruct | Llama-3-Instruct
Model | Mistral-7B-Instruct-v0.2 | Meta-Llama-3-8B-Instruct
Dataset | mistral-instruct-ultrafeedback | llama3-ultrafeedback
Optimizer | AdamW | AdamW
Epoch | 1 | 1
Batch Size | 128 | 128
Learning Rate | [3e-7, 5e-7, 7e-7, 1e-6] | [3e-7, 5e-7, 7e-7, 1e-6]
Scheduler | cosine | cosine
Warm-up Ratio | 0.1 | 0.1
Weight Decay | 0 | 0
β | 0.01 | 0.01
ε | [0.005, 0.01, 0.02] | [0.005, 0.01, 0.02]
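
Step 5 of Algorithm 1 in the Pseudocode row above applies the DPO loss with a per-instance coefficient and then resets β to the batch mean of those coefficients. The following is a minimal pure-Python sketch of that single step, assuming the standard DPO objective, -log σ(β · [(log πθ(yw|x) - log πref(yw|x)) - (log πθ(yl|x) - log πref(yl|x))]). The function names, the tuple layout, and the example numbers are hypothetical. How each instance-level β is chosen (steps 3 and 4) is not reproduced here, and the authors' implementation in the linked repository may differ.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log sigma(z) = min(z, 0) - log(1 + exp(-|z|))
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Standard DPO loss for one preference triplet (assumed objective)."""
    # Difference of policy-vs-reference log-ratios for chosen and rejected responses.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def step5(batch, instance_betas):
    """Mean DPO loss with per-instance betas, plus the next global beta.

    batch: list of (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples (hypothetical layout).
    instance_betas: one coefficient per triplet, taken as given from step 4.
    """
    losses = [dpo_loss(b, *ex) for ex, b in zip(batch, instance_betas)]
    mean_loss = sum(losses) / len(losses)
    # beta <- batch estimate of E[beta(x, yw, yl; theta)]
    next_beta = sum(instance_betas) / len(instance_betas)
    return mean_loss, next_beta


# Hypothetical sequence log-probabilities and per-instance betas, for illustration only.
example_batch = [
    (-12.0, -15.0, -13.0, -14.0),
    (-20.0, -19.0, -20.5, -19.5),
    (-8.0, -9.0, -8.0, -9.0),
]
example_betas = [0.009, 0.010, 0.011]
example_loss, example_next_beta = step5(example_batch, example_betas)
```

A gradient step on `mean_loss` would update πθ; only the scalar bookkeeping is shown here, so no autodiff library is needed to run the sketch.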
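
Table 6 above lists two searched hyperparameters (learning rate and ε) alongside fixed settings. The sketch below enumerates the candidate runs in Python, assuming the search is a full grid over both lists; the table does not state how the search was organized. The dictionary keys and the `candidate_runs` helper are hypothetical names, not TRL arguments.

```python
from itertools import product

# Fixed settings shared by both models, copied from Table 6.
FIXED = {
    "optimizer": "AdamW",
    "epochs": 1,
    "batch_size": 128,
    "scheduler": "cosine",
    "warmup_ratio": 0.1,
    "weight_decay": 0.0,
    "beta": 0.01,
}

# Model name -> dataset name, as paired in Table 6.
MODELS = {
    "Mistral-7B-Instruct-v0.2": "mistral-instruct-ultrafeedback",
    "Meta-Llama-3-8B-Instruct": "llama3-ultrafeedback",
}

# Searched values from Table 6.
LEARNING_RATES = [3e-7, 5e-7, 7e-7, 1e-6]
EPSILONS = [0.005, 0.01, 0.02]


def candidate_runs():
    """Enumerate one config dict per (model, learning rate, epsilon) combination.

    Assumes a full grid search, which Table 6 does not confirm.
    """
    runs = []
    for (model, dataset), lr, eps in product(MODELS.items(), LEARNING_RATES, EPSILONS):
        runs.append({**FIXED, "model": model, "dataset": dataset,
                     "learning_rate": lr, "epsilon": eps})
    return runs
```

Under the full-grid assumption this gives 4 × 3 = 12 candidate runs per model.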