Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, Alpha DPO achieves state-of-the-art performance on Alpaca Eval 2 (58.7% LC win rate) and Arena-Hard (35.7% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training.
Researcher Affiliation Collaboration 1Mo E Key Lab of BIPC, University of Science and Technology of China 2Alibaba Group. Correspondence to: Xiang Wang <EMAIL>, Xiangnan He <EMAIL>.
Pseudocode No The paper describes methods in prose and mathematical formulations within the "Proposed Method: Alpha DPO" section, but does not contain a dedicated pseudocode block or algorithm listing.
Open Source Code Yes The code is available at https://github.com/junkangwu/alpha-DPO.
Open Datasets Yes For a fair comparison, we use the same training data as Sim PO: princeton-nlp/llama3-ultrafeedback-armorm1, princeton-nlp/mistral-instruct-ultrafeedback2, and princeton-nlp/gemma2-ultrafeedback-armorm 3 for Llama3-8B, Mistral2-7B, and Gemma2-9B, respectively. Additionally, the v0.2 Llama3-Instruct setup uses RLHFlow/Armo RM-Llama3-8B-v0.1 (Wang et al., 2024b)
Dataset Splits No The paper states, 'For a fair comparison, we use the same training data as Sim PO: princeton-nlp/llama3-ultrafeedback-armorm1, princeton-nlp/mistral-instruct-ultrafeedback2, and princeton-nlp/gemma2-ultrafeedback-armorm 3 for Llama3-8B, Mistral2-7B, and Gemma2-9B, respectively.' However, it does not explicitly provide details about training, validation, or test dataset splits for these datasets.
Hardware Specification Yes All training experiments presented in this paper were conducted using 8 A100 GPUs, as per the procedures detailed in the alignment-handbook repository.
Software Dependencies No The paper mentions 'Adam was used as the optimizer (Kingma & Ba, 2014)' but does not provide specific version numbers for any key software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup Yes For other parameters, we used a consistent batch size of 128 across all methods. The learning rate was searched within the range of [3e-7, 5e-7, 8e-7, 1e-6], and all models were trained for a single epoch with a cosine learning rate schedule and a 10% warmup phase. Adam was used as the optimizer (Kingma & Ba, 2014). Additionally, the maximum sequence length was set to 2048. Table 4 outlines the hyperparameters used for Alpha DPO under various settings.