Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Improving LLM General Preference Alignment via Optimistic Online Mirror Descent

Authors: Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, Dong Yu

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | More importantly, we implement our method and show through experiments that it outperforms state-of-the-art RLHF algorithms across multiple representative benchmarks. (...) 5 Experiments |
| Researcher Affiliation | Collaboration | Yuheng Zhang (UIUC), Dian Yu (Tencent AI Lab), Tao Ge (Tencent AI Lab), Linfeng Song (Tencent AI Lab), Zhichen Zeng (UIUC), Haitao Mi (Tencent AI Lab), Nan Jiang (UIUC), Dong Yu (Tencent AI Lab) |
| Pseudocode | Yes | Algorithm 1 Implementation of ONPO |
| Open Source Code | Yes | We have uploaded our codes. |
| Open Datasets | Yes | For the general preference oracle, we use a pairwise preference model [footnote 4], which demonstrates better performance compared to the BT reward model [Zhang et al., 2024]. Training details for the preference model are available in Dong et al. [2024]. We evaluate the models on three representative benchmarks: AlpacaEval 2.0 [Li et al., 2023a], Arena-Hard [Li et al., 2024] and MT-Bench [Zheng et al., 2024]. Footnote 4: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B ; Footnote 5: https://huggingface.co/datasets/RLHFlow/prompt-collection-v0.1 |
| Dataset Splits | No | The paper describes an online data generation process for training and then evaluates on specific benchmarks. It does not provide explicit training/validation/test dataset splits in percentages or sample counts for the training data. |
| Hardware Specification | Yes | All experiments are conducted on 8 A100 GPUs with 40GB memory each. |
| Software Dependencies | No | The paper states that codes are uploaded but does not explicitly mention any specific software dependencies with version numbers (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | For the implementation of ONPO, we follow the hyperparameters in Dong et al. [2024], including the cosine learning rate scheduler with a peak learning rate of 5 × 10⁻⁷, a 0.03 warm-up ratio, and a global batch size of 128. We use a grid search for 1/η over [0.1, 0.05, 0.02, 0.01, 0.005] and set 1/η = 0.01. Llama-3-SFT is trained for 5 iterations, where in each iteration π t is trained for 2 epochs and πt for 1 epoch. While Mistral-Instruct, having already undergone instruction fine-tuning, is thereby trained for 3 iterations, with π t trained for 1 epoch and πt for 2 epochs in each iteration. |
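
The learning rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name lr_at_step, the linear shape of the warm-up, and the decay all the way to zero are assumptions. Only the peak learning rate (5 × 10⁻⁷) and the 0.03 warm-up ratio are taken from the quoted text.

```python
import math

PEAK_LR = 5e-7        # peak learning rate quoted in the Experiment Setup row
WARMUP_RATIO = 0.03   # warm-up ratio quoted in the same row


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    Assumption: linear warm-up from 0 to peak_lr over the first
    warmup_ratio fraction of steps, then cosine decay from peak_lr to 0.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, peak_lr at step == warmup_steps.
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from 1 (progress = 0) down to 0 (progress = 1).
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical run of 1000 optimizer steps, the first 30 steps are warm-up, the rate reaches its peak at step 30, and it decays toward zero by the final step. The actual number of steps per iteration is not stated in the summary above.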