Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Doubly Robust Alignment for Large Language Models

Authors: Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM ... Our empirical results reinforce the theoretical advantages, demonstrating DRPO's robustness to reference-policy misspecification (Table 2; Figure 4, left) and preference-model misspecification (Table 3; Figure 4, right). |
| Researcher Affiliation | Academia | Erhan Xu, Department of Statistics, LSE, London, UK; Kai Ye, Department of Statistics, LSE, London, UK; Hongyi Zhou, Department of Mathematics, Tsinghua University, Beijing, China; Luhan Zhu, School of Design, LCC, UAL, London, UK; Francesco Quinzan, Department of Engineering Science, University of Oxford, Oxford, UK; Chengchun Shi, Department of Statistics, LSE, London, UK |
| Pseudocode | Yes | Algorithm 1 Double Robust Preference Optimization |
| Open Source Code | Yes | The code is available at https://github.com/DRPO4LLM/DRPO4LLM |
| Open Datasets | Yes | In this section, we first use the IMDb dataset [152] to empirically validate the double robustness property of our preference estimator p̂_DR (Equation 9) established in Corollary 3. We next compare the proposed preference optimization algorithm (Equation 10) against baseline approaches on the Too Long; Didn't Read [TL;DR, 153] and Anthropic Helpful and Harmless [HH, 8] datasets. |
| Dataset Splits | Yes | Both models are trained for three epochs on 25,000 samples from the IMDb training set... for each of the 25,000 prefixes in the IMDb test set |
| Hardware Specification | Yes | The Preference Evaluation experiments are conducted on a machine equipped with an NVIDIA RTX 6000 Ada GPU and an AMD Ryzen Threadripper PRO 7945WX 12-core CPU. The Preference Optimization experiments are performed on a system with an H20 NVLink GPU and a 20 vCPU Intel(R) Xeon(R) Platinum 8457C processor. |
| Software Dependencies | No | For the baseline models training, we follow the framework of TRL: Transformer Reinforcement Learning [156] and Transformers: State-of-the-Art Natural Language Processing [164]. For the general preference model, we follow the framework of general-preference/general-preference-model proposed by Zhang et al. [19]. All models were trained with default hyperparameter configurations unless otherwise specified. AdamW [165] are used as default optimizer. |
| Experiment Setup | Yes | For PPO training, we search the hyperparameter over the KL coefficient β ∈ {0.05, 0.1, 0.2} and select β = 0.05 based on empirical performance... To ensure a fair comparison, we set the maximum response length to 128 for all models... We further conduct a hyperparameter search over KL coefficients β ∈ {0.05, 0.1, 0.2} and learning rates in {1e-7, 1e-6, 3e-6}. We select a KL coefficient of 0.05 combined with a learning rate of 1e-7... For both tasks, we set the clipping range to [0.04, 2.5]... The regularization parameter β is set to 0.04... The number of Monte Carlo samples \|D \| is set to 3 (TL;DR) or 2 (HH). |
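
The second hyperparameter search quoted in the Experiment Setup row covers three KL coefficients and three learning rates, which gives nine candidate configurations. The sketch below enumerates that grid using only the values quoted above. The names `build_grid`, `select_best`, and the `evaluate` callback are hypothetical and introduced here for illustration; the excerpt does not say which metric the authors used to pick the final configuration, so the selection criterion is left to the caller.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
KL_COEFFICIENTS = [0.05, 0.1, 0.2]
LEARNING_RATES = [1e-7, 1e-6, 3e-6]


def build_grid(kl_coefficients, learning_rates):
    """Return every (beta, learning_rate) combination as a config dict."""
    return [
        {"beta": beta, "learning_rate": lr}
        for beta, lr in product(kl_coefficients, learning_rates)
    ]


def select_best(grid, evaluate):
    """Return the config with the highest score.

    `evaluate` is a hypothetical caller-supplied function mapping a config
    dict to a number; the source does not specify the selection metric.
    """
    return max(grid, key=evaluate)


grid = build_grid(KL_COEFFICIENTS, LEARNING_RATES)
print(len(grid))  # 9 candidate configurations
```

The configuration reported as selected in the quoted text, a KL coefficient of 0.05 with a learning rate of 1e-7, corresponds to the entry `{"beta": 0.05, "learning_rate": 1e-7}` in this grid.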