Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Doubly Robust Alignment for Large Language Models
Authors: Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM... Our empirical results reinforce the theoretical advantages, demonstrating DRPO's robustness to reference-policy misspecification (Table 2; Figure 4, left) and preference-model misspecification (Table 3; Figure 4, right). |
| Researcher Affiliation | Academia | Erhan Xu, Department of Statistics, LSE, London, UK; Kai Ye, Department of Statistics, LSE, London, UK; Hongyi Zhou, Department of Mathematics, Tsinghua University, Beijing, China; Luhan Zhu, School of Design, LCC, UAL, London, UK; Francesco Quinzan, Department of Engineering Science, University of Oxford, Oxford, UK; Chengchun Shi, Department of Statistics, LSE, London, UK |
| Pseudocode | Yes | Algorithm 1 Double Robust Preference Optimization |
| Open Source Code | Yes | The code is available at https://github.com/DRPO4LLM/DRPO4LLM |
| Open Datasets | Yes | In this section, we first use the IMDb dataset [152] to empirically validate the double robustness property of our preference estimator $\widehat{p}_{\mathrm{DR}}$ (Equation 9) established in Corollary 3. We next compare the proposed preference optimization algorithm (Equation 10) against baseline approaches on the Too Long; Didn't Read [TL;DR, 153] and Anthropic Helpful and Harmless [HH, 8] datasets. |
| Dataset Splits | Yes | Both models are trained for three epochs on 25,000 samples from the IMDb training set... for each of the 25,000 prefixes in the IMDb test set |
| Hardware Specification | Yes | The Preference Evaluation experiments are conducted on a machine equipped with an NVIDIA RTX 6000 Ada GPU and an AMD Ryzen Threadripper PRO 7945WX 12-core CPU. The Preference Optimization experiments are performed on a system with an H20 NVLink GPU and a 20 vCPU Intel(R) Xeon(R) Platinum 8457C processor. |
| Software Dependencies | No | For the baseline models training, we follow the framework of TRL: Transformer Reinforcement Learning [156] and Transformers: State-of-the-Art Natural Language Processing [164]. For the general preference model, we follow the framework of general-preference/general-preference-model proposed by Zhang et al. [19]. All models were trained with default hyperparameter configurations unless otherwise specified. AdamW [165] are used as default optimizer. |
| Experiment Setup | Yes | For PPO training, we search the hyperparameter over the KL coefficient β ∈ {0.05, 0.1, 0.2} and select β = 0.05 based on empirical performance... To ensure a fair comparison, we set the maximum response length to 128 for all models... We further conduct a hyperparameter search over KL coefficients β ∈ {0.05, 0.1, 0.2} and learning rates in {1e-7, 1e-6, 3e-6}. We select a KL coefficient of 0.05 combined with a learning rate of 1e-7... For both tasks, we set the clipping range to [0.04, 2.5]... The regularization parameter β is set to 0.04... The number of Monte Carlo samples |D | is set to 3 (TL;DR) or 2 (HH). |
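
The hyperparameter search quoted in the Experiment Setup row can be written out as an explicit grid. The sketch below is illustrative only: the names `KL_COEFFICIENTS`, `LEARNING_RATES`, `build_grid`, and `SELECTED` are hypothetical and do not come from the paper or its repository, and the excerpt does not say which method the joint search over β and learning rate applies to. Only the candidate values, the selected pair, and the maximum response length of 128 are taken from the quoted text.

```python
from itertools import product

# Candidate values quoted in the Experiment Setup row.
KL_COEFFICIENTS = (0.05, 0.1, 0.2)
LEARNING_RATES = (1e-7, 1e-6, 3e-6)

# Maximum response length shared by all models, per the quoted text.
MAX_RESPONSE_LENGTH = 128


def build_grid(betas=KL_COEFFICIENTS, learning_rates=LEARNING_RATES):
    """Return every (beta, learning rate) combination as a config dict."""
    return [
        {
            "beta": beta,
            "learning_rate": lr,
            "max_response_length": MAX_RESPONSE_LENGTH,
        }
        for beta, lr in product(betas, learning_rates)
    ]


# Configuration the quoted text reports as selected after the search.
SELECTED = {
    "beta": 0.05,
    "learning_rate": 1e-7,
    "max_response_length": MAX_RESPONSE_LENGTH,
}

grid = build_grid()

if __name__ == "__main__":
    for config in grid:
        marker = "  <- selected" if config == SELECTED else ""
        print(config, marker)
```

Enumerating the grid this way makes the search budget explicit: three KL coefficients times three learning rates gives nine training runs per task, of which the reported configuration is one.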