Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Authors: Simon Matrenok, Skander Moalla, Caglar Gulcehre

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirically, QRPO consistently achieves top performance on chat and coding evaluations (reward model scores, AlpacaEval 2, and LeetCode) compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. |
| Researcher Affiliation | Academia | Simon Matrenok (CLAIRE, EPFL); Skander Moalla (CLAIRE, EPFL); Caglar Gulcehre (CLAIRE, EPFL) |
| Pseudocode | Yes | Algorithm 1 (Offline) Quantile Reward Policy Optimization |
| Open Source Code | Yes | We release a reference implementation of QRPO based on a Hugging Face TRL (von Werra et al., 2020) trainer also supporting the baselines we trained (DPO, REBEL, SimPO) at https://github.com/CLAIRE-Labo/quantile-reward-policy-optimization. |
| Open Datasets | Yes | Magpie-Align/Magpie-Air-DPO-100K-v0.1 (Magpie-Air) (Xu et al., 2024b) is a strong alignment dataset containing 98000 training samples (and 2000 testing samples)... HuggingFaceH4/ultrafeedback_binarized (UltraFeedback) (Cui et al., 2024) consists of 61135 training samples (and 2000 testing samples)... For the coding task, we use newfacade/LeetCodeDataset (LeetCode) (Xia et al., 2025), which contains 2641 training samples (and 228 testing samples)... |
| Dataset Splits | Yes | Magpie-Air (Xu et al., 2024b), General Chat, 98,000 training, 2,000 testing... UltraFeedback (Cui et al., 2024), General Chat, 61,135 training, 2,000 testing... LeetCode (Xia et al., 2025), Coding, 2,641 training, 228 testing... We additionally split the test set in each dataset into two equal splits: a validation subset for hyperparameter selection and a test subset to report the final numbers. |
| Hardware Specification | Yes | We distribute training with Hugging Face Accelerate (Gugger et al., 2022) using the DeepSpeed (Aminabadi et al., 2022) plugin at ZeRO stage 1 (optimizer state partitioning) over 8 GPUs from 2 nodes with 4 NVIDIA GH200 each. |
| Software Dependencies | No | The paper mentions Hugging Face Accelerate (Gugger et al., 2022), DeepSpeed (Aminabadi et al., 2022), Hugging Face TRL (von Werra et al., 2020), and Python Machine Learning Research Template (Moalla, 2025). While it cites papers for these tools with publication years, it does not provide specific version numbers for the software dependencies themselves. |
| Experiment Setup | Yes | Table 10: Hyperparameters for supervised and RL fine-tuning. Columns: Phase, Epochs, Learning Rates, Batch Size, Optimizer, LR Schedule, Gradient Clipping... Table 11: RL fine-tuning hyperparameters: KL-regularization parameter (β) sweep and QRPO number of generated reference rewards. |
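
The Dataset Splits entry says each dataset's held-out test set is cut into two equal halves, one for hyperparameter selection and one for reporting final numbers. The sketch below shows what that would mean for the quoted sizes. It is only an illustration: the function name `split_held_out`, the shuffling step, and the seed are assumptions, since this summary does not say how the authors performed the split.

```python
import random


def split_held_out(indices, seed=0):
    """Shuffle held-out example indices and cut them into two equal halves.

    Hypothetical helper: the shuffle and the seed are assumptions, not
    details taken from the paper.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Held-out set sizes quoted in the Dataset Splits entry.
held_out_sizes = {"Magpie-Air": 2000, "UltraFeedback": 2000, "LeetCode": 228}

splits = {name: split_held_out(range(n)) for name, n in held_out_sizes.items()}
for name, (val_idx, test_idx) in splits.items():
    print(f"{name}: {len(val_idx)} validation, {len(test_idx)} test")
```

Under these assumptions the two general-chat datasets would each yield 1,000 validation and 1,000 test examples, and the coding dataset 114 of each. Fixing the seed keeps the validation and test halves disjoint and stable across runs.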