Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
Authors: Simon Matrenok, Skander Moalla, Caglar Gulcehre
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, QRPO consistently achieves top performance on chat and coding evaluations (reward model scores, AlpacaEval 2, and LeetCode) compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. |
| Researcher Affiliation | Academia | Simon Matrenok (CLAIRE, EPFL); Skander Moalla (CLAIRE, EPFL); Caglar Gulcehre (CLAIRE, EPFL) |
| Pseudocode | Yes | Algorithm 1 (Offline) Quantile Reward Policy Optimization |
| Open Source Code | Yes | We release a reference implementation of QRPO based on a Hugging Face TRL (von Werra et al., 2020) trainer also supporting the baselines we trained (DPO, REBEL, SimPO) at https://github.com/CLAIRE-Labo/quantile-reward-policy-optimization. |
| Open Datasets | Yes | Magpie-Align/Magpie-Air-DPO-100K-v0.1 (Magpie-Air) (Xu et al., 2024b) is a strong alignment dataset containing 98,000 training samples (and 2,000 testing samples)... HuggingFaceH4/ultrafeedback_binarized (UltraFeedback) (Cui et al., 2024) consists of 61,135 training samples (and 2,000 testing samples)... For the coding task, we use newfacade/LeetCodeDataset (LeetCode) (Xia et al., 2025), which contains 2,641 training samples (and 228 testing samples)... |
| Dataset Splits | Yes | Magpie-Air (Xu et al., 2024b): General Chat, 98,000 train / 2,000 test... UltraFeedback (Cui et al., 2024): General Chat, 61,135 train / 2,000 test... LeetCode (Xia et al., 2025): Coding, 2,641 train / 228 test... We additionally split the test set in each dataset into two equal splits: a validation subset for hyperparameter selection and a test subset to report the final numbers. |
| Hardware Specification | Yes | We distribute training with Hugging Face Accelerate (Gugger et al., 2022) using the DeepSpeed (Aminabadi et al., 2022) plugin at ZeRO stage 1 (optimizer state partitioning) over 8 GPUs from 2 nodes with 4 NVIDIA GH200 each. |
| Software Dependencies | No | The paper mentions Hugging Face Accelerate (Gugger et al., 2022), DeepSpeed (Aminabadi et al., 2022), Hugging Face TRL (von Werra et al., 2020), and Python Machine Learning Research Template (Moalla, 2025). Although it cites these tools with publication years, it does not give specific version numbers for them. |
| Experiment Setup | Yes | Table 10: Hyperparameters for supervised and RL fine-tuning (columns: Phase, Epochs, Learning Rates, Batch Size, Optimizer, LR Schedule, Gradient Clipping)... Table 11: RL fine-tuning hyperparameters: KL-regularization parameter (β) sweep and QRPO number of generated reference rewards. |
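
The validation/test split described in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_validation_test`, the shuffling step, and the seed are assumptions, since the summary does not state how examples were assigned to each half. Only the held-out sizes come from the table.

```python
import random


def split_validation_test(examples, seed=0):
    """Split a held-out set into two equal halves (validation, test).

    Assumption: examples are shuffled with a fixed seed before splitting;
    the summary above does not say how the halves were chosen.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    validation = [examples[i] for i in indices[:half]]
    test = [examples[i] for i in indices[half:]]
    return validation, test


# Held-out sizes reported in the table; integer ids stand in for real samples.
held_out_sizes = {"Magpie-Air": 2000, "UltraFeedback": 2000, "LeetCode": 228}

for name, size in held_out_sizes.items():
    validation, test = split_validation_test(list(range(size)))
    print(f"{name}: {len(validation)} validation, {len(test)} test")
```

Under this scheme the validation half would be used for hyperparameter selection and the test half only for the final reported numbers, so the two never overlap.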