Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Authors: Jin Peng Zhou, Kaiwen Wang, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kilian Q. Weinberger, Kianté Brantley, Wen Sun

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. ... We provide extensive experiments on math reasoning tasks that validate the effectiveness of our method at maximizing reward while maintaining small KL deviations from the reference policy (Section 3.2). |
| Researcher Affiliation | Collaboration | ¹Cornell University, ²Harvard University, ³Netflix, ⁴Databricks |
| Pseudocode | Yes | Algorithm 1 $Q\sharp$. 1: Input: reference policy $\pi^{\text{ref}}$. 2: Initialize parameters $\theta^1$ of conditional distribution $Z_\theta : \mathcal{X} \times \mathcal{Y} \to \Delta(\mathbb{R})$ and dataset $\mathcal{D}_h = \emptyset$ for all $h$. 3: for $k = 1, 2, \dots$ until convergence do |
| Open Source Code | Yes | The code can be found at https://github.com/jinpz/q_sharp. |
| Open Datasets | Yes | We evaluate on two mathematical reasoning benchmarks: GSM8K [19], a dataset of grade school arithmetic word problems, and MATH [43], which features more challenging high school competition problems. ... In Appendix G, we also evaluate $Q\sharp$ on AIME-24 dataset. ... To further validate the generality of $Q\sharp$ beyond mathematical reasoning tasks, we evaluate its performance on QuALITY [47] |
| Dataset Splits | Yes | We split each training set 90%-10% for training and validation. Test performance is reported on the full GSM8K test set and a 500-sample subset of MATH (MATH-500), following prior work [44, 45]. |
| Hardware Specification | Yes | All models are trained on a single A100 or H100 GPU. ... On an Nvidia A6000, generating one response on test set of MATH takes 4.10s for $\pi^{\text{ref}}$ and 5.18s for $Q\sharp$, slightly exceeding 12.5% possibly due to sequential Q function computation in LogitProcessor. |
| Software Dependencies | No | The paper mentions 'AdamW optimizer [91]' but does not specify any software versions for programming languages or libraries. |
| Experiment Setup | Yes | Throughout, we used the AdamW optimizer with weight decay 0.1 and batch size of 256, and trained for 10 epochs. ... We use a learning rate of 2e-5 and weight decay of 0.01 with AdamW optimizer [91]. The model is trained for 5 epochs. We train $Q\sharp$ for two iterations... Unless otherwise noted, the $Q^{\star,\eta}$ function in $Q\sharp$ is parameterized and initialized with a Llama 3.2 1B model, and we use $\eta = 0.1$, which yields consistent and strong performance. |
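
The Dataset Splits entry reports a 90%-10% train/validation split of each training set. The following is a minimal sketch of one way to make such a split reproducible. It is not taken from the paper's code: the function name `split_train_val`, the fixed seed, and the shuffle-then-slice strategy are all assumptions made for illustration.

```python
import random


def split_train_val(examples, val_fraction=0.1, seed=0):
    """Shuffle indices with a fixed seed and split into (train, validation).

    Hypothetical helper: the paper only states the 90%-10% ratio, not how
    the split was drawn.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")

    indices = list(range(len(examples)))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)

    n_val = round(len(examples) * val_fraction)
    val_idx = set(indices[:n_val])

    # Preserve the original ordering within each partition.
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val


if __name__ == "__main__":
    train, val = split_train_val(list(range(1000)))
    print(len(train), len(val))
```

Fixing the seed means the same validation examples are held out on every run, which keeps validation metrics comparable across training iterations.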