Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ShiQ: Bringing back Bellman to LLMs

Authors: Pierre Clavier, Nathan Grinsztajn, Raphaël Avalos, Yannis Flet-Berliac, Irem Ergun, Omar Darwiche Domingues, Olivier Pietquin, Pierre Richemond, Florian Strub, Matthieu Geist

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback, BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.
Researcher Affiliation | Collaboration | Pierre Clavier¹, Nathan Grinsztajn², Raphael Avalos¹, Yannis Flet-Berliac¹, Irem Ergun¹, Omar D. Domingues¹, Olivier Pietquin²,³, Pierre H. Richemond¹, Florian Strub¹, and Matthieu Geist²,³. ¹Cohere, ²Work done at Cohere, ³Earth Species Project.
Pseudocode | No | In this section, we outline the three principal components of our method culminating in the ShiQ algorithm. In Sec. 2.1, we adopt RL notations to derive the Bellman consistency equations... Finally, in Sec. 3, we restate the algorithm using LLM notation to simplify the implementation.
Open Source Code | No | The checklist is designed to encourage best practices for responsible machine learning research, addressing issues of reproducibility, transparency, research ethics, and societal impact. Do not remove the checklist: The papers not including the checklist will be desk rejected. The checklist should follow the references and follow the (optional) supplemental material. The checklist does NOT count towards the page limit. 5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: All the data is public data, and models are open weights models.
Open Datasets | Yes | Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback, BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings. We evaluate on the open-source Anthropic-Harmless and Anthropic-Helpful datasets [Bai et al., 2022] and UltraFeedback [Cui et al., 2023], chosen for their publicly available reward labels. We evaluate function-calling capabilities using the BFCL-V3 dataset introduced in the Gorilla framework by Patil et al. [2024].
Dataset Splits | Yes | We divide the 200 samples of BFCL-v3 into 40 representative samples in the test and the rest in the training set.
Hardware Specification | No | Experiments were conducted on NVIDIA GPUs using the Harmful Harmless, UltraFeedback, and BFCL-v3 datasets.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers.
Experiment Setup | Yes | Each policy π̂ is trained using stochastic gradient descent with the Adam optimizer, a learning rate of 10⁻³, batch size of 256, and for a total of 100 epochs. Regarding training, the models were trained for one epoch while sweeping over the parameter β in the set {0.001, 0.01, 0.1, 1} and picking the best β. A learning rate of 1 × 10⁻⁶ was chosen for all experiments. Regarding BFCL training, models were trained for one epoch while sweeping over the parameter β in the set {0.001, 0.01, 0.1} and picking the best β. A learning rate of 1 × 10⁻⁶ was chosen for all experiments.
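
As a reading aid, the numbers quoted in the Dataset Splits and Experiment Setup rows can be collected into a small Python sketch. This is not the authors' code: the dictionary keys, the function names, and the label `unnamed_sweep` (the quoted text does not say which benchmark the first β sweep belongs to) are all hypothetical, and the learning rates are taken to be 10^-3 and 1e-6.

```python
# Hypothetical summary of the quoted setup; all names here are illustrative only.
REPORTED_SETUPS = {
    # Synthetic experiment: Adam, batch size 256, 100 epochs (learning rate read as 1e-3).
    "synthetic": {"optimizer": "Adam", "learning_rate": 1e-3,
                  "batch_size": 256, "epochs": 100},
    # First beta sweep; the quoted text does not name its benchmark (assumption: kept unnamed).
    "unnamed_sweep": {"learning_rate": 1e-6, "epochs": 1,
                      "beta_grid": (0.001, 0.01, 0.1, 1.0)},
    # BFCL training sweep.
    "bfcl": {"learning_rate": 1e-6, "epochs": 1,
             "beta_grid": (0.001, 0.01, 0.1)},
}

BFCL_TOTAL_SAMPLES = 200  # quoted in the Dataset Splits row
BFCL_TEST_SAMPLES = 40    # quoted in the Dataset Splits row


def bfcl_split_sizes(total=BFCL_TOTAL_SAMPLES, test=BFCL_TEST_SAMPLES):
    """Return (train, test) sizes when `test` samples are held out of `total`."""
    return total - test, test


def runs_per_setup(setups=REPORTED_SETUPS):
    """Count training runs per setup: one per beta value, or one if no sweep is listed."""
    return {name: len(cfg.get("beta_grid", (None,))) for name, cfg in setups.items()}


if __name__ == "__main__":
    print(bfcl_split_sizes())
    print(runs_per_setup())
```

Under these assumptions the BFCL split leaves 160 training samples, and the two sweeps correspond to four and three training runs per model respectively. How the 40 "representative" test samples were selected is not stated in the quoted text, so the sketch only covers the sizes.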