Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ShiQ: Bringing back Bellman to LLMs
Authors: Pierre Clavier, Nathan Grinsztajn, Raphaël Avalos, Yannis Flet-Berliac, Irem Ergun, Omar Darwiche Domingues, Olivier Pietquin, Pierre Richemond, Florian Strub, Matthieu Geist
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback, BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings. |
| Researcher Affiliation | Collaboration | Pierre Clavier1, Nathan Grinsztajn2, Raphael Avalos1, Yannis Flet-Berliac1, Irem Ergun1, Omar D. Domingues1, Olivier Pietquin2,3, Pierre H. Richemond1, Florian Strub1, and Matthieu Geist2,3 1Cohere 2Work done at Cohere 3Earth Species Project EMAIL ; EMAIL |
| Pseudocode | No | In this section, we outline the three principal components of our method culminating in the ShiQ algorithm. In Sec. 2.1, we adopt RL notations to derive the Bellman consistency equations... Finally, in Sec. 3, we restate the algorithm using LLM notation to simplify the implementation. |
| Open Source Code | No | The checklist is designed to encourage best practices for responsible machine learning research, addressing issues of reproducibility, transparency, research ethics, and societal impact. Do not remove the checklist: The papers not including the checklist will be desk rejected. The checklist should follow the references and follow the (optional) supplemental material. The checklist does NOT count towards the page limit. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: All the data is public data, and models are open weights models. |
| Open Datasets | Yes | Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback, BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings. We evaluate on the open-source Anthropic-Harmless and Anthropic-Helpful datasets [Bai et al., 2022] and UltraFeedback [Cui et al., 2023], chosen for their publicly available reward labels. We evaluate function-calling capabilities using the BFCL-V3 dataset introduced in the Gorilla framework by Patil et al. [2024]. |
| Dataset Splits | Yes | We divide the 200 samples of BFCL-v3 into 40 representative samples in the test and the rest in the training set. |
| Hardware Specification | No | Experiments were conducted on NVIDIA GPUs using the Harmful Harmless, UltraFeedback, and BFCL-v3 datasets. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers. |
| Experiment Setup | Yes | Each policy $\hat{\pi}$ is trained using stochastic gradient descent with the Adam optimizer, a learning rate of $10^{-3}$, batch size of 256, and for a total of 100 epochs. Regarding training, the models were trained for one epoch while sweeping over the parameter β in the set {0.001, 0.01, 0.1, 1} and picking the best β. A learning rate of $1 \times 10^{-6}$ was chosen for all experiments. Regarding BFCL training, models were trained for one epoch while sweeping over the parameter β in the set {0.001, 0.01, 0.1} and picking the best β. A learning rate of $1 \times 10^{-6}$ was chosen for all experiments. |
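
The dataset split and β sweep reported in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the paper says the 40 BFCL-v3 test samples are "representative" but the selection rule is not given here, so a seeded random shuffle stands in for it, and `placeholder_train_and_score` is a hypothetical stand-in for an actual one-epoch training and evaluation run.

```python
import random

# Values reported in the table above.
BETA_GRID_SINGLE_TURN = [0.001, 0.01, 0.1, 1]
BETA_GRID_BFCL = [0.001, 0.01, 0.1]
LEARNING_RATE = 1e-6
NUM_SAMPLES, NUM_TEST = 200, 40


def split_indices(num_samples, num_test, seed=0):
    """Return (train, test) index lists.

    Assumption: a seeded random split. The rule used to pick the
    "representative" test samples is not stated in the summary.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return sorted(indices[num_test:]), sorted(indices[:num_test])


def sweep_beta(betas, train_and_score):
    """Run one training per beta and return (best_beta, all_scores)."""
    scores = {beta: train_and_score(beta, LEARNING_RATE) for beta in betas}
    return max(scores, key=scores.get), scores


def placeholder_train_and_score(beta, learning_rate):
    # Hypothetical stand-in for "train one epoch, then evaluate".
    # It returns a made-up score that peaks at beta = 0.01.
    return -abs(beta - 0.01)


train_idx, test_idx = split_indices(NUM_SAMPLES, NUM_TEST)
best_beta, scores = sweep_beta(BETA_GRID_BFCL, placeholder_train_and_score)
```

Replacing the placeholder with a real training routine leaves the sweep logic unchanged; the single-turn experiments would use `BETA_GRID_SINGLE_TURN` instead.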