Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Is the Bellman residual a bad proxy?
Authors: Matthieu Geist, Bilal Piot, Olivier Pietquin
NeurIPS 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. In Sec. 4, we conduct experiments on randomly generated generic Markov decision processes to compare both approaches empirically. |
| Researcher Affiliation | Collaboration | 1 Université de Lorraine & CNRS, LIEC, UMR 7360, Metz, F-57070 France; 2 Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL, F-59000 Lille, France; 3 Now with Google DeepMind, London, United Kingdom |
| Pseudocode | No | The paper discusses algorithmic approaches and estimation of subgradients but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | We consider Garnet problems [2, 4]. They are a class of randomly built MDPs meant to be totally abstract while remaining representative of the problems that might be encountered in practice. Here, a Garnet G(|S|, |A|, b) is specified by the number of states, the number of actions and the branching factor. |
| Dataset Splits | No | The paper describes experimental setups and iteration counts but does not specify training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | We optimize the relative objective functions with a normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate α = 0.1. For each Garnet-feature couple, we run both algorithms for T = 1000 iterations. |
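
The Garnet construction and the normalized update quoted in the table can be illustrated with a minimal sketch. It is not the authors' code, which the paper does not release. The function names `make_garnet` and `normalized_step` are hypothetical, and the way transition probabilities are drawn (randomly partitioning the unit interval) and the uniform per-state reward are assumptions made only for illustration. Only the parameters G(|S|, |A|, b) and the learning rate α = 0.1 come from the quoted text.

```python
import random


def make_garnet(n_states, n_actions, branching, seed=0):
    """Build a random Garnet-style MDP G(|S|, |A|, b).

    Returns (P, R) where P[s][a] maps each of `branching` next states to a
    probability, and R[s] is a per-state reward (assumed scheme, see lead-in).
    """
    rng = random.Random(seed)
    P = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            # Choose b distinct successor states for this (state, action) pair.
            next_states = rng.sample(range(n_states), branching)
            # Assumption: probabilities come from b - 1 random cut points
            # that partition the unit interval into b segments.
            cuts = sorted(rng.random() for _ in range(branching - 1))
            bounds = [0.0] + cuts + [1.0]
            probs = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
            row.append(dict(zip(next_states, probs)))
        P.append(row)
    # Assumption: one uniform random reward per state.
    R = [rng.random() for _ in range(n_states)]
    return P, R


def normalized_step(theta, grad, alpha=0.1, ascent=True):
    """Move theta by alpha along the unit-norm (sub)gradient direction."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm == 0.0:
        return list(theta)  # No direction to follow; leave theta unchanged.
    sign = 1.0 if ascent else -1.0
    return [t + sign * alpha * g / norm for t, g in zip(theta, grad)]


if __name__ == "__main__":
    P, R = make_garnet(n_states=6, n_actions=2, branching=3, seed=1)
    print(len(P), len(P[0]), len(P[0][0]))
    print(normalized_step([0.0, 0.0], [3.0, 4.0]))
```

Because the step is normalized, every update has length α regardless of the gradient's magnitude. With a constant α the iterates therefore do not settle exactly on an optimum but stay within roughly α of it. Running the update for T = 1000 iterations, as in the quoted setup, would amount to calling `normalized_step` in a loop with the gradient (or subgradient, with `ascent=False`) of the chosen objective.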