Is the Bellman residual a bad proxy?

Authors: Matthieu Geist, Bilal Piot, Olivier Pietquin

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
    This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. In Sec. 4, we conduct experiments on randomly generated generic Markov decision processes to compare both approaches empirically.

Researcher Affiliation | Collaboration
    1 Université de Lorraine & CNRS, LIEC, UMR 7360, Metz, F-57070, France; 2 Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL, F-59000 Lille, France; 3 Now with Google DeepMind, London, United Kingdom

Pseudocode | No
    The paper discusses algorithmic approaches and the estimation of subgradients but does not provide any pseudocode or algorithm blocks.

Open Source Code | No
    The paper does not include any statement or link indicating that source code for the described methodology is publicly available.

Open Datasets | No
    We consider Garnet problems [2, 4]. They are a class of randomly built MDPs meant to be totally abstract while remaining representative of the problems that might be encountered in practice. Here, a Garnet G(|S|, |A|, b) is specified by the number of states, the number of actions and the branching factor.

Dataset Splits | No
    The paper describes experimental setups and iteration counts but does not specify training, validation, or test dataset splits.

Hardware Specification | No
    The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used to run the experiments.

Software Dependencies | No
    The paper does not list any specific software dependencies with version numbers.

Experiment Setup | Yes
    We optimize the relative objective functions with a normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate α = 0.1. For each Garnet-feature couple, we run both algorithms for T = 1000 iterations.
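The Open Datasets row quotes the paper's description of Garnet problems: a Garnet G(|S|, |A|, b) is a random MDP specified by the number of states, the number of actions, and the branching factor b (how many next states each state-action pair can reach). The sketch below builds such an MDP; it is not the authors' code, and the uniform-in-[0, 1] rewards and the cut-point construction of transition probabilities are assumptions, since the quote does not fix them.

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, seed=0):
    """Randomly build a Garnet G(|S|, |A|, b): for each (s, a), exactly
    `branching` next states get nonzero probability. Reward scheme
    (uniform in [0, 1]) is an assumption, not from the paper."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Pick b distinct successor states for this (s, a) pair.
            succ = rng.choice(n_states, size=branching, replace=False)
            # b - 1 sorted cut points in [0, 1] induce b probabilities.
            cuts = np.sort(rng.random(branching - 1))
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.random((n_states, n_actions))
    return P, R

P, R = make_garnet(n_states=30, n_actions=4, branching=2)
assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) row is a distribution
```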
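The Experiment Setup row quotes the optimization scheme: normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate α = 0.1 for T = 1000 iterations. A minimal sketch of one such update, not the authors' implementation; the toy objective in the usage loop is invented for illustration.

```python
import numpy as np

def normalized_gradient_step(theta, grad, alpha=0.1):
    """One normalized (sub)gradient ascent step with constant learning
    rate alpha: move by alpha along the unit-norm gradient direction."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return theta  # a subgradient may vanish at a kink; stay put
    return theta + alpha * grad / norm

# Toy usage: ascend f(theta) = -||theta - target||^2 for T = 1000 steps.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = normalized_gradient_step(theta, -2.0 * (theta - target))
```

With a constant normalized step, the iterate cannot converge exactly; it ends up oscillating within roughly α of the maximizer, which is consistent with the paper reporting results after a fixed budget of T iterations rather than at convergence.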