Is the Bellman residual a bad proxy?

Authors: Matthieu Geist, Bilal Piot, Olivier Pietquin

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
    This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. In Sec. 4, we conduct experiments on randomly generated generic Markov decision processes to compare both approaches empirically.

Researcher Affiliation | Collaboration
    1 Université de Lorraine & CNRS, LIEC, UMR 7360, Metz, F-57070, France; 2 Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL, F-59000 Lille, France; 3 Now with Google DeepMind, London, United Kingdom

Pseudocode | No
    The paper discusses algorithmic approaches and the estimation of subgradients but does not provide any pseudocode or algorithm blocks.

Open Source Code | No
    The paper does not include any statement or link indicating that source code for the described methodology is publicly available.

Open Datasets | No
    We consider Garnet problems [2, 4]. They are a class of randomly built MDPs meant to be totally abstract while remaining representative of the problems that might be encountered in practice. Here, a Garnet G(|S|, |A|, b) is specified by the number of states, the number of actions and the branching factor.

Dataset Splits | No
    The paper describes experimental setups and iteration counts but does not specify training, validation, or test dataset splits.

Hardware Specification | No
    The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used to run the experiments.

Software Dependencies | No
    The paper does not list any specific software dependencies with version numbers.

Experiment Setup | Yes
    We optimize the relative objective functions with a normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate α = 0.1. For each Garnet-feature couple, we run both algorithms for T = 1000 iterations.
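The Open Datasets row quotes the paper's description of Garnet problems: a Garnet G(|S|, |A|, b) is a random MDP specified by the number of states, the number of actions, and the branching factor b (how many next states each state-action pair can reach). The sketch below builds such an MDP; it is not the authors' code, and the uniform-in-[0, 1] rewards and the cut-point construction of transition probabilities are assumptions, since the quote does not fix them.

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, seed=0):
    """Randomly build a Garnet G(|S|, |A|, b): for each (s, a), exactly
    `branching` next states get nonzero probability. Reward scheme
    (uniform in [0, 1]) is an assumption, not from the paper."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Pick b distinct successor states for this (s, a) pair.
            succ = rng.choice(n_states, size=branching, replace=False)
            # b - 1 sorted cut points in [0, 1] induce b probabilities.
            cuts = np.sort(rng.random(branching - 1))
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.random((n_states, n_actions))
    return P, R

P, R = make_garnet(n_states=30, n_actions=4, branching=2)
assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) row is a distribution
```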
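The Experiment Setup row quotes the optimization scheme: normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate α = 0.1 for T = 1000 iterations. A minimal sketch of one such update, not the authors' implementation; the toy objective in the usage loop is invented for illustration.

```python
import numpy as np

def normalized_gradient_step(theta, grad, alpha=0.1):
    """One normalized (sub)gradient ascent step with constant learning
    rate alpha: move by alpha along the unit-norm gradient direction."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return theta  # a subgradient may vanish at a kink; stay put
    return theta + alpha * grad / norm

# Toy usage: ascend f(theta) = -||theta - target||^2 for T = 1000 steps.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = normalized_gradient_step(theta, -2.0 * (theta - target))
```

With a constant normalized step, the iterate cannot converge exactly; it ends up oscillating within roughly α of the maximizer, which is consistent with the paper reporting results after a fixed budget of T iterations rather than at convergence.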