Is the Bellman residual a bad proxy?
Authors: Matthieu Geist, Bilal Piot, Olivier Pietquin
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. In Sec. 4, we conduct experiments on randomly generated generic Markov decision processes to compare both approaches empirically. (Both criteria are written out in the math sketch after the table.) |
| Researcher Affiliation | Collaboration | 1 Université de Lorraine & CNRS, LIEC, UMR 7360, Metz, F-57070 France 2 Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL, F-59000 Lille, France 3 Now with Google DeepMind, London, United Kingdom |
| Pseudocode | No | The paper discusses algorithmic approaches and estimation of subgradients but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | We consider Garnet problems [2, 4]. They are a class of randomly built MDPs meant to be totally abstract while remaining representative of the problems that might be encountered in practice. Here, a Garnet G(|S|, |A|, b) is specified by the number of states, the number of actions and the branching factor. (A minimal generation sketch follows the table.) |
| Dataset Splits | No | The paper describes experimental setups and iteration counts but does not specify training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | We optimize the relative objective functions with a normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate α = 0.1. For each Garnet-feature couple, we run both algorithms for T = 1000 iterations. (The update rule is sketched after the table.) |
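
The two criteria named in the Research Type row are usually written as follows; the notation below (state distributions ν and μ, the weighted ℓ1 norm, the Bellman optimality operator T*) is an assumed, common formulation rather than a quote from the paper.

```latex
% Mean-value maximization vs. Bellman-residual minimization over policies \pi
% (notation assumed: \nu, \mu are state distributions, T_* is the Bellman
%  optimality operator, v_\pi is the value function of \pi).
\[
  \max_{\pi}\; J_{\nu}(\pi) \;=\; \mathbb{E}_{s \sim \nu}\!\left[ v_{\pi}(s) \right]
  \qquad\text{versus}\qquad
  \min_{\pi}\; \left\| T_{*} v_{\pi} - v_{\pi} \right\|_{1,\mu}.
\]
```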
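The Open Datasets row explains that no public dataset is used; instead, Garnet MDPs G(|S|, |A|, b) are drawn at random. The sketch below shows one common way to build such an MDP, assuming NumPy and a uniform reward model (the paper's exact reward generation is not stated in the table above).

```python
import numpy as np

def generate_garnet(n_states, n_actions, branching, rng=None):
    """Sketch of a Garnet G(|S|, |A|, b) generator.

    For each (state, action) pair, the next state is one of `branching`
    randomly chosen states, with transition probabilities drawn at random.
    Reward generation varies across papers; uniform rewards are an assumption here.
    """
    rng = np.random.default_rng(rng)
    P = np.zeros((n_states, n_actions, n_states))  # transition kernel P[s, a, s']
    R = rng.uniform(size=(n_states, n_actions))    # assumed reward model
    for s in range(n_states):
        for a in range(n_actions):
            successors = rng.choice(n_states, size=branching, replace=False)
            # Random cumulative cut points give a random probability vector.
            cuts = np.sort(rng.uniform(size=branching - 1))
            probs = np.diff(np.concatenate(([0.0], cuts, [1.0])))
            P[s, a, successors] = probs
    return P, R
```

For example, `generate_garnet(30, 4, 2, rng=0)` would produce a 30-state, 4-action MDP in which each state-action pair has two possible successor states; the sizes here are illustrative, not the paper's settings.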
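The Experiment Setup row reports a normalized gradient ascent (mean value) and a normalized subgradient descent (Bellman residual) with a constant learning rate α = 0.1 and T = 1000 iterations. The loop below is a minimal sketch of that update rule only; `gradient_of_objective` and the policy parameterization are hypothetical placeholders, not the paper's estimators.

```python
import numpy as np

def normalized_gradient_steps(theta0, gradient_of_objective,
                              ascent=True, alpha=0.1, n_iterations=1000):
    """Constant-step normalized (sub)gradient method.

    `gradient_of_objective(theta)` is a hypothetical callable returning a
    (sub)gradient of the chosen criterion at the current policy parameters.
    Set `ascent=True` to maximize (mean value) or `ascent=False` to minimize
    (Bellman residual).
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iterations):
        g = np.asarray(gradient_of_objective(theta), dtype=float)
        norm = np.linalg.norm(g)
        if norm == 0.0:                # zero (sub)gradient: nothing to update
            break
        step = alpha * g / norm        # normalization makes every step have length alpha
        theta = theta + step if ascent else theta - step
    return theta
```

Normalizing the (sub)gradient makes each step have length α, so the same constant learning rate can be shared by objectives whose gradients have very different scales.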