A Theory of Regularized Markov Decision Processes

Authors: Matthieu Geist, Bruno Scherrer, Olivier Pietquin

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. All proofs are provided in the appendix.
Researcher Affiliation | Collaboration | ¹Google Research, Brain Team. ²Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France.
Pseudocode | No | The paper presents mathematical formulations of algorithms, such as equation (1) for regularized modified policy iteration, but it does not include formal pseudocode blocks or an 'Algorithm' section.
Open Source Code | No | The paper does not provide any statements about releasing code or links to a code repository for the described methodology.
Open Datasets | No | The paper is theoretical and does not present experiments with datasets. It refers to 'observed transitions (s_i, a_i, r_i, s'_i)' in the context of describing how existing algorithms (like Soft Q-learning) work in general, but not as data used in the authors' own experiments.
Dataset Splits | No | The paper is theoretical and does not involve empirical experiments requiring dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any empirical experiments, therefore no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and does not describe any empirical experiments, therefore no specific software dependencies with version numbers are mentioned.
Experiment Setup | No | The paper is theoretical and focuses on mathematical derivations and analyses of a general framework. It does not describe an experimental setup, hyperparameters, or training settings for empirical evaluation.
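
The Research Type row names a regularized Bellman operator and the Legendre-Fenchel transform as the core building blocks. The sketch below illustrates that idea for one familiar special case and is not code from the paper, which provides none. It assumes the regularizer is the negative Shannon entropy with unit scale, whose Legendre-Fenchel transform is the log-sum-exp function and whose maximizing policy is the softmax. It also assumes a hypothetical two-state, two-action MDP; the function names, the transition table `P`, and the reward table `r` are illustrative choices, not taken from the source.

```python
import math

def logsumexp(xs):
    # Legendre-Fenchel transform of the negative entropy, computed stably.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    # Gradient of logsumexp: the policy that attains the regularized maximum.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def q_values(P, r, v, gamma):
    # q[s][a] = r[s][a] + gamma * sum over s' of P[s][a][s'] * v[s']
    return [
        [r[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))
         for a in range(len(r[s]))]
        for s in range(len(r))
    ]

def regularized_value_iteration(P, r, gamma, iters=500):
    # Repeatedly apply the entropy-regularized Bellman optimality operator,
    # v(s) <- logsumexp_a q(s, a), then read off the softmax policy.
    v = [0.0] * len(r)
    for _ in range(iters):
        v = [logsumexp(q_s) for q_s in q_values(P, r, v, gamma)]
    policy = [softmax(q_s) for q_s in q_values(P, r, v, gamma)]
    return v, policy

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 1.0],
     [0.0, 2.0]]

v, policy = regularized_value_iteration(P, r, gamma=0.9)
print(v, policy)
```

This corresponds to the value iteration end of the modified policy iteration family mentioned in the Pseudocode row. Other regularizers would replace `logsumexp` and `softmax` with their own convex conjugate and its gradient.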