Temporal Regularization for Markov Decision Process

Authors: Pierre Thodoroff, Audrey Durand, Joelle Pineau, Doina Precup

Venue: NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.
Researcher Affiliation | Collaboration | Pierre Thodoroff (McGill University, pierre.thodoroff@mail.mcgill.ca); Audrey Durand (McGill University, audrey.durand@mcgill.ca); Joelle Pineau (McGill University & Facebook AI Research, jpineau@cs.mcgill.ca); Doina Precup (McGill University, dprecup@cs.mcgill.ca)
Pseudocode | Yes | Algorithm 1: Policy evaluation with exponential smoothing (see the policy-evaluation sketch below the table)
Open Source Code | Yes | Our implementation is based on the publicly available OpenAI codebase [8]. The code can be found at https://github.com/pierthodo/temporal_regularization.
Open Datasets | Yes | We evaluate the performance in the Arcade Learning Environment [4] (see the environment sketch below the table)
Dataset Splits | No | The paper mentions 10 independent runs with random seeds but does not explicitly describe train/validation/test splits or a cross-validation setup for the datasets used in the experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance specifications used for running the experiments.
Software Dependencies | No | The paper mentions using the "OpenAI codebase" but does not specify version numbers for it or for any other software dependency.
Experiment Setup | Yes | The hyper-parameters for the temporal regularization are β = λ = 0.2 and a decay of 1e-5, selected on 7 games and 3 training seeds. All other hyper-parameters correspond to those used in the PPO paper. (See the configuration sketch below the table.)
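
The pseudocode row refers to the paper's Algorithm 1 (policy evaluation with exponential smoothing). Below is a minimal tabular sketch of the idea: the TD(0) bootstrap target mixes the next-state value with an exponentially smoothed average of the values of previously visited states. The `env`/`policy` interface, the initialization of the smoothed trace, and the exact update order are assumptions for illustration, not the paper's verbatim algorithm.

```python
import numpy as np

def td0_temporal_regularization(env, policy, num_episodes,
                                alpha=0.1, gamma=0.99,
                                beta=0.2, lam=0.2):
    """Tabular TD(0) policy evaluation with temporal regularization (sketch).

    The bootstrap target is a convex combination (weight beta) of the
    next-state value and an exponentially smoothed average (factor lam)
    of the values of states visited earlier in the trajectory.
    `env` is a hypothetical tabular environment exposing `num_states`,
    `reset()` and `step(action) -> (next_state, reward, done)`.
    """
    V = np.zeros(env.num_states)
    for _ in range(num_episodes):
        s = env.reset()
        smoothed_prev = V[s]  # exponential average of past state values
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Temporally regularized TD target:
            #   r + gamma * ((1 - beta) * V[s'] + beta * smoothed_prev)
            target = r + gamma * ((1.0 - beta) * V[s_next] + beta * smoothed_prev)
            V[s] += alpha * (target - V[s])
            # Exponential smoothing over previously visited states' values
            smoothed_prev = (1.0 - lam) * V[s] + lam * smoothed_prev
            s = s_next
    return V
```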
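
For the open-datasets row, a minimal sketch of instantiating an Atari task from the Arcade Learning Environment through OpenAI Gym, the interface used by the OpenAI baselines codebase. The environment id, the Gym version (older 4-tuple `step` API), and the absence of frame-preprocessing wrappers here are assumptions; the paper's exact setup is not given in this excerpt.

```python
import gym

# Create an Atari environment from the Arcade Learning Environment via Gym.
# "BreakoutNoFrameskip-v4" is an illustrative choice, not necessarily one of
# the games used in the paper.
env = gym.make("BreakoutNoFrameskip-v4")

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()  # random actions, for illustration only
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```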
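
For the experiment-setup row, a hedged sketch of how the reported hyper-parameters (β = λ = 0.2, decay of 1e-5) might be organized alongside PPO settings. Only `beta`, `lambda`, and the decay value come from the paper; the PPO values and the linear form of the decay schedule are assumptions.

```python
# Hypothetical configuration mirroring the reported setup: temporal
# regularization strength beta, smoothing factor lambda, and a decay of 1e-5,
# on top of otherwise standard PPO hyper-parameters.
config = {
    # Temporal regularization (values reported in the paper)
    "beta": 0.2,          # weight on the smoothed past-value term
    "lambda": 0.2,        # exponential smoothing factor
    "beta_decay": 1e-5,   # per-update decay of the regularization weight

    # PPO hyper-parameters (assumed defaults from the PPO paper for Atari;
    # not specified in this excerpt)
    "learning_rate": 2.5e-4,
    "clip_range": 0.1,
    "num_steps": 128,
    "num_epochs": 3,
}

def decayed_beta(initial_beta, decay, update_idx):
    """One plausible reading of the decay: linear decay toward zero.

    The exact schedule is not specified in this excerpt; this is an assumption.
    """
    return max(0.0, initial_beta - decay * update_idx)
```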