Constrained Markov Decision Processes via Backward Value Functions
Authors: Harsh Satija, Philip Amortila, Joelle Pineau
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks. We empirically validate our approach on RL benchmarks to measure the performance of the agent with respect to the accumulated return and cost during training. |
| Researcher Affiliation | Collaboration | Harsh Satija 1 2 3, Philip Amortila 1 2, Joelle Pineau 1 2 3. Work partly done while HS was an intern at Facebook. 1 Department of Computer Science, McGill University, Montreal, Canada; 2 Mila, Québec AI Institute; 3 Facebook AI Research, Montreal. |
| Pseudocode | Yes | Algorithm 1 A2C with Safety Layer for each actor thread |
| Open Source Code | No | The paper cites a third-party PyTorch implementation of RL algorithms (Kostrikov, 2018) which they used, but there is no explicit statement or link indicating that the authors have released their own source code for the methodology described in this paper. |
| Open Datasets | Yes | We provide empirical evidence of our approach with Deep RL methods on various safety benchmarks, including 2D navigation grid worlds (Leike et al., 2017; Chow et al., 2018), and MuJoCo tasks (Achiam et al., 2017; Chow et al., 2019). We design three simulated robot locomotion continuous control tasks using the MuJoCo simulator (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes the environment setup for tasks like the Grid World (e.g., 'The size of the grid is 12x12 cells, and the pits are randomly generated...') and mentions MuJoCo tasks, but it does not specify traditional train/validation/test splits for a fixed dataset; this is common in reinforcement learning, where agents learn directly through interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions using neural networks, n-step SARSA, A2C, and PPO algorithms, but does not provide specific version numbers for any software libraries or frameworks (e.g., PyTorch version, TensorFlow version, Gym version). |
| Experiment Setup | Yes | Even though our formulation is based on the undiscounted case, we use discounting with γ = 0.99 for estimating the value functions in order to be consistent with the baselines. The initial starting policy in our experiments is a random policy, and due to that we adopt a safe-guard (or recovery) policy update in the same manner as (Achiam et al., 2017; Chow et al., 2019), where if the agent ends up being in an infeasible policy space, we recover by an update that purely minimizes the constraint value. More experimental details can be found in Appendix G. More details about the tasks and network architecture can be found in the Appendix H. Algorithmic details can be found in Algorithm 1 and in Appendix I. We discuss an alternate update for implementing the algorithms using target networks in Appendix J. |
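
The Experiment Setup row above quotes the paper's safe-guard (recovery) policy update: if the agent ends up in an infeasible policy space, the update purely minimizes the constraint value instead of the usual return objective. Below is a minimal, hypothetical PyTorch-style sketch of that branching logic only; the function name, arguments (`constraint_threshold`, `cost_advantages`, etc.), and the surrogate losses are illustrative assumptions, not the paper's Algorithm 1 or its released implementation.

```python
import torch

def policy_update(optimizer, log_probs, advantages, cost_advantages,
                  estimated_episode_cost, constraint_threshold):
    """One policy-gradient step with a recovery branch (illustrative sketch).

    log_probs, advantages, cost_advantages: 1-D tensors collected from a
    rollout, with log_probs produced by the policy network (requires grad).
    estimated_episode_cost: scalar estimate of the policy's accumulated
    constraint cost over an episode.
    """
    if estimated_episode_cost > constraint_threshold:
        # Infeasible policy: recovery update that purely minimizes the
        # constraint (cost) value, as described in the quoted setup.
        loss = (log_probs * cost_advantages).mean()
    else:
        # Feasible policy: standard policy-gradient step that maximizes
        # the expected return.
        loss = -(log_probs * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the feasibility check and the cost estimate stand in for whatever criterion the authors use in Appendix G/I; the point is only to show how a single update rule can switch between a return-maximizing step and a constraint-minimizing recovery step.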