Constrained Markov Decision Processes via Backward Value Functions
Authors: Harsh Satija, Philip Amortila, Joelle Pineau
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks. We empirically validate our approach on RL benchmarks to measure the performance of the agent with respect to the accumulated return and cost during training. |
| Researcher Affiliation | Collaboration | Harsh Satija 1 2 3, Philip Amortila 1 2, Joelle Pineau 1 2 3. Work partly done while HS was an intern at Facebook. 1 Department of Computer Science, McGill University, Montreal, Canada; 2 Mila, Québec AI Institute; 3 Facebook AI Research, Montreal. |
| Pseudocode | Yes | Algorithm 1 A2C with Safety Layer for each actor thread |
| Open Source Code | No | The paper cites a third-party PyTorch implementation of RL algorithms (Kostrikov, 2018) which they used, but there is no explicit statement or link indicating that the authors have released their own source code for the methodology described in this paper. |
| Open Datasets | Yes | We provide empirical evidence of our approach with Deep RL methods on various safety benchmarks, including 2D navigation grid worlds (Leike et al., 2017; Chow et al., 2018), and MuJoCo tasks (Achiam et al., 2017; Chow et al., 2019). We design three simulated robot locomotion continuous control tasks using the MuJoCo simulator (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes the environment setup for tasks like the Grid World (e.g., 'The size of the grid is 12x12 cells, and the pits are randomly generated...') and mentions MuJoCo tasks, but it does not specify traditional train/validation/test splits for a fixed dataset; this is common in reinforcement learning, where agents learn directly through interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions using neural networks, n-step SARSA, A2C, and PPO algorithms, but does not provide specific version numbers for any software libraries or frameworks (e.g., PyTorch version, TensorFlow version, Gym version). |
| Experiment Setup | Yes | Even though our formulation is based on the undiscounted case, we use discounting with γ = 0.99 for estimating the value functions in order to be consistent with the baselines. The initial starting policy in our experiments is a random policy, and due to that we adopt a safe-guard (or recovery) policy update in the same manner as (Achiam et al., 2017; Chow et al., 2019), where if the agent ends up being in an infeasible policy space, we recover by an update that purely minimizes the constraint value. More experimental details can be found in Appendix G. More details about the tasks and network architecture can be found in the Appendix H. Algorithmic details can be found in Algorithm 1 and in Appendix I. We discuss an alternate update for implementing the algorithms using target networks in Appendix J. |
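
The Experiment Setup row above quotes the paper's safe-guard (recovery) policy update: if the agent ends up in an infeasible policy space, the update purely minimizes the constraint value instead of the usual return objective. Below is a minimal, hypothetical PyTorch-style sketch of that branching logic only; the function name, arguments (`constraint_threshold`, `cost_advantages`, etc.), and the surrogate losses are illustrative assumptions, not the paper's Algorithm 1 or its released implementation.

```python
import torch

def policy_update(optimizer, log_probs, advantages, cost_advantages,
                  estimated_episode_cost, constraint_threshold):
    """One policy-gradient step with a recovery branch (illustrative sketch).

    log_probs, advantages, cost_advantages: 1-D tensors collected from a
    rollout, with log_probs produced by the policy network (requires grad).
    estimated_episode_cost: scalar estimate of the policy's accumulated
    constraint cost over an episode.
    """
    if estimated_episode_cost > constraint_threshold:
        # Infeasible policy: recovery update that purely minimizes the
        # constraint (cost) value, as described in the quoted setup.
        loss = (log_probs * cost_advantages).mean()
    else:
        # Feasible policy: standard policy-gradient step that maximizes
        # the expected return.
        loss = -(log_probs * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the feasibility check and the cost estimate stand in for whatever criterion the authors use in Appendix G/I; the point is only to show how a single update rule can switch between a return-maximizing step and a constraint-minimizing recovery step.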