Convergent Policy Optimization for Safe Reinforcement Learning
Authors: Ming Yu, Zhuoran Yang, Mladen Kolar, Zhaoran Wang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem. Furthermore, to extend our theoretical results, we apply our algorithm to examples of optimal control and multi-agent reinforcement learning with safety constraints. Section 6: Experiment. |
| Researcher Affiliation | Academia | The University of Chicago Booth School of Business, Chicago, IL. Email: ming93@uchicago.edu. Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ. The University of Chicago Booth School of Business, Chicago, IL. Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL. |
| Pseudocode | Yes | Algorithm 1: Successive convex relaxation algorithm for constrained MDP. Algorithm 2: Actor-Critic update for constrained MDP. (A hedged sketch of the overall loop structure appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/ming93/Safe_reinforcement_learning |
| Open Datasets | No | The paper describes a simulated environment (LQR) where data is generated: 'The initial state distribution is uniform on the unit cube: x0 ~ D = Uniform([-1, 1]^15).' It does not use or provide access information for a public, pre-existing dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | The learning rates are set as 2/(3k^(3/4)) and 2/(3k^(2/3)). The initial state distribution is uniform on the unit cube: x0 ~ D = Uniform([-1, 1]^15). We initialize F0 as an all-zero matrix. (See the setup sketch after this table.) |
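
The pseudocode row names a successive convex relaxation algorithm and an actor-critic update for the constrained MDP. The following is a minimal sketch of a generic successive convex approximation loop with a diminishing step size, in the spirit of that structure; the toy objective `f`, constraint `g`, and helper names such as `convex_surrogate_step` are illustrative assumptions and are not taken from the paper or its released code.

```python
# Sketch: successive convex relaxation for min f(theta) s.t. g(theta) <= 0.
# f and g stand in for the (negative) reward and the constraint-cost values
# of a constrained MDP; here they are simple placeholders.
import numpy as np
from scipy.optimize import minimize

def f(theta):                      # nonconvex objective (placeholder)
    return np.sin(theta[0]) + 0.5 * theta @ theta

def g(theta):                      # constraint g(theta) <= 0 (placeholder)
    return theta.sum() - 1.0

def grad(fun, theta, eps=1e-6):
    # Finite-difference gradient, only to keep the sketch self-contained.
    e = np.eye(theta.size)
    return np.array([(fun(theta + eps * e[i]) - fun(theta - eps * e[i])) / (2 * eps)
                     for i in range(theta.size)])

def convex_surrogate_step(theta_k, rho=1.0):
    # Solve a convex surrogate: linearized objective plus a proximal term,
    # subject to the linearized constraint.
    gf, gg = grad(f, theta_k), grad(g, theta_k)
    obj = lambda th: gf @ (th - theta_k) + 0.5 * rho * np.sum((th - theta_k) ** 2)
    cons = {"type": "ineq", "fun": lambda th: -(g(theta_k) + gg @ (th - theta_k))}
    return minimize(obj, theta_k, constraints=[cons]).x

theta = np.zeros(3)
for k in range(1, 51):
    theta_hat = convex_surrogate_step(theta)
    step = 2.0 / (3.0 * k ** 0.75)          # diminishing step size
    theta = theta + step * (theta_hat - theta)
print("final iterate:", np.round(theta, 3))
```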
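
The experiment-setup row reports diminishing learning rates, a uniform initial-state distribution on [-1, 1]^15, and an all-zero initialization of F0. The snippet below only re-expresses those reported settings; the variable names, the shape of `F`, and the reading of the step sizes as 2/(3k^(3/4)) and 2/(3k^(2/3)) are assumptions for illustration.

```python
import numpy as np

state_dim = 15                     # x0 ~ Uniform([-1, 1]^15), as reported
rng = np.random.default_rng(0)

def learning_rates(k):
    # Diminishing step sizes as read from the experiment section.
    return 2.0 / (3.0 * k ** 0.75), 2.0 / (3.0 * k ** (2.0 / 3.0))

def sample_initial_state():
    return rng.uniform(-1.0, 1.0, size=state_dim)

# All-zero initialization of F0; the paper states only that F0 is all zeros,
# so the square shape here is an assumption.
F = np.zeros((state_dim, state_dim))

for k in range(1, 4):
    rate_a, rate_b = learning_rates(k)
    x0 = sample_initial_state()
    print(f"iter {k}: rates=({rate_a:.3f}, {rate_b:.3f}), x0[:3]={np.round(x0[:3], 2)}")
```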