Convergent Policy Optimization for Safe Reinforcement Learning

Authors: Ming Yu, Zhuoran Yang, Mladen Kolar, Zhaoran Wang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem. Furthermore, to extend our theoretical results, we apply our algorithm to examples of optimal control and multi-agent reinforcement learning with safety constraints. Section 6: Experiment.
Researcher Affiliation | Academia | The University of Chicago Booth School of Business, Chicago, IL (email: ming93@uchicago.edu); Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ; The University of Chicago Booth School of Business, Chicago, IL; Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.
Pseudocode | Yes | Algorithm 1: Successive convex relaxation algorithm for constrained MDP. Algorithm 2: Actor-Critic update for constrained MDP.
Open Source Code | Yes | The code is available at https://github.com/ming93/Safe_reinforcement_learning
Open Datasets | No | The paper describes a simulated environment (LQR) where data is generated: 'The initial state distribution is uniform on the unit cube: x0 ~ D = Uniform([-1, 1]^15).' It does not use or provide access information for a public, pre-existing dataset.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | The learning rates are set as 2/(3k^(3/4)) and 2/(3k^(2/3)). The initial state distribution is uniform on the unit cube: x0 ~ D = Uniform([-1, 1]^15). We initialize F0 as an all-zero matrix.
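
As a reading aid, the following is a minimal Python sketch of the experiment-setup quantities quoted in the row above. It assumes the two learning rates decay as 2/(3k^(3/4)) and 2/(3k^(2/3)); the names alpha_k and beta_k, the action dimension, and the shape of the policy matrix F0 are illustrative assumptions and are not taken from the paper.

import numpy as np

STATE_DIM = 15   # implied by Uniform([-1, 1]^15)
ACTION_DIM = 4   # hypothetical; the report does not state the action dimension

def learning_rates(k):
    """Decaying step sizes as quoted: 2/(3*k^(3/4)) and 2/(3*k^(2/3))."""
    alpha_k = 2.0 / (3.0 * k ** 0.75)         # first step size (name is an assumption)
    beta_k = 2.0 / (3.0 * k ** (2.0 / 3.0))   # second step size (name is an assumption)
    return alpha_k, beta_k

def sample_initial_state(rng):
    """x0 ~ D = Uniform([-1, 1]^15): initial state drawn uniformly from the cube."""
    return rng.uniform(-1.0, 1.0, size=STATE_DIM)

# Linear policy for the LQR example, initialized as an all-zero matrix F0.
F0 = np.zeros((ACTION_DIM, STATE_DIM))

rng = np.random.default_rng(0)
x0 = sample_initial_state(rng)
alpha_1, beta_1 = learning_rates(k=1)

This sketch only reproduces the quantities the report quotes (step-size schedules, initial-state distribution, zero initialization); it is not the authors' training loop from the linked repository.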