Convergent Policy Optimization for Safe Reinforcement Learning
Authors: Ming Yu, Zhuoran Yang, Mladen Kolar, Zhaoran Wang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem. Furthermore, to extend our theoretical results, we apply our algorithm to examples of optimal control and multi-agent reinforcement learning with safety constraints. Section 6: Experiment. |
| Researcher Affiliation | Academia | The University of Chicago Booth School of Business, Chicago, IL. Email: ming93@uchicago.edu. Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ. The University of Chicago Booth School of Business, Chicago, IL. Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL. |
| Pseudocode | Yes | Algorithm 1: Successive convex relaxation algorithm for constrained MDP. Algorithm 2: Actor-Critic update for constrained MDP. (A hedged sketch of the overall loop structure appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/ming93/Safe_reinforcement_learning |
| Open Datasets | No | The paper describes a simulated environment (LQR) where data is generated: 'The initial state distribution is uniform on the unit cube: x0 ~ D = Uniform([-1, 1]^15).' It does not use or provide access information for a public, pre-existing dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | The learning rates are set as 2/(3k^(3/4)) and 2/(3k^(2/3)). The initial state distribution is uniform on the unit cube: x0 ~ D = Uniform([-1, 1]^15). We initialize F0 as an all-zero matrix. (See the setup sketch after this table.) |
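
The pseudocode row names a successive convex relaxation algorithm and an actor-critic update for the constrained MDP. The following is a minimal sketch of a generic successive convex approximation loop with a diminishing step size, in the spirit of that structure; the toy objective `f`, constraint `g`, and helper names such as `convex_surrogate_step` are illustrative assumptions and are not taken from the paper or its released code.

```python
# Sketch: successive convex relaxation for min f(theta) s.t. g(theta) <= 0.
# f and g stand in for the (negative) reward and the constraint-cost values
# of a constrained MDP; here they are simple placeholders.
import numpy as np
from scipy.optimize import minimize

def f(theta):                      # nonconvex objective (placeholder)
    return np.sin(theta[0]) + 0.5 * theta @ theta

def g(theta):                      # constraint g(theta) <= 0 (placeholder)
    return theta.sum() - 1.0

def grad(fun, theta, eps=1e-6):
    # Finite-difference gradient, only to keep the sketch self-contained.
    e = np.eye(theta.size)
    return np.array([(fun(theta + eps * e[i]) - fun(theta - eps * e[i])) / (2 * eps)
                     for i in range(theta.size)])

def convex_surrogate_step(theta_k, rho=1.0):
    # Solve a convex surrogate: linearized objective plus a proximal term,
    # subject to the linearized constraint.
    gf, gg = grad(f, theta_k), grad(g, theta_k)
    obj = lambda th: gf @ (th - theta_k) + 0.5 * rho * np.sum((th - theta_k) ** 2)
    cons = {"type": "ineq", "fun": lambda th: -(g(theta_k) + gg @ (th - theta_k))}
    return minimize(obj, theta_k, constraints=[cons]).x

theta = np.zeros(3)
for k in range(1, 51):
    theta_hat = convex_surrogate_step(theta)
    step = 2.0 / (3.0 * k ** 0.75)          # diminishing step size
    theta = theta + step * (theta_hat - theta)
print("final iterate:", np.round(theta, 3))
```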
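
The experiment-setup row reports diminishing learning rates, a uniform initial-state distribution on [-1, 1]^15, and an all-zero initialization of F0. The snippet below only re-expresses those reported settings; the variable names, the shape of `F`, and the reading of the step sizes as 2/(3k^(3/4)) and 2/(3k^(2/3)) are assumptions for illustration.

```python
import numpy as np

state_dim = 15                     # x0 ~ Uniform([-1, 1]^15), as reported
rng = np.random.default_rng(0)

def learning_rates(k):
    # Diminishing step sizes as read from the experiment section.
    return 2.0 / (3.0 * k ** 0.75), 2.0 / (3.0 * k ** (2.0 / 3.0))

def sample_initial_state():
    return rng.uniform(-1.0, 1.0, size=state_dim)

# All-zero initialization of F0; the paper states only that F0 is all zeros,
# so the square shape here is an assumption.
F = np.zeros((state_dim, state_dim))

for k in range(1, 4):
    rate_a, rate_b = learning_rates(k)
    x0 = sample_initial_state()
    print(f"iter {k}: rates=({rate_a:.3f}, {rate_b:.3f}), x0[:3]={np.round(x0[:3], 2)}")
```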