Augmented Proximal Policy Optimization for Safe Reinforcement Learning
Authors: Juntao Dai, Jiaming Ji, Long Yang, Qian Zheng, Gang Pan
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our APPO methods in diverse safety-constrained tasks, setting a new state of the art compared with a comprehensive list of safe RL baselines. Extensive experiments verify the merits of our method in easy implementation, stable convergence, and precise cost control. |
| Researcher Affiliation | Academia | 1 The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China; 2 College of Computer Science and Technology, Zhejiang University, Hangzhou, China; 3 School of Artificial Intelligence, Peking University, Beijing, China |
| Pseudocode | Yes | We present the pseudo-code of APPO in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the described methodology. |
| Open Datasets | Yes | For a comprehensive evaluation, we select four representative tasks from three well-known safe RL benchmark environments (Safe MuJoCo (Zhang, Vuong, and Ross 2020), Safety Gym (Ray, Achiam, and Amodei 2019), and Bullet Safety Gym (Gronauer 2022)) as our experimental scenarios. |
| Dataset Splits | No | The paper does not provide specific details about train/validation/test dataset splits, such as percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch versions) needed for replication. |
| Experiment Setup | No | The paper describes general aspects of the training process and adaptive hyperparameter adjustment (e.g., for the penalty factor and the multiplier learning rate), but the main text does not give specific numerical values for common hyperparameters such as learning rate, batch size, or number of epochs. |