Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk
Authors: Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; 2 Peng Cheng Laboratory; 3 Tsinghua University-China Mobile Communications Group Co., Ltd. Joint Institute |
| Pseudocode | Yes | Algorithm 1: CVaR Proximal Policy Optimization (CPPO). (The standard CVaR definition and a minimal sketch of a CVaR-penalty step are given after this table.) |
| Open Source Code | No | The implementation of all code, including CPPO and baselines, is based on the codebase Spinning Up. (This indicates the authors built on an existing codebase rather than releasing their own code for this paper.) |
| Open Datasets | Yes | We choose MuJoCo [Todorov et al., 2012] as our experimental environment. As a robotic locomotion simulator, MuJoCo has lots of different continuous control tasks like Ant, HalfCheetah, Walker2d, Swimmer and Hopper, which are widely used for the evaluation of RL algorithms. (An environment-instantiation sketch follows the table.) |
| Dataset Splits | No | The paper describes training and evaluating performance but does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam' for optimization and that code is 'based on the codebase Spinning Up', but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The 'Experiment Setup' section (5.1) describes the general environments, baselines, and evaluation strategies, but it does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or specific system-level training configurations in the main text. |
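For reference on the quantity being constrained: CVaR at level α of a loss variable X is standardly written in the Rockafellar–Uryasev form below. This is the textbook definition, not quoted from the paper, and the paper's own sign and level conventions (rewards vs. losses) may differ.

```latex
\mathrm{CVaR}_{\alpha}(X) \;=\; \min_{\nu \in \mathbb{R}} \left\{ \nu + \frac{1}{1-\alpha}\,\mathbb{E}\!\left[(X-\nu)^{+}\right] \right\}
```

Intuitively, CVaR_α is the expected loss over the worst (1 − α) fraction of outcomes, which is what a sample-based estimator can approximate from rollout returns.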
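Building on that definition, here is a minimal, hypothetical sketch of the CVaR-penalty ingredient named in the Pseudocode row: an empirical CVaR estimate over per-episode costs plus a Lagrangian-style penalty when it exceeds a threshold. The helper name, the threshold `d`, and the multiplier `lam` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def empirical_cvar(losses: np.ndarray, alpha: float = 0.9) -> float:
    """Mean of the worst (1 - alpha) fraction of sampled losses.

    Hypothetical helper; the paper's estimator may differ.
    """
    var = np.quantile(losses, alpha)   # empirical Value-at-Risk threshold
    tail = losses[losses >= var]       # worst-case tail samples
    return float(tail.mean()) if tail.size else float(var)

# Penalize a PPO-style objective when the CVaR of per-episode cost
# exceeds a safety threshold d (Lagrangian-style relaxation).
rng = np.random.default_rng(0)
episode_costs = rng.normal(loc=1.0, scale=0.5, size=1000)  # placeholder rollouts
d, lam = 1.5, 0.1                                          # assumed threshold/multiplier
cvar = empirical_cvar(episode_costs, alpha=0.9)
penalty = lam * max(cvar - d, 0.0)
print(f"CVaR_0.9 = {cvar:.3f}, penalty term = {penalty:.3f}")
```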
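As noted in the Open Datasets row, the tasks are standard Gym/MuJoCo environments. The sketch below shows how they are typically instantiated; the `-v2` version suffixes and the classic 4-tuple `step` API are assumptions (the paper names only the task families), and newer `gymnasium` releases use a 5-tuple interface instead.

```python
import gym  # requires gym with the MuJoCo bindings installed

# Version suffixes are assumed; the paper names only the task families.
for env_id in ["Ant-v2", "HalfCheetah-v2", "Walker2d-v2", "Swimmer-v2", "Hopper-v2"]:
    env = gym.make(env_id)
    obs = env.reset()                                   # classic gym reset API
    obs, reward, done, info = env.step(env.action_space.sample())
    print(env_id, env.observation_space.shape, env.action_space.shape)
    env.close()
```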