Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

Authors: Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo."
Researcher Affiliation | Collaboration | 1) Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; 2) Peng Cheng Laboratory; 3) Tsinghua University-China Mobile Communications Group Co., Ltd. Joint Institute
Pseudocode | Yes | Algorithm 1: CVaR Proximal Policy Optimization (CPPO). (A hedged sketch of an empirical CVaR estimate appears after this table.)
Open Source Code | No | "The implementation of all code, including CPPO and baselines, are based on the codebase Spinning Up." (This indicates the authors built on an existing codebase; it does not mean they released code specific to this paper.)
Open Datasets | Yes | "We choose MuJoCo [Todorov et al., 2012] as our experimental environment. As a robotic locomotion simulator, MuJoCo has lots of different continuous control tasks like Ant, HalfCheetah, Walker2d, Swimmer and Hopper, which are widely used for the evaluation of RL algorithms."
Dataset Splits | No | The paper describes training and evaluation but does not specify explicit train/validation/test splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not report hardware details such as GPU models, CPU types, or memory used to run the experiments.
Software Dependencies | No | The paper mentions using Adam for optimization and that the code is "based on the codebase Spinning Up", but it does not give version numbers for these or any other software dependencies.
Experiment Setup | No | The 'Experiment Setup' section (5.1) describes the environments, baselines, and evaluation strategies, but it does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or system-level training configurations in the main text.
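The Pseudocode row above refers to Algorithm 1, CVaR Proximal Policy Optimization (CPPO), which constrains the CVaR of the return distribution. As a minimal, hypothetical sketch (not the authors' implementation, which this page notes was not released), the empirical lower-tail CVaR of a batch of episode returns can be estimated and turned into a soft penalty roughly as follows; the risk level `alpha`, the threshold, and the penalty weight are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Estimate the lower-tail CVaR of a batch of episode returns:
    the mean of the worst alpha-fraction of returns.
    `alpha` is an illustrative risk level, not a value from the paper."""
    returns = np.asarray(returns, dtype=np.float64)
    var = np.quantile(returns, alpha)       # Value-at-Risk: the alpha-quantile of returns
    tail = returns[returns <= var]          # worst-case tail of the return distribution
    return tail.mean() if tail.size > 0 else var

# Hypothetical usage: penalize a PPO-style objective when the tail return
# falls below a chosen threshold (threshold and weight are illustrative).
episode_returns = np.random.normal(loc=1000.0, scale=200.0, size=256)
cvar = empirical_cvar(episode_returns, alpha=0.1)
threshold, penalty_weight = 600.0, 1.0
penalty = penalty_weight * max(0.0, threshold - cvar)
print(f"CVaR_0.1 = {cvar:.1f}, constraint penalty = {penalty:.1f}")
```

In a full CPPO-style training loop, such a CVaR estimate would be computed from rollouts collected each iteration and used to constrain or penalize the policy update; the exact constrained optimization procedure is given in the paper's Algorithm 1.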