Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning
Authors: Yihang Yao, Zuxin Liu, Zhepeng Cen, Jiacheng Zhu, Wenhao Yu, Tingnan Zhang, Ding Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance, while preserving zero-shot adaptation capabilities to different constraint thresholds data-efficiently. This makes our approach suitable for real-world dynamic applications. |
| Researcher Affiliation | Collaboration | Yihang Yao¹, Zuxin Liu¹, Zhepeng Cen¹, Jiacheng Zhu¹,³, Wenhao Yu², Tingnan Zhang², Ding Zhao¹ (¹ Carnegie Mellon University, ² Google DeepMind, ³ Massachusetts Institute of Technology) |
| Pseudocode | Yes | The proposed CCPO for versatile safe RL and implementation details are summarized in Appendix C. |
| Open Source Code | No | No explicit statement about the availability of open-source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | The simulation environments are from a publicly available benchmark [56]. We consider two tasks (Run and Circle) and four robots (Ball, Car, Drone, and Ant) which have been used in many previous works as the testing ground [13, 15]. |
| Dataset Splits | No | The paper mentions 'fine-tuning stage of training' and evaluation on 'threshold conditions' but does not specify explicit dataset splits (e.g., percentages or counts for train/validation/test sets) for data partitioning. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup. |
| Software Dependencies | No | The paper refers to various RL algorithms and techniques but does not provide specific software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | We adopt the following experiment setting to address these questions. Task. The simulation environments are from a publicly available benchmark [56]... For all the results shown in sections 5.1 and 5.3, the behavior policy conditions are E = {20, 40, 60} and the threshold conditions for evaluation are set to {10, 15, ..., 70}. We take the average of the episodic reward (Avg. R) and constraint violation (Avg. CV) as the main comparison metrics. ... Each value is reported as mean ± standard deviation over 50 episodes and 5 seeds. (A minimal evaluation sketch based on this setup follows the table.) |
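
Based on the quoted setup, the sketch below shows one plausible way the evaluation metrics could be computed. Only the threshold grid {10, 15, ..., 70} and the 50-episode × 5-seed averaging come from the paper; the function names (`rollout`, `evaluate`), the stand-in reward/cost numbers, and the specific definition of constraint violation are hypothetical assumptions, not the authors' released implementation.

```python
import numpy as np

# Assumed evaluation protocol: sweep conditioning thresholds, roll out the
# threshold-conditioned policy, and average episodic reward / constraint violation.
EVAL_THRESHOLDS = list(range(10, 71, 5))   # {10, 15, ..., 70}, as reported
N_EPISODES, N_SEEDS = 50, 5                # 50 episodes x 5 seeds, as reported

def rollout(policy, threshold, rng):
    """Hypothetical single-episode rollout returning (episodic reward, episodic cost)."""
    # Stand-in numbers; a real rollout would step the simulator with `policy`
    # conditioned on `threshold`.
    return rng.normal(500.0, 50.0), rng.normal(threshold, 5.0)

def evaluate(policy):
    rewards, violations = [], []
    for threshold in EVAL_THRESHOLDS:
        for seed in range(N_SEEDS):
            rng = np.random.default_rng(seed)
            for _ in range(N_EPISODES):
                reward, cost = rollout(policy, threshold, rng)
                rewards.append(reward)
                # One plausible (assumed) definition of constraint violation:
                # episodic cost in excess of the conditioned threshold.
                violations.append(max(cost - threshold, 0.0))
    return np.mean(rewards), np.std(rewards), np.mean(violations)

avg_r, std_r, avg_cv = evaluate(policy=None)
print(f"Avg. R = {avg_r:.1f} ± {std_r:.1f}, Avg. CV = {avg_cv:.2f}")
```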