Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning

Authors: Yihang Yao, Zuxin Liu, Zhepeng Cen, Jiacheng Zhu, Wenhao Yu, Tingnan Zhang, Ding Zhao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance, while preserving zero-shot adaptation capabilities to different constraint thresholds data-efficiently. This makes our approach suitable for real-world dynamic applications.
Researcher Affiliation | Collaboration | Yihang Yao (1), Zuxin Liu (1), Zhepeng Cen (1), Jiacheng Zhu (1,3), Wenhao Yu (2), Tingnan Zhang (2), Ding Zhao (1); (1) Carnegie Mellon University, (2) Google DeepMind, (3) Massachusetts Institute of Technology
Pseudocode | Yes | The proposed CCPO for versatile safe RL and implementation details are summarized in Appendix C.
Open Source Code | No | The paper makes no explicit statement about the availability of open-source code for the described methodology and provides no direct link to a code repository.
Open Datasets | Yes | The simulation environments are from a publicly available benchmark [56]. We consider two tasks (Run and Circle) and four robots (Ball, Car, Drone, and Ant) which have been used in many previous works as the testing ground [13, 15].
Dataset Splits | No | The paper mentions a 'fine-tuning stage of training' and evaluation across 'threshold conditions' but does not specify explicit dataset splits (e.g., percentages or counts for train/validation/test sets).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup.
Software Dependencies | No | The paper refers to various RL algorithms and techniques but does not provide specific software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | We adopt the following experiment setting to address these questions. Task. The simulation environments are from a publicly available benchmark [56]... For all the results shown in sections 5.1 and 5.3, the behavior policy conditions are E = {20, 40, 60} and the threshold conditions for evaluation are set to be {10, 15, ..., 70}. We take the average of the episodic reward (Avg. R) and constraint violation (Avg. CV) as the main comparison metrics. ... Each value is reported as mean ± standard deviation for 50 episodes and 5 seeds.
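
The Experiment Setup and Open Datasets rows together pin down an evaluation protocol: eight robot-task environments, behavior conditions E = {20, 40, 60}, evaluation thresholds {10, 15, ..., 70}, and metrics reported as mean ± standard deviation over 50 episodes and 5 seeds. The Python sketch below lays out that grid and aggregation only; the environment naming, the `run_episode` rollout hook, and the definition of constraint violation as max(cost − threshold, 0) are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal evaluation-protocol sketch for the setup quoted above. The environment
# names, the rollout helper `run_episode`, and the CV definition are illustrative
# assumptions; only the grids and the mean +/- std aggregation mirror the quoted setup.
from itertools import product

import numpy as np

ROBOTS = ["Ball", "Car", "Drone", "Ant"]      # four robots from the benchmark [56]
TASKS = ["Run", "Circle"]                     # two tasks
BEHAVIOR_THRESHOLDS = [20, 40, 60]            # behavior policy conditions E (training-side, listed for reference)
EVAL_THRESHOLDS = list(range(10, 71, 5))      # evaluation threshold conditions {10, 15, ..., 70}
EPISODES, SEEDS = 50, 5                       # mean +/- std over 50 episodes x 5 seeds


def evaluate(run_episode, env_name, threshold):
    """Aggregate episodic reward and constraint violation for one env/threshold pair.

    `run_episode(env_name, threshold, seed, episode)` is a hypothetical hook that
    rolls out a threshold-conditioned policy and returns (episodic_reward, episodic_cost).
    """
    rewards, violations = [], []
    for seed in range(SEEDS):
        for episode in range(EPISODES):
            reward, cost = run_episode(env_name, threshold, seed, episode)
            rewards.append(reward)
            violations.append(max(cost - threshold, 0.0))  # assumed CV definition
    return (np.mean(rewards), np.std(rewards)), (np.mean(violations), np.std(violations))


if __name__ == "__main__":
    # Fake rollout so the sketch runs end to end; replace with real benchmark rollouts.
    rng = np.random.default_rng(0)
    fake_rollout = lambda env, thr, seed, ep: (rng.normal(300, 20), rng.normal(thr, 5))

    for robot, task in product(ROBOTS, TASKS):
        env_name = f"{robot}-{task}"          # illustrative naming only
        for threshold in EVAL_THRESHOLDS:
            (r_mu, r_std), (cv_mu, cv_std) = evaluate(fake_rollout, env_name, threshold)
            print(f"{env_name} @ {threshold}: Avg. R = {r_mu:.1f} ± {r_std:.1f}, "
                  f"Avg. CV = {cv_mu:.2f} ± {cv_std:.2f}")
```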