Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning

Authors: Yihang Yao, Zuxin Liu, Zhepeng Cen, Jiacheng Zhu, Wenhao Yu, Tingnan Zhang, Ding Zhao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance, while preserving zero-shot adaptation capabilities to different constraint thresholds data-efficiently. This makes our approach suitable for real-world dynamic applications.
Researcher Affiliation | Collaboration | Yihang Yao (1), Zuxin Liu (1), Zhepeng Cen (1), Jiacheng Zhu (1,3), Wenhao Yu (2), Tingnan Zhang (2), Ding Zhao (1); (1) Carnegie Mellon University, (2) Google DeepMind, (3) Massachusetts Institute of Technology
Pseudocode | Yes | The proposed CCPO for versatile safe RL and implementation details are summarized in Appendix C.
Open Source Code | No | The paper makes no explicit statement about the availability of open-source code for the described methodology and provides no direct link to a code repository.
Open Datasets | Yes | The simulation environments are from a publicly available benchmark [56]. We consider two tasks (Run and Circle) and four robots (Ball, Car, Drone, and Ant) which have been used in many previous works as the testing ground [13, 15].
Dataset Splits | No | The paper mentions a 'fine-tuning stage of training' and evaluation across 'threshold conditions' but does not specify explicit dataset splits (e.g., percentages or counts for train/validation/test sets).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup.
Software Dependencies | No | The paper refers to various RL algorithms and techniques but does not provide specific software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | We adopt the following experiment setting to address these questions. Task. The simulation environments are from a publicly available benchmark [56]... For all the results shown in sections 5.1 and 5.3, the behavior policy conditions are E = {20, 40, 60} and the threshold conditions for evaluation are set to be {10, 15, ..., 70}. We take the average of the episodic reward (Avg. R) and constraint violation (Avg. CV) as the main comparison metrics. ... Each value is reported as mean ± standard deviation for 50 episodes and 5 seeds.
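
The Experiment Setup and Open Datasets rows together pin down an evaluation protocol: eight robot-task environments, behavior conditions E = {20, 40, 60}, evaluation thresholds {10, 15, ..., 70}, and metrics reported as mean ± standard deviation over 50 episodes and 5 seeds. The Python sketch below lays out that grid and aggregation only; the environment naming, the `run_episode` rollout hook, and the definition of constraint violation as max(cost − threshold, 0) are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal evaluation-protocol sketch for the setup quoted above. The environment
# names, the rollout helper `run_episode`, and the CV definition are illustrative
# assumptions; only the grids and the mean +/- std aggregation mirror the quoted setup.
from itertools import product

import numpy as np

ROBOTS = ["Ball", "Car", "Drone", "Ant"]      # four robots from the benchmark [56]
TASKS = ["Run", "Circle"]                     # two tasks
BEHAVIOR_THRESHOLDS = [20, 40, 60]            # behavior policy conditions E (training-side, listed for reference)
EVAL_THRESHOLDS = list(range(10, 71, 5))      # evaluation threshold conditions {10, 15, ..., 70}
EPISODES, SEEDS = 50, 5                       # mean +/- std over 50 episodes x 5 seeds


def evaluate(run_episode, env_name, threshold):
    """Aggregate episodic reward and constraint violation for one env/threshold pair.

    `run_episode(env_name, threshold, seed, episode)` is a hypothetical hook that
    rolls out a threshold-conditioned policy and returns (episodic_reward, episodic_cost).
    """
    rewards, violations = [], []
    for seed in range(SEEDS):
        for episode in range(EPISODES):
            reward, cost = run_episode(env_name, threshold, seed, episode)
            rewards.append(reward)
            violations.append(max(cost - threshold, 0.0))  # assumed CV definition
    return (np.mean(rewards), np.std(rewards)), (np.mean(violations), np.std(violations))


if __name__ == "__main__":
    # Fake rollout so the sketch runs end to end; replace with real benchmark rollouts.
    rng = np.random.default_rng(0)
    fake_rollout = lambda env, thr, seed, ep: (rng.normal(300, 20), rng.normal(thr, 5))

    for robot, task in product(ROBOTS, TASKS):
        env_name = f"{robot}-{task}"          # illustrative naming only
        for threshold in EVAL_THRESHOLDS:
            (r_mu, r_std), (cv_mu, cv_std) = evaluate(fake_rollout, env_name, threshold)
            print(f"{env_name} @ {threshold}: Avg. R = {r_mu:.1f} ± {r_std:.1f}, "
                  f"Avg. CV = {cv_mu:.2f} ± {cv_std:.2f}")
```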