Constrained Variational Policy Optimization for Safe Reinforcement Learning

Authors: Zuxin Liu, Zhepeng Cen, Vladislav Isenbaev, Wei Liu, Steven Wu, Bo Li, Ding Zhao

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A wide range of experiments on continuous robotic tasks show that the proposed method achieves significantly better constraint satisfaction and better sample efficiency than baselines.
Researcher Affiliation | Collaboration | Carnegie Mellon University; Nuro Inc.; University of Illinois Urbana-Champaign. Correspondence to: Zuxin Liu <zuxinl@cmu.edu>, Ding Zhao <dingzhao@cmu.edu>.
Pseudocode | Yes | Algorithm 1: CVPO Training for One Epoch. (A structural sketch of this loop appears after the table.)
Open Source Code | Yes | The code is available at https://github.com/liuzuxin/cvpo-safe-rl.
Open Datasets | Yes | The task environment implementations are built upon Safety Gym (based on MuJoCo) (Ray et al., 2019) and its PyBullet implementation (Gronauer, 2022). (A minimal environment-setup sketch appears after the table.)
Dataset Splits | No | The paper describes reinforcement-learning experiments, which involve continuous environment interaction rather than static train/validation/test splits; no validation-split percentages or sample counts are given.
Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper mentions software such as Safety Gym, MuJoCo, and PyBullet, but does not list version numbers for its dependencies.
Experiment Setup | Yes | The hyperparameters are shown in Table 1; more details can be found in the code. (Table 1 is transcribed below; a config-style sketch appears after the table.)

Common hyperparameters:
Policy network sizes: [256, 256]
Q network sizes: [256, 256]
Network activation: ReLU
Discount factor γ: 0.99
Polyak weight ρ: 0.995
Batch size B: 300
Rollout trajectory number T: 20
Critics learning rate αc: 0.001
NN optimizer: Adam

CVPO hyperparameters:
Particle size K: 32
M-step iterations M: 6
Learning rate αµ: 1
Learning rate αΣ: 100
Learning rate αθ: 0.002
E-step KL threshold ϵ: 0.1
M-step KL threshold ϵµ: 0.001
M-step KL threshold ϵΣ: 0.0001
E-step solver: SLSQP
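For reference, here is a minimal, self-contained sketch of the E-step/M-step structure of Algorithm 1 (CVPO training for one epoch). It assumes a 1-D action space, stand-in critics, a single global Gaussian policy, and an illustrative cost budget COST_LIMIT; only the overall structure (a non-parametric E-step whose dual over (η, λ) is solved with SLSQP, followed by a weighted-maximum-likelihood M-step) mirrors the paper. Everything else is illustrative and is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Structural sketch of one CVPO epoch (Algorithm 1), NOT the authors' code.
# Toy setup: 1-D actions, random states, analytic stand-in critics.
rng = np.random.default_rng(0)
K = 32             # particle size (actions sampled per state)
EPS_KL = 0.1       # E-step KL threshold epsilon
COST_LIMIT = 25.0  # assumed cost budget d (illustrative value)

def q_reward(s, a):  # stand-in reward critic Q_r(s, a)
    return -(a - 0.5) ** 2

def q_cost(s, a):    # stand-in cost critic Q_c(s, a)
    return a ** 2

def e_step(states, mu, sigma):
    """Non-parametric E-step: sample K particles per state, then solve the
    dual over (eta, lam) with SLSQP to obtain particle weights."""
    actions = rng.normal(mu, sigma, size=(len(states), K))
    qr = np.array([[q_reward(s, a) for a in row] for s, row in zip(states, actions)])
    qc = np.array([[q_cost(s, a) for a in row] for s, row in zip(states, actions)])

    def dual(x):
        eta, lam = x
        adv = (qr - lam * qc) / max(eta, 1e-8)
        m = adv.max(axis=1, keepdims=True)
        # eta*(eps + E_s log E_a exp((Q_r - lam*Q_c)/eta)) + lam*d,
        # with a numerically stable log-sum-exp over the K particles
        lse = np.log(np.exp(adv - m).mean(axis=1)) + m.squeeze(axis=1)
        return eta * EPS_KL + eta * lse.mean() + lam * COST_LIMIT

    eta, lam = minimize(dual, x0=np.array([1.0, 1.0]), method="SLSQP",
                        bounds=[(1e-6, None), (0.0, None)]).x
    logits = (qr - lam * qc) / max(eta, 1e-8)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return actions, weights / weights.sum(axis=1, keepdims=True)

def m_step(actions, weights):
    """Weighted maximum-likelihood fit of the Gaussian policy to the
    variational distribution."""
    mu = (weights * actions).sum() / weights.sum()
    var = (weights * (actions - mu) ** 2).sum() / weights.sum()
    return mu, np.sqrt(var + 1e-8)

states = rng.normal(size=8)   # stand-in replay batch
mu, sigma = 0.0, 1.0          # current Gaussian policy parameters
actions, weights = e_step(states, mu, sigma)
mu, sigma = m_step(actions, weights)
print(f"updated policy: mu={mu:.3f}, sigma={sigma:.3f}")
```

Note that in the paper's actual M-step the mean and covariance updates are decoupled with separate KL trust regions (ϵµ and ϵΣ); the plain weighted fit above omits that projection for brevity.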
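Because the benchmark environments come from public packages, a setup sketch is straightforward. The environment ID, the info["cost"] key, and the old Gym step signature below are assumptions based on the public Safety Gym and Bullet-Safety-Gym releases, not details stated in the paper.

```python
import gym
import safety_gym  # registers the Safexp-* tasks (MuJoCo backend)
# import bullet_safety_gym  # alternatively registers the PyBullet tasks

# "Safexp-PointGoal1-v0" is one of the standard Safety Gym task IDs (assumed).
env = gym.make("Safexp-PointGoal1-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(reward, info.get("cost"))  # Safety Gym reports constraint cost via info
```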
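Finally, Table 1 can be captured as a configuration dict. The values below are taken from the table; the key names are illustrative and may not match how the released repository organizes its configs.

```python
# Table 1 transcribed as config dicts; values from the paper, key names assumed.
COMMON = dict(
    policy_hidden_sizes=[256, 256],
    q_hidden_sizes=[256, 256],
    activation="ReLU",
    gamma=0.99,               # discount factor
    polyak=0.995,             # target-network averaging weight rho
    batch_size=300,
    rollout_trajectories=20,  # trajectories collected per epoch
    critic_lr=1e-3,
    optimizer="Adam",
)
CVPO = dict(
    particle_size=32,         # K actions sampled per state in the E-step
    m_step_iterations=6,
    lr_mu=1.0,                # M-step mean learning rate
    lr_sigma=100.0,           # M-step covariance learning rate
    lr_theta=2e-3,
    e_step_kl_threshold=0.1,
    m_step_kl_mu=1e-3,
    m_step_kl_sigma=1e-4,
    e_step_solver="SLSQP",
)
```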