Constrained Variational Policy Optimization for Safe Reinforcement Learning
Authors: Zuxin Liu, Zhepeng Cen, Vladislav Isenbaev, Wei Liu, Steven Wu, Bo Li, Ding Zhao
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A wide range of experiments on continuous robotic tasks shows that the proposed method achieves significantly better constraint satisfaction performance and better sample efficiency than baselines. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Nuro Inc., ³University of Illinois Urbana-Champaign. Correspondence to: Zuxin Liu <zuxinl@cmu.edu>, Ding Zhao <dingzhao@cmu.edu>. |
| Pseudocode | Yes | Algorithm 1 CVPO Training for One Epoch |
| Open Source Code | Yes | The code is available at https://github.com/liuzuxin/cvpo-safe-rl. |
| Open Datasets | Yes | The task environment implementations are built upon Safety Gym (based on Mujoco) (Ray et al., 2019) and its PyBullet implementation (Gronauer, 2022). |
| Dataset Splits | No | The paper describes experiments in reinforcement learning environments, which typically involve continuous interaction rather than static train/validation/test dataset splits. No explicit percentage or sample counts for validation splits are provided. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions software such as Safety Gym, Mujoco, and PyBullet, but does not provide explicit version numbers for these dependencies. |
| Experiment Setup | Yes | The hyperparameters are shown in Table 1; more details can be found in the code. Common hyperparameters: policy network sizes [256, 256]; Q network sizes [256, 256]; network activation ReLU; discount factor γ: 0.99; Polyak weight ρ: 0.995; batch size B: 300; rollout trajectory number T: 20; critic learning rate αc: 0.001; NN optimizer Adam. CVPO hyperparameters: particle size K: 32; M-step iterations M: 6; learning rate αµ: 1; learning rate αΣ: 100; learning rate αθ: 0.002; E-step KL threshold ϵ: 0.1; M-step KL threshold ϵµ: 0.001; M-step KL threshold ϵΣ: 0.0001; E-step solver: SLSQP. |
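
For readers who want a concrete starting point when re-running the experiments, the hyperparameters quoted above can be gathered into a single configuration object. The sketch below is illustrative only: the key names and the dictionary-based layout are assumptions, and they may not match the configuration format actually used in the released cvpo-safe-rl repository.

```python
# Illustrative hyperparameter configuration assembled from Table 1 of the paper.
# Key names are hypothetical; consult https://github.com/liuzuxin/cvpo-safe-rl
# for the exact configuration format the authors use.
cvpo_config = {
    # Common hyperparameters
    "policy_hidden_sizes": [256, 256],
    "q_hidden_sizes": [256, 256],
    "activation": "relu",
    "gamma": 0.99,               # discount factor γ
    "polyak": 0.995,             # Polyak averaging weight ρ for target networks
    "batch_size": 300,
    "rollout_trajectories": 20,  # rollout trajectory number T
    "critic_lr": 1e-3,           # critic learning rate αc
    "optimizer": "adam",
    # CVPO-specific hyperparameters
    "particle_size": 32,         # K: sampled actions per state in the E-step
    "m_step_iterations": 6,
    "lr_mu": 1.0,                # learning rate αµ (Table 1)
    "lr_sigma": 100.0,           # learning rate αΣ (Table 1)
    "lr_policy": 2e-3,           # learning rate αθ
    "e_step_kl": 0.1,            # E-step KL threshold
    "m_step_kl_mu": 1e-3,        # M-step KL threshold on the mean
    "m_step_kl_sigma": 1e-4,     # M-step KL threshold on the covariance
    "e_step_solver": "SLSQP",    # constrained solver used for the E-step
}
```

Collecting the values in one place like this makes it straightforward to diff a local reproduction attempt against the paper's reported settings.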