Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation

Authors: Juntao Dai, Yaodong Yang, Qian Zheng, Gang Pan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results reveal that CGPO, unlike baseline algorithms, successfully estimates the constraint functions of subsequent policies, thereby ensuring the efficiency and feasibility of each update.
Researcher Affiliation | Academia | Juntao Dai (1, 2), Yaodong Yang (3), Qian Zheng (1, 2), Gang Pan (1, 2). (1) College of Computer Science and Technology, Zhejiang University, Hangzhou, China; (2) The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China; (3) Center for AI Safety and Governance, Peking University, Beijing, China.
Pseudocode | Yes | Algorithm 1 Constrained Gradient-based Policy Optimization (CGPO) ... Algorithm 2 Dual variable solver in the case of solvability. ... Algorithm 3 Model-based Constrained Gradient-based Policy Optimization (MB-CGPO)
Open Source Code | No | The paper does not provide a direct link to the source code for the proposed method nor explicitly state its release for the work described.
Open Datasets | No | We develop a series of constrained differentiable tasks on an open-source differentiable physics engine Brax (Freeman et al., 2021). These tasks are based on four differentiable robotic control tasks in Brax (Cart Pole, Reacher, Half Cheetah, and Ant), with the addition of two common constraints: limiting position (Achiam et al., 2017; Ji et al., 2023b) and limiting velocity (Zhang et al., 2020).
Dataset Splits | No | The paper mentions 'num epochs', 'episode length', and 'mini batch size' in its hyperparameters (Table 4, Section F.3), but does not specify dataset splits (e.g., 80/10/10) for training, validation, or testing.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using 'Brax' as a differentiable physics engine and implicitly uses frameworks for 'deep Safe RL' (likely PyTorch or TensorFlow), but it does not specify version numbers for any of the software dependencies.
Experiment Setup | Yes | Table 4. Hyper-parameters for CGPO in different tasks. ... num epochs 100 ... num envs 128 ... episode length 300 ... delta init 1e-3 ... critic learning rate 1e-3 ... short horizon 10 ... mini batch size 64 ...
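
Since no source code is released, a reader reproducing the setup would have to record the quoted Table 4 values somewhere. The sketch below shows one possible way to do that in Python. The class name `CGPOConfig`, the field names, and the `validate` checks are hypothetical and do not come from the paper. The excerpt also elides part of Table 4 and does not say which task each value belongs to, so the defaults should be read as the quoted numbers only, not as a complete or task-specific configuration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CGPOConfig:
    """Hypothetical container for the Table 4 values quoted above.

    Only the hyper-parameters visible in the excerpt are listed; the
    elided entries of Table 4 are deliberately not filled in.
    """
    num_epochs: int = 100
    num_envs: int = 128
    episode_length: int = 300
    delta_init: float = 1e-3
    critic_learning_rate: float = 1e-3
    short_horizon: int = 10
    mini_batch_size: int = 64

    def validate(self) -> None:
        # Assumed sanity checks, not requirements stated in the paper.
        for name, value in asdict(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")
        if self.short_horizon > self.episode_length:
            raise ValueError("short_horizon cannot exceed episode_length")


config = CGPOConfig()
config.validate()
```

Freezing the dataclass keeps a run's settings immutable once created, which makes it easier to log exactly what was used; a variant for another task could be built with `dataclasses.replace(config, ...)`.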