Reachability Constrained Reinforcement Learning

Authors: Dongjie Yu, Haitong Ma, Shengbo Li, Jianyu Chen

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on different benchmarks validate the learned feasible set, the policy performance, and constraint satisfaction of RCRL, compared to CRL and safe control baselines.
Researcher Affiliation | Collaboration | (1) School of Vehicle and Mobility, Tsinghua University, Beijing, China; (2) John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA; (3) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; (4) Shanghai Qizhi Institute, Shanghai, China.
Pseudocode | Yes | Algorithm 1 provides the pseudo-code of an actor-critic version of RCRL. A policy-gradient version of RCRL is designed similarly in Algorithm 2. (A hedged actor-critic sketch follows the table.)
Open Source Code | No | The paper does not contain an explicit statement or link indicating the public release of the source code for the methodology described.
Open Datasets | Yes | Benchmarks. We implement both on- and off-policy RCRL and compare them with different CRL baselines. Experiments (1) use the double-integrator (Fisac et al., 2019), which has an analytical solution, to check the correctness of the feasible set learned by RCRL; (2) validate the scalability of RCRL to nonlinear control problems, specifically a 2D quadrotor trajectory-tracking task in safe-control-gym (Yuan et al., 2021); and (3) evaluate on the classical safe-learning benchmark Safety-Gym (Achiam & Amodei, 2019). (A minimal Safety-Gym interaction sketch follows the table.)
Dataset Splits | No | The paper describes training and evaluation procedures, including averaging results over runs and specific initialization for evaluation, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts, as is typical for static datasets.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, SAC, PPO, and multi-layer perceptrons, but it does not provide specific version numbers for these dependencies or name the programming language used.
Experiment Setup | Yes | Table 1 and Table 2 provide detailed hyperparameters for both off-policy and on-policy algorithms, including optimizer settings (Adam β1, β2), network architecture (number of hidden layers, neurons), learning rates, discount factors, batch sizes, and more. Appendix D.1 also specifies initialization ranges for variables in the quadrotor experiment. (A placeholder configuration skeleton follows the table.)
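
For the Pseudocode row: the paper's Algorithm 1 is not reproduced here, so the following is only a hedged sketch of what an actor-critic RCRL-style update can look like. It assumes PyTorch, a deterministic actor, a reward critic, a safety critic regressed toward a reachability-style target max(h(s), Vh(s', a')) (the worst constraint value along the trajectory), and a state-wise Lagrange multiplier network. All module names, shapes, and hyperparameters are assumptions, not the authors' code.

```python
# Hedged sketch of an actor-critic RCRL-style update (NOT the authors'
# Algorithm 1; every name, shape, and constant below is an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


obs_dim, act_dim = 8, 2                                  # placeholder sizes
actor = mlp([obs_dim, 256, 256, act_dim])                # deterministic actor (sketch)
q_critic = mlp([obs_dim + act_dim, 256, 256, 1])         # reward critic Q(s, a)
safety_critic = mlp([obs_dim + act_dim, 256, 256, 1])    # reachability value Vh(s, a)
multiplier = mlp([obs_dim, 256, 256, 1])                 # state-wise multiplier net


def rcrl_losses(s, a, r, h, s_next, done, gamma=0.99):
    """All arguments are float tensors of shape (batch, 1) except s, a, s_next;
    h is the constraint value at s (h <= 0 means the constraint is satisfied)."""
    with torch.no_grad():
        a_next = actor(s_next)
        sa_next = torch.cat([s_next, a_next], dim=-1)
        # Reward critic: ordinary Bellman target.
        q_target = r + gamma * (1.0 - done) * q_critic(sa_next)
        # Safety critic: reachability-style self-consistency target; at
        # terminal transitions fall back to h(s) itself.
        vh_next = safety_critic(sa_next)
        vh_target = torch.maximum(h, torch.where(done > 0, h, vh_next))

    sa = torch.cat([s, a], dim=-1)
    q_loss = F.mse_loss(q_critic(sa), q_target)
    vh_loss = F.mse_loss(safety_critic(sa), vh_target)

    # Actor: maximize reward while penalizing predicted constraint violation,
    # weighted by a state-wise multiplier lambda(s) >= 0 (Lagrangian-style).
    a_pi = actor(s)
    sa_pi = torch.cat([s, a_pi], dim=-1)
    lam = F.softplus(multiplier(s))
    actor_loss = (-q_critic(sa_pi) + lam.detach() * safety_critic(sa_pi)).mean()
    # Multiplier: gradient ascent on the constraint term, so lambda grows
    # wherever the safety critic still predicts violation (Vh > 0).
    lam_loss = -(lam * safety_critic(sa_pi).detach()).mean()
    return q_loss, vh_loss, actor_loss, lam_loss
```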
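
For the Open Datasets row: a minimal interaction sketch for Safety-Gym, assuming the openai/safety-gym package and the classic four-tuple gym step API. The task id is an illustrative example; the specific Safety-Gym tasks evaluated in the paper are not listed in the row above.

```python
# Minimal Safety-Gym usage sketch; the task id below is illustrative only.
import gym
import safety_gym  # noqa: F401  (importing registers the Safexp-* tasks)

env = gym.make("Safexp-PointGoal1-v0")
obs = env.reset()
done, ep_return, ep_cost = False, 0.0, 0.0
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    ep_return += reward
    ep_cost += info.get("cost", 0.0)   # per-step constraint-violation cost
print(f"return={ep_return:.2f}, cumulative cost={ep_cost:.2f}")
```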
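
For the Experiment Setup row: an illustrative configuration skeleton whose field names mirror the hyperparameter categories that row mentions (optimizer settings, network width and depth, learning rates, discount factor, batch size). Every value is a placeholder, not a number reported in the paper's Table 1, Table 2, or Appendix D.1.

```python
# Placeholder off-policy RCRL-style configuration; values are illustrative
# defaults, NOT the hyperparameters reported in the paper.
off_policy_rcrl_config = {
    "optimizer": "Adam",
    "adam_beta1": 0.9,        # placeholder
    "adam_beta2": 0.999,      # placeholder
    "hidden_layers": 2,       # placeholder MLP depth
    "hidden_units": 256,      # placeholder neurons per hidden layer
    "actor_lr": 3e-4,         # placeholder learning rate
    "critic_lr": 3e-4,        # placeholder learning rate
    "multiplier_lr": 1e-4,    # placeholder learning rate
    "discount_gamma": 0.99,   # placeholder discount factor
    "batch_size": 256,        # placeholder
}
```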