Reachability Constrained Reinforcement Learning
Authors: Dongjie Yu, Haitong Ma, Shengbo Eben Li, Jianyu Chen
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on different benchmarks validate the learned feasible set, the policy performance, and constraint satisfaction of RCRL, compared to CRL and safe control baselines. |
| Researcher Affiliation | Collaboration | (1) School of Vehicle and Mobility, Tsinghua University, Beijing, China; (2) John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA; (3) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; (4) Shanghai Qizhi Institute, Shanghai, China. |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of an actor-critic version of RCRL. A policy-gradient version of RCRL is designed similarly in Algorithm 2. (A hedged sketch of what such an actor-critic update looks like is given after this table.) |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating the public release of the source code for the methodology described. |
| Open Datasets | Yes | Benchmarks. We implement both on- and off-policy RCRL and compare them with different CRL baselines. The experiments (1) use the double integrator (Fisac et al., 2019), which has an analytical solution, to check the correctness of the feasible set learned by RCRL; (2) validate the scalability of RCRL to nonlinear control problems, specifically a 2D quadrotor trajectory-tracking task in safe-control-gym (Yuan et al., 2021); and (3) evaluate on the classical safe-learning benchmark Safety-Gym (Achiam & Amodei, 2019). (An illustrative environment-setup snippet appears after this table.) |
| Dataset Splits | No | The paper describes training and evaluation procedures, including averaging results over runs and specific initialization for evaluation, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts, as is typical for static datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as the 'Adam' optimizer, 'SAC', 'PPO', and a 'Multi-layer Perceptron', but it does not provide version numbers for these dependencies or name the programming language used. |
| Experiment Setup | Yes | Table 1 and Table 2 provide detailed hyperparameters for both off-policy and on-policy algorithms, including optimizer settings (Adam β1, β2), network architecture (number of hidden layers, neurons), learning rates, discount factors, batch sizes, and more. Appendix D.1 also specifies initialization ranges for variables in the quadrotor experiment. |
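Since no source code is released, the following is a minimal PyTorch-style sketch of the kind of statewise-Lagrangian actor-critic update that the paper's Algorithm 1 describes: a reward critic, a reachability (safety) critic, a policy, and a statewise multiplier network. The network names, shapes, and hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an RCRL-style actor/multiplier step; all names and values are assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim, hidden = 8, 2, 256

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

reward_critic = mlp(obs_dim + act_dim, 1)   # Q_r(s, a): expected return
safety_critic = mlp(obs_dim + act_dim, 1)   # Q_h(s, a): reachability (worst-case constraint) value
actor = mlp(obs_dim, act_dim)               # deterministic policy, for brevity
multiplier = mlp(obs_dim, 1)                # statewise Lagrange multiplier lambda(s) >= 0

opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4, betas=(0.9, 0.999))
opt_lam = torch.optim.Adam(multiplier.parameters(), lr=3e-4)

def actor_and_multiplier_step(obs):
    """One gradient step of the min-max Lagrangian objective (sketch only)."""
    act = torch.tanh(actor(obs))
    q_r = reward_critic(torch.cat([obs, act], dim=-1))
    q_h = safety_critic(torch.cat([obs, act], dim=-1))
    lam = nn.functional.softplus(multiplier(obs))      # keep lambda(s) non-negative

    # Policy descends on -return + lambda(s) * constraint value (constraint Q_h <= 0).
    # (A full implementation would also zero/skip the critic gradients produced here.)
    actor_loss = (-q_r + lam.detach() * q_h).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Multiplier ascends on the statewise constraint term lambda(s) * Q_h(s, pi(s)).
    lam = nn.functional.softplus(multiplier(obs))
    lam_loss = -(lam * q_h.detach()).mean()
    opt_lam.zero_grad(); lam_loss.backward(); opt_lam.step()
```

The critic updates are omitted; in the paper the safety critic is trained with a reachability-style self-consistency backup rather than a summed-cost Bellman backup, which is the key difference from standard Lagrangian CRL.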
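The Open Datasets row names three public benchmarks. As an illustration of how the Safety-Gym one is typically instantiated, here is a small rollout loop using the standard `safety_gym`/`gym` interface; the task id is a placeholder, since the report does not list which specific Safety-Gym tasks the paper evaluates on.

```python
# Illustrative Safety-Gym setup; the task id below is a placeholder, not taken from the paper.
import gym
import safety_gym  # registers the Safexp-* environments with gym

env = gym.make("Safexp-PointGoal1-v0")
obs = env.reset()
done = False
episode_cost = 0.0
while not done:
    act = env.action_space.sample()            # random actions just to show the loop
    obs, reward, done, info = env.step(act)
    episode_cost += info.get("cost", 0.0)      # Safety-Gym reports constraint cost in info
print("total constraint cost this episode:", episode_cost)
```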