Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data
Authors: Zuxin Liu, Zijian Guo, Zhepeng Cen, Huan Zhang, Yihang Yao, Hanjiang Hu, Ding Zhao
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments on multiple robot platforms show the efficiency of SAFER in learning a robust and safe policy: achieving the same reward with far fewer constraint violations during training than on-policy baselines." Section 5 (Experiment) adds: "We consider two tasks (Run and Circle) and four robots (Ball, Car, Drone, and Ant) which have been used in many previous works as the testing ground (Achiam et al., 2017; Chow et al., 2019)." |
| Researcher Affiliation | Academia | 1Carnegie Mellon University, PA, USA. |
| Pseudocode | Yes | "Algo. 1 highlights the key steps of training the policy" (referring to Algorithm 1, SAFER Algorithm). The paper also presents Algorithm 2 (SAFER Algorithm), Algorithm 3 (MC and MR attacker), Algorithm 4 (SA-PPO-Lagrangian Algorithm), Algorithm 5 (ADV-PPOL Algorithm), and Algorithm 6 (CVPO Algorithm). |
| Open Source Code | No | The paper does not provide concrete access to source code for its methodology. It only mentions 'Video demos can be found in our website: https://sites.google.com/view/saferrl/home.', which points to demo videos, not code. |
| Open Datasets | Yes | The simulation environments are from a publicly available benchmark (Gronauer, 2022); see the environment-construction sketch after this table. |
| Dataset Splits | No | The paper describes sampling transitions from a replay buffer during training (see the buffer sketch after this table) but does not provide train/validation/test dataset splits (percentages, sample counts, or references to predefined splits) of the kind a supervised learning paper would report. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | Section B.3 (Experiment Setting and Hyper-parameters) and Table 4 ("Hyperparameters for on-policy baselines (left) and off-policy baselines (right)") list specific hyperparameters such as training epochs, batch size, particle size, M-step iterations, cost limit, perturbation ϵ, KL thresholds, and learning rates; an illustrative config sketch follows the table. |
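
As a quick orientation for the environments referenced in the Open Datasets row, the sketch below builds the Run/Circle task-robot combinations from the Bullet-Safety-Gym benchmark (Gronauer, 2022). It is not code from the paper: the `Safety{Robot}{Task}-v0` id pattern, the `bullet_safety_gym` import name, the pre-gymnasium 4-tuple step API, and the `cost` key in the step info dict are assumptions about the benchmark and may differ by version.

```python
# Minimal sketch: instantiating the Run/Circle tasks for the four robots
# from Bullet-Safety-Gym (Gronauer, 2022). The env id pattern, the import
# name, and the "cost" key in the info dict are assumptions, not facts
# taken from the paper, and may vary across benchmark versions.
import gym
import bullet_safety_gym  # noqa: F401  (assumed to register the Safety* envs)

ROBOTS = ["Ball", "Car", "Drone", "Ant"]
TASKS = ["Run", "Circle"]

for robot in ROBOTS:
    for task in TASKS:
        env_id = f"Safety{robot}{task}-v0"   # assumed naming convention
        env = gym.make(env_id)
        obs = env.reset()
        total_cost = 0.0
        for _ in range(10):                  # short random rollout
            obs, reward, done, info = env.step(env.action_space.sample())
            total_cost += info.get("cost", 0.0)
            if done:
                break
        env.close()
        print(f"{env_id}: cumulative cost over 10 random steps = {total_cost:.2f}")
```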
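
The Dataset Splits row notes that training samples transitions from a replay buffer rather than from fixed splits. As a point of reference, a minimal generic buffer for safe RL transitions might look like the following; it is an illustrative sketch, not the paper's implementation, and the extra cost signal alongside the reward simply reflects the constrained-RL setting.

```python
# Illustrative replay buffer for safe RL: stores (obs, action, reward, cost,
# next_obs, done) transitions and samples uniform minibatches. Generic sketch,
# not the buffer used in the paper.
import numpy as np

class ReplayBuffer:
    def __init__(self, obs_dim, act_dim, capacity=1_000_000):
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros(capacity, dtype=np.float32)
        self.cost = np.zeros(capacity, dtype=np.float32)   # constraint signal
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.ptr, self.size = capacity, 0, 0

    def add(self, obs, act, rew, cost, next_obs, done):
        i = self.ptr
        self.obs[i], self.act[i] = obs, act
        self.rew[i], self.cost[i] = rew, cost
        self.next_obs[i], self.done[i] = next_obs, done
        self.ptr = (self.ptr + 1) % self.capacity      # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size=256):
        idx = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs[idx], act=self.act[idx], rew=self.rew[idx],
                    cost=self.cost[idx], next_obs=self.next_obs[idx],
                    done=self.done[idx])
```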
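
Finally, the Experiment Setup row points to Table 4's hyperparameters. The config sketch below only mirrors the parameter names mentioned in that row; every numeric value is a placeholder for illustration and is not a value reported in the paper.

```python
# Illustrative experiment config mirroring the hyperparameter names cited from
# Table 4. All numeric values are placeholders, NOT the paper's values.
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    training_epochs: int = 100        # placeholder
    batch_size: int = 256             # placeholder
    particle_size: int = 32           # placeholder
    m_step_iterations: int = 10       # placeholder
    cost_limit: float = 25.0          # placeholder constraint threshold
    perturbation_eps: float = 0.05    # placeholder (perturbation ϵ in Table 4)
    kl_threshold: float = 0.01        # placeholder trust-region bound
    actor_lr: float = 3e-4            # placeholder
    critic_lr: float = 1e-3           # placeholder

config = ExperimentConfig()
print(config)
```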