A CMDP-within-online framework for Meta-Safe Reinforcement Learning
Authors: Vanshaj Khattar, Yuhao Ding, Bilgehan Sel, Javad Lavaei, Ming Jin
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experiments are conducted to demonstrate the effectiveness of our approach. (Abstract) In this section, we show the effectiveness of the proposed Meta-SRL framework and compare it with the following baselines: (Section 4) |
| Researcher Affiliation | Academia | Vanshaj Khattar Virginia Tech Blacksburg, VA 24061 vanshajk@vt.edu Yuhao Ding UC Berkeley Berkeley, CA 94709 yuhao_ding@berkeley.edu Bilgehan Sel Virginia Tech Blacksburg, VA 24061 bsel@vt.edu Javad Lavaei UC Berkeley Berkeley, CA 94709 lavaei@berkeley.edu Ming Jin Virginia Tech Blacksburg, VA 24061 jinming@vt.edu |
| Pseudocode | Yes | Algorithm 1: Inexact CMDP-within-online framework (exemplified with CRPO (Xu et al., 2021) as the within-task safe RL algorithm) (Page 4) |
| Open Source Code | No | The paper contains no explicit statement that code for its methodology is released, and no link to a repository. It mentions only obtaining code for a third-party algorithm (CRPO) from one of that algorithm's authors. (Section 7) |
| Open Datasets | Yes | We consider the Frozen lake, Acrobot, HalfCheetah, and Humanoid environments from the OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) under constrained settings. (Section 4) |
| Dataset Splits | No | No explicit training/validation/test dataset splits (percentages or counts) or cross-validation setup are provided. It only mentions running experiments on a 'test task' and training for a certain number of steps on it. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud computing instances) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions software such as OpenAI Gym and MuJoCo, and a specific algorithm, CRPO, but provides no version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We train for 8 steps on the Frozen lake and for 5 steps on the Acrobot. In Frozen lake, each step corresponds to 5 episodes... (Section H) We choose the constraint threshold dt,i = 0.3. (Section H.1) The changes in these quantities were done by adding noise to the default quantities. We considered a Gaussian noise with a low variance of 0.1 to change the tasks only slightly. (Section H.2) |
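The Experiment Setup row quotes a task-generation procedure: new tasks are created by adding low-variance Gaussian noise (variance 0.1, Section H.2) to an environment's default quantities. A minimal sketch of that step is below; the function and parameter names are assumptions for illustration, not the authors' code.

```python
import random

def perturb_task_params(default_params, variance=0.1, seed=0):
    """Sketch only: generate a task variant by adding Gaussian noise
    (variance 0.1 per Section H.2) to default environment quantities.
    Names and structure are illustrative assumptions."""
    rng = random.Random(seed)
    std = variance ** 0.5  # random.gauss takes a standard deviation
    return {k: v + rng.gauss(0.0, std) for k, v in default_params.items()}

# Hypothetical default quantities for a constrained control task
defaults = {"link_mass": 1.0, "link_length": 1.0}
task = perturb_task_params(defaults)
```

With a fixed seed the perturbation is deterministic, so a sweep of seeds reproduces a fixed family of "slightly changed" tasks, matching the paper's description of changing tasks only slightly.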