A CMDP-within-online framework for Meta-Safe Reinforcement Learning

Authors: Vanshaj Khattar, Yuhao Ding, Bilgehan Sel, Javad Lavaei, Ming Jin

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, experiments are conducted to demonstrate the effectiveness of our approach." (Abstract) "In this section, we show the effectiveness of the proposed Meta-SRL framework and compare it with the following baselines:" (Section 4)
Researcher Affiliation | Academia | Vanshaj Khattar, Virginia Tech, Blacksburg, VA 24061, vanshajk@vt.edu; Yuhao Ding, UC Berkeley, Berkeley, CA 94709, yuhao_ding@berkeley.edu; Bilgehan Sel, Virginia Tech, Blacksburg, VA 24061, bsel@vt.edu; Javad Lavaei, UC Berkeley, Berkeley, CA 94709, lavaei@berkeley.edu; Ming Jin, Virginia Tech, Blacksburg, VA 24061, jinming@vt.edu
Pseudocode | Yes | "Algorithm 1: Inexact CMDP-within-online framework (exemplified with CRPO (Xu et al., 2021) as the within-task safe RL algorithm)" (Page 4)
Open Source Code | No | There is no explicit statement that the authors provide open-source code for their methodology, nor a link to a code repository. The paper only mentions obtaining code for a third-party algorithm (CRPO) from one of its authors. (Section 7)
Open Datasets | Yes | "We consider the FrozenLake, Acrobot, Half-Cheetah, and Humanoid environments from the OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) under constrained settings." (Section 4)
Dataset Splits | No | No explicit training/validation/test dataset splits (percentages or counts) or cross-validation setup is provided; the paper only mentions running experiments on a "test task" and training on it for a certain number of steps.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud computing instances) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions software such as OpenAI Gym and MuJoCo, and a specific algorithm (CRPO), but provides no version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We train for 8 steps on the FrozenLake and for 5 steps on the Acrobot. In FrozenLake, each step corresponds to 5 episodes..." (Section H) "We choose the constraint threshold d_{t,i} = 0.3." (Section H.1) "The changes in these quantities were done by adding noise to the default quantities. We considered a Gaussian noise with a low variance of 0.1 to change the tasks only slightly." (Section H.2)