Robust Policy Learning over Multiple Uncertainty Sets

Authors: Annie Xie, Shagun Sodhani, Chelsea Finn, Joelle Pineau, Amy Zhang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design several experiments to understand the effectiveness of our proposed approach compared to system identification and robust RL approaches in unseen environments.
Researcher Affiliation | Collaboration | Stanford University; Facebook AI Research.
Pseudocode | Yes | Algorithm 1: System Identification and Risk-Sensitive Adaptation (SIRSA). A hedged sketch of the adaptation loop appears below the table.
Open Source Code | Yes | Code and videos of our results are on our webpage: https://sites.google.com/view/sirsa-public/home
Open Datasets | Yes | Half-Cheetah (Brockman et al., 2016); Peg Insertion (Zhao et al., 2020; Schoettler et al., 2020). We design several environments to evaluate our approach, and in each, vary one or more parameters that affect the dynamics and/or reward function (an illustrative sketch follows the table).
Dataset Splits | No | The paper describes its training process using replay buffers and test-time evaluation, but does not explicitly mention the use of a separate validation set or validation split for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) and REDQ, which are algorithms/frameworks, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | System identification model: we train an ensemble of B = 4 models, which are MLPs with 2 fully-connected layers of size 64 in the Point Mass domain and 2 fully-connected layers of size 256 in all other domains. Policy and critic networks: the policy and critic networks are MLPs with 2 fully-connected layers of size 64 in the Point Mass domain and 2 fully-connected layers of size 256 in all other domains. CVaR approximation: in our experiments, we use N = 50 CVaR samples to approximate the gradient of the CVaR (a sketch of this sample-based approximation follows the table). Training phases: in Point Mass, we optimize the SAC objectives for 25K iterations and then optimize the CVaR for another 25K iterations, for a total of 50K training iterations. In the Minitaur and Peg Insertion domains, we pre-train for 150K iterations and then optimize the CVaR for 150K iterations, for a total of 300K. In Half-Cheetah, the pre-training is 2.5M steps and the CVaR optimization is 0.5M steps, for a total of 3M steps.
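
The Pseudocode row above cites Algorithm 1 (SIRSA) but the report does not reproduce it. The Python sketch below illustrates one plausible reading of the test-time adaptation loop, based only on the details quoted in the table (an ensemble of B = 4 system identification models and a policy trained against a CVaR objective). The interfaces `ensemble`, `m.prior`, `m.predict`, and `policy.act` are hypothetical names for illustration, not the authors' API.

```python
import numpy as np

def sirsa_adaptation_episode(env, ensemble, policy, horizon=1000):
    """Hypothetical sketch of test-time adaptation in the spirit of Algorithm 1.

    `ensemble` is a list of B system-identification models, each mapping a
    window of recent transitions to a prediction of the environment
    parameters; `policy` maps (state, uncertainty_set) to an action.
    Both interfaces are assumptions, not the paper's exact implementation.
    """
    obs = env.reset()
    transitions = []  # recent (s, a, s') tuples used for system identification
    # Before any data is collected, fall back to a prior over the parameters.
    uncertainty_set = np.stack([m.prior() for m in ensemble])

    for _ in range(horizon):
        # The policy is conditioned on the current uncertainty set, so it can
        # act conservatively when the ensemble members disagree.
        action = policy.act(obs, uncertainty_set)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, next_obs))

        # Each ensemble member predicts the environment parameters from the
        # recent transitions; the set of B predictions is the uncertainty set.
        uncertainty_set = np.stack([m.predict(transitions) for m in ensemble])

        obs = next_obs
        if done:
            break
```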
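The Experiment Setup row states that N = 50 CVaR samples are used to approximate the gradient of the CVaR. Below is a minimal PyTorch sketch of one common sample-based CVaR estimator, assuming the CVaR is taken over return estimates for parameters drawn from the uncertainty set; the `alpha` level, the sampler, and the `critic` call in the usage comment are illustrative assumptions, not values taken from the paper.

```python
import torch

def cvar_objective(returns, alpha=0.1):
    """Sample-based CVaR: mean of the worst alpha-fraction of N sampled returns.

    `returns` is a tensor of N differentiable return estimates, e.g. critic
    values for N parameter vectors drawn from the uncertainty set (the paper
    uses N = 50). The alpha level and this exact formulation are assumptions.
    """
    n_worst = max(1, int(alpha * returns.numel()))
    worst, _ = torch.topk(returns, n_worst, largest=False)  # lowest returns
    return worst.mean()

# Usage sketch: maximize the CVaR of critic estimates over sampled parameters.
# params = uncertainty_set.sample((50,))                  # hypothetical sampler
# returns = critic(state, policy(state, params), params)  # hypothetical critic
# loss = -cvar_objective(returns, alpha=0.1)
# loss.backward()
```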
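The Open Datasets row notes that each environment varies one or more parameters affecting the dynamics and/or reward. As an illustration only, the sketch below scales the body masses of the Gym Half-Cheetah environment; the choice of parameter, the scaling mechanism, and the `HalfCheetah-v2` / mujoco_py setup are assumptions rather than the paper's actual configuration.

```python
import numpy as np
import gym  # Brockman et al., 2016

def make_scaled_half_cheetah(mass_scale=1.5):
    """Illustrative only: build a Half-Cheetah variant with scaled body masses.

    Assumes the mujoco_py-based HalfCheetah-v2 environment, whose body masses
    are exposed as a writable array on the underlying MuJoCo model.
    """
    env = gym.make("HalfCheetah-v2")
    model = env.unwrapped.model
    model.body_mass[:] = np.asarray(model.body_mass) * mass_scale
    return env
```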