Offline Meta-Reinforcement Learning with Online Self-Supervision

Authors: Vitchyr H. Pong, Ashvin V. Nair, Laura M. Smith, Catherine Huang, Sergey Levine

ICML 2022

Reproducibility assessment. Each entry below lists the reproducibility variable, the assessed result, and the supporting LLM response.

Research Type: Experimental
LLM Response: We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks. ... We evaluate our method and prior offline meta-RL methods on a number of benchmarks ... We find that, while standard meta-RL methods perform well at adapting to training tasks, they suffer from data-distribution shifts when adapting to new tasks. In contrast, our method attains significantly better performance, on par with an online meta-RL method that receives fully labeled online interaction data.

Researcher Affiliation: Academia
LLM Response: University of California, Berkeley. Correspondence to: Vitchyr H. Pong <vitchyr@eecs.berkeley.edu>.

Pseudocode: Yes
LLM Response: A. Method Pseudo-code: We present the pseudo-code for SMAC in Algorithm 1. Algorithm 1: Semi-Supervised Meta Actor-Critic. (An illustrative, non-authoritative sketch of such a loop follows this assessment.)

Open Source Code: No
LLM Response: The paper refers to open-sourced code for prior works (PEARL, BOReL, MACAW) that were used for comparison, but does not provide a statement or link to the source code for the proposed method (SMAC).

Open Datasets: Yes
LLM Response: We first evaluate our method on multiple simulated MuJoCo (Todorov et al., 2012) meta-learning tasks... We also evaluated SMAC on a significantly more diverse robot manipulation meta-learning task called Sawyer Manipulation, based on the goal-conditioned environment introduced by Khazatsky et al. (2021). This is a simulated PyBullet environment (Coumans & Bai, 2016-2021). ... For the MuJoCo tasks, we use the replay buffer from a single PEARL run with ground-truth reward. ... The offline data is collected by running PEARL (Rakelly et al., 2019) on this meta-RL task with 100 pre-sampled target velocities.

Dataset Splits: No
LLM Response: The paper discusses meta-training on a set of tasks and meta-testing on held-out tasks. However, it does not provide specific training/validation/test data splits (e.g., percentages or sample counts) for the datasets themselves. It mentions "number of test tasks" but not data splits within those tasks. (A sketch of a typical meta-train/meta-test task split follows this assessment.)

Hardware Specification: No
LLM Response: The paper does not provide any specific hardware specifications (e.g., CPU/GPU models, memory) used for running the experiments.

Software Dependencies: No
LLM Response: The paper mentions simulation environments such as MuJoCo (Todorov et al., 2012) and PyBullet (Coumans & Bai, 2016-2021), citing them by year, but it does not specify exact version numbers for these or any other software libraries, frameworks, or programming languages.

Experiment Setup: Yes
LLM Response: Table 1: SMAC Hyperparameters for Self-Supervised Phase. Includes 'RL batch size 256', 'encoder batch size 64', 'meta batch size 4', 'Q-network hidden sizes [300, 300, 300]', 'policy network hidden sizes [300, 300, 300]', 'learning rate 3 × 10^-4', etc. Table 2: Environment-Specific SMAC Hyperparameters. (These Table 1 values are gathered into a configuration sketch following this assessment.)
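
The Pseudocode entry confirms that the paper includes Algorithm 1 (Semi-Supervised Meta Actor-Critic) but does not reproduce it here. The Python sketch below is a minimal, hypothetical illustration of what a semi-supervised meta actor-critic loop of this kind could look like, based only on the descriptions quoted above. Every function and attribute name (env.rollout, reward_model.label, sample_context, and so on) is an assumption; this is not the authors' Algorithm 1 or their released code.

```python
# Hypothetical skeleton only: the loop structure and all names below are
# illustrative assumptions. Consult Algorithm 1 in the paper for the actual
# Semi-Supervised Meta Actor-Critic procedure.

def self_supervised_phase(env, buffer, encoder, reward_model, actor, critic,
                          num_iterations, meta_batch_size=4):
    """Sketch of an online self-supervised phase for an offline-meta-trained agent."""
    for _ in range(num_iterations):
        # Collect online interaction data without ground-truth reward labels.
        unlabeled_trajectory = env.rollout(actor)

        # Label the new data with a reward model trained on the offline,
        # reward-labeled dataset, then add it to the replay buffer.
        labeled_trajectory = reward_model.label(unlabeled_trajectory)
        buffer.add(labeled_trajectory)

        # Meta-train the actor and critic on a small batch of tasks,
        # conditioning both on a latent encoding of task context.
        for task in buffer.sample_tasks(meta_batch_size):
            context = buffer.sample_context(task)
            task_encoding = encoder(context)
            batch = buffer.sample_batch(task)
            critic.update(batch, task_encoding)
            actor.update(batch, task_encoding)
```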
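
For the Dataset Splits entry, the sketch below shows one common way meta-RL experiments hold out tasks for meta-testing. It is illustrative only: the 80/20 ratio and the split_tasks helper are assumptions, and the 100 target velocities merely echo the task count quoted in the Open Datasets entry; the paper itself does not report an explicit split.

```python
import random

# Illustrative only: the paper reports meta-training on some tasks and
# evaluating on held-out tasks, but gives no explicit split; the ratio
# and helper below are assumptions for demonstration purposes.

def split_tasks(tasks, holdout_fraction=0.2, seed=0):
    """Split a list of task parameters into meta-train and meta-test sets."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * holdout_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Example task set: 100 target velocities, echoing the count quoted above.
target_velocities = [round(0.1 * i, 1) for i in range(1, 101)]
meta_train_tasks, meta_test_tasks = split_tasks(target_velocities)
print(len(meta_train_tasks), "meta-train tasks,", len(meta_test_tasks), "held-out meta-test tasks")
```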
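
Finally, the hyperparameters quoted from Table 1 in the Experiment Setup entry can be gathered into a single configuration for reference. Only the numeric values come from the paper; the dictionary keys and the surrounding code are assumptions, not the authors' implementation.

```python
# Values are quoted from Table 1 (SMAC Hyperparameters for Self-Supervised
# Phase); the key names and this layout are illustrative assumptions.

smac_self_supervised_config = {
    "rl_batch_size": 256,                    # RL batch size
    "encoder_batch_size": 64,                # encoder batch size
    "meta_batch_size": 4,                    # meta batch size
    "qf_hidden_sizes": [300, 300, 300],      # Q-network hidden sizes
    "policy_hidden_sizes": [300, 300, 300],  # policy network hidden sizes
    "learning_rate": 3e-4,                   # learning rate 3 × 10^-4
}

if __name__ == "__main__":
    for name, value in smac_self_supervised_config.items():
        print(f"{name}: {value}")
```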