Offline Meta-Reinforcement Learning with Online Self-Supervision
Authors: Vitchyr H. Pong, Ashvin V. Nair, Laura M. Smith, Catherine Huang, Sergey Levine
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks. ... We evaluate our method and prior offline meta-RL methods on a number of benchmarks ... We find that, while standard meta-RL methods perform well at adapting to training tasks, they suffer from data-distribution shifts when adapting to new tasks. In contrast, our method attains significantly better performance, on par with an online meta-RL method that receives fully labeled online interaction data. |
| Researcher Affiliation | Academia | University of California, Berkeley. Correspondence to: Vitchyr H. Pong <vitchyr@eecs.berkeley.edu>. |
| Pseudocode | Yes | A. Method Pseudo-code. We present the pseudo-code for SMAC in Algorithm 1. Algorithm 1: Semi-Supervised Meta Actor-Critic. (An illustrative, non-authoritative sketch of this two-phase training loop appears directly after this table.) |
| Open Source Code | No | The paper refers to open-sourced code for *prior works* (PEARL, BOReL, MACAW) that were used for comparison, but does not provide a statement or link to the source code for the proposed method (SMAC). |
| Open Datasets | Yes | We first evaluate our method on multiple simulated MuJoCo (Todorov et al., 2012) meta-learning tasks... We also evaluated SMAC on a significantly more diverse robot manipulation meta-learning task called Sawyer Manipulation, based on the goal-conditioned environment introduced by Khazatsky et al. (2021). This is a simulated PyBullet environment (Coumans & Bai, 2016–2021). ... For the MuJoCo tasks, we use the replay buffer from a single PEARL run with ground-truth reward. ... The offline data is collected by running PEARL (Rakelly et al., 2019) on this meta-RL task with 100 pre-sampled target velocities. |
| Dataset Splits | No | The paper discusses meta-training on a set of tasks and meta-testing on held-out tasks. However, it does not provide specific training/validation/test data splits (e.g., percentages or sample counts) for the datasets themselves. It mentions "number of test tasks" but not data splits within those tasks. |
| Hardware Specification | No | The paper does not provide any specific hardware specifications (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions simulation environments like "MuJoCo (Todorov et al., 2012)" and "PyBullet (Coumans & Bai, 2016–2021)" with years, but it does not specify exact version numbers for these or any other software libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | Table 1: SMAC Hyperparameters for Self-Supervised Phase. Includes 'RL batch size 256', 'encoder batch size 64', 'meta batch size 4', 'Q-network hidden sizes [300, 300, 300]', 'policy network hidden sizes [300, 300, 300]', 'learning rate 3 × 10⁻⁴', etc. Table 2: Environment Specific SMAC Hyperparameters. (These quoted values are also collected into a configuration sketch after the table.) |
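
The Pseudocode row above points to Algorithm 1 (Semi-Supervised Meta Actor-Critic). Below is a minimal Python sketch of the two training phases implied by the table rows: an offline meta-training phase on reward-labeled data and a self-supervised phase that uses reward-free online rollouts. All names here (`encoder`, `reward_decoder`, `actor_critic_update`, `collect_rollout`) are hypothetical placeholders rather than the authors' API, and the step that labels online transitions with a learned reward model is an assumption about how the "unsupervised online data collection" is consumed, not a quote from the paper.

```python
# Illustrative sketch only: a skeleton of SMAC-style offline meta-training
# followed by a self-supervised online phase. Batch sizes mirror the values
# quoted in the Experiment Setup row; everything else is a placeholder.
import random


def encoder(context):
    """Stub context encoder: maps a small context batch to a task embedding z."""
    rewards = [t["reward"] for t in context if t["reward"] is not None]
    return sum(rewards) / len(rewards) if rewards else 0.0


def reward_decoder(transition, z):
    """Stub learned reward model used to label reward-free transitions (assumption)."""
    return z + random.uniform(-0.1, 0.1)


def actor_critic_update(batch, z):
    """Stub task-conditioned actor-critic update (no-op placeholder)."""
    pass


def sample(buffer, n):
    """Sample up to n transitions from a task buffer."""
    return random.sample(buffer, k=min(n, len(buffer)))


def offline_phase(offline_buffers, steps, meta_batch=4, rl_batch=256, enc_batch=64):
    """Meta-train on reward-labeled offline data (e.g., saved PEARL replay buffers)."""
    tasks = list(offline_buffers)
    for _ in range(steps):
        for task in random.sample(tasks, k=min(meta_batch, len(tasks))):
            z = encoder(sample(offline_buffers[task], enc_batch))
            actor_critic_update(sample(offline_buffers[task], rl_batch), z)


def self_supervised_phase(offline_buffers, collect_rollout, steps,
                          rl_batch=256, enc_batch=64):
    """Collect reward-free online rollouts, label them with the learned reward
    model (assumption), and continue meta-training on the combined data."""
    tasks = list(offline_buffers)
    for _ in range(steps):
        task = random.choice(tasks)
        z = encoder(sample(offline_buffers[task], enc_batch))
        rollout = collect_rollout(task)            # environment returns no reward labels
        for t in rollout:
            t["reward"] = reward_decoder(t, z)     # self-generated reward labels
        offline_buffers[task].extend(rollout)
        actor_critic_update(sample(offline_buffers[task], rl_batch), z)
```

A dummy `collect_rollout(task)` that returns a list of `{"obs": ..., "action": ..., "reward": None}` dicts is enough to exercise both phases of this skeleton.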
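
For quick reference, the self-supervised-phase hyperparameters quoted in the Experiment Setup row can be gathered into a plain configuration dict. The dict and its key names are an editorial convenience for readability, not the authors' configuration schema.

```python
# Self-supervised-phase hyperparameters quoted from Table 1 of the paper;
# key names are paraphrases of the table's row labels.
smac_self_supervised_hparams = {
    "rl_batch_size": 256,
    "encoder_batch_size": 64,
    "meta_batch_size": 4,                          # tasks per meta-update
    "q_network_hidden_sizes": [300, 300, 300],
    "policy_network_hidden_sizes": [300, 300, 300],
    "learning_rate": 3e-4,                         # quoted as 3 × 10⁻⁴
}
```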