Offline Meta-Reinforcement Learning with Online Self-Supervision
Authors: Vitchyr H. Pong, Ashvin V. Nair, Laura M. Smith, Catherine Huang, Sergey Levine
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks. ... We evaluate our method and prior offline meta-RL methods on a number of benchmarks ... We find that, while standard meta-RL methods perform well at adapting to training tasks, they suffer from data-distribution shifts when adapting to new tasks. In contrast, our method attains significantly better performance, on par with an online meta-RL method that receives fully labeled online interaction data. |
| Researcher Affiliation | Academia | University of California, Berkeley. Correspondence to: Vitchyr H. Pong <vitchyr@eecs.berkeley.edu>. |
| Pseudocode | Yes | A. Method Pseudo-code. We present the pseudo-code for SMAC in Algorithm 1. Algorithm 1: Semi-Supervised Meta Actor-Critic. (An illustrative, non-authoritative sketch of this two-phase training loop appears directly after this table.) |
| Open Source Code | No | The paper refers to open-sourced code for *prior works* (PEARL, BOReL, MACAW) that were used for comparison, but does not provide a statement or link to the source code for the proposed method (SMAC). |
| Open Datasets | Yes | We first evaluate our method on multiple simulated MuJoCo (Todorov et al., 2012) meta-learning tasks... We also evaluated SMAC on a significantly more diverse robot manipulation meta-learning task called Sawyer Manipulation, based on the goal-conditioned environment introduced by Khazatsky et al. (2021). This is a simulated PyBullet environment (Coumans & Bai, 2016–2021). ... For the MuJoCo tasks, we use the replay buffer from a single PEARL run with ground-truth reward. ... The offline data is collected by running PEARL (Rakelly et al., 2019) on this meta-RL task with 100 pre-sampled target velocities. |
| Dataset Splits | No | The paper discusses meta-training on a set of tasks and meta-testing on held-out tasks. However, it does not provide specific training/validation/test data splits (e.g., percentages or sample counts) for the datasets themselves. It mentions "number of test tasks" but not data splits within those tasks. |
| Hardware Specification | No | The paper does not provide any specific hardware specifications (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions simulation environments like "MuJoCo (Todorov et al., 2012)" and "PyBullet (Coumans & Bai, 2016–2021)" with years, but it does not specify exact version numbers for these or any other software libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | Table 1: SMAC Hyperparameters for Self-Supervised Phase. Includes 'RL batch size 256', 'encoder batch size 64', 'meta batch size 4', 'Q-network hidden sizes [300, 300, 300]', 'policy network hidden sizes [300, 300, 300]', 'learning rate 3 × 10⁻⁴', etc. Table 2: Environment Specific SMAC Hyperparameters. (These quoted values are also collected into a configuration sketch after the table.) |
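
The Pseudocode row above points to Algorithm 1 (Semi-Supervised Meta Actor-Critic). Below is a minimal Python sketch of the two training phases implied by the table rows: an offline meta-training phase on reward-labeled data and a self-supervised phase that uses reward-free online rollouts. All names here (`encoder`, `reward_decoder`, `actor_critic_update`, `collect_rollout`) are hypothetical placeholders rather than the authors' API, and the step that labels online transitions with a learned reward model is an assumption about how the "unsupervised online data collection" is consumed, not a quote from the paper.

```python
# Illustrative sketch only: a skeleton of SMAC-style offline meta-training
# followed by a self-supervised online phase. Batch sizes mirror the values
# quoted in the Experiment Setup row; everything else is a placeholder.
import random


def encoder(context):
    """Stub context encoder: maps a small context batch to a task embedding z."""
    rewards = [t["reward"] for t in context if t["reward"] is not None]
    return sum(rewards) / len(rewards) if rewards else 0.0


def reward_decoder(transition, z):
    """Stub learned reward model used to label reward-free transitions (assumption)."""
    return z + random.uniform(-0.1, 0.1)


def actor_critic_update(batch, z):
    """Stub task-conditioned actor-critic update (no-op placeholder)."""
    pass


def sample(buffer, n):
    """Sample up to n transitions from a task buffer."""
    return random.sample(buffer, k=min(n, len(buffer)))


def offline_phase(offline_buffers, steps, meta_batch=4, rl_batch=256, enc_batch=64):
    """Meta-train on reward-labeled offline data (e.g., saved PEARL replay buffers)."""
    tasks = list(offline_buffers)
    for _ in range(steps):
        for task in random.sample(tasks, k=min(meta_batch, len(tasks))):
            z = encoder(sample(offline_buffers[task], enc_batch))
            actor_critic_update(sample(offline_buffers[task], rl_batch), z)


def self_supervised_phase(offline_buffers, collect_rollout, steps,
                          rl_batch=256, enc_batch=64):
    """Collect reward-free online rollouts, label them with the learned reward
    model (assumption), and continue meta-training on the combined data."""
    tasks = list(offline_buffers)
    for _ in range(steps):
        task = random.choice(tasks)
        z = encoder(sample(offline_buffers[task], enc_batch))
        rollout = collect_rollout(task)            # environment returns no reward labels
        for t in rollout:
            t["reward"] = reward_decoder(t, z)     # self-generated reward labels
        offline_buffers[task].extend(rollout)
        actor_critic_update(sample(offline_buffers[task], rl_batch), z)
```

A dummy `collect_rollout(task)` that returns a list of `{"obs": ..., "action": ..., "reward": None}` dicts is enough to exercise both phases of this skeleton.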
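
For quick reference, the self-supervised-phase hyperparameters quoted in the Experiment Setup row can be gathered into a plain configuration dict. The dict and its key names are an editorial convenience for readability, not the authors' configuration schema.

```python
# Self-supervised-phase hyperparameters quoted from Table 1 of the paper;
# key names are paraphrases of the table's row labels.
smac_self_supervised_hparams = {
    "rl_batch_size": 256,
    "encoder_batch_size": 64,
    "meta_batch_size": 4,                          # tasks per meta-update
    "q_network_hidden_sizes": [300, 300, 300],
    "policy_network_hidden_sizes": [300, 300, 300],
    "learning_rate": 3e-4,                         # quoted as 3 × 10⁻⁴
}
```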