On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning

Authors: Mandi Zhao, Pieter Abbeel, Stephen James

NeurIPS 2022 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We therefore investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks.
Researcher Affiliation | Academia | Zhao Mandi, Pieter Abbeel, Stephen James, {mandi.zhao, pabbeel, stepjam}@berkeley.edu, University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Training On-policy Meta-RL (an informal sketch follows the table)
Open Source Code | No | The paper provides a project website link (https://sites.google.com/berkeley.edu/finetune-vs-metarl) but does not explicitly state that the source code for the methodology is available there, nor does it provide a direct link to a code repository within the paper itself.
Open Datasets | Yes | We use 3 existing RL benchmarks: Procgen [13], RLBench [8], and Atari Learning Environment (ALE) [14, 15]; each offers a diverse set of distinct tasks that we use for train and test.
Dataset Splits | No | The paper describes train and test sets (e.g., '10,000 levels for training, and held-out 20 levels for testing' for Procgen, and '10 tasks' for training and '5 unseen tasks' for testing in RLBench/Atari) but does not explicitly mention a separate validation set for hyperparameter tuning or model selection.
Hardware Specification | Yes | All experiments in the study are trained and evaluated on a maximum of 8 RTX A5000 GPUs, each with 24GB of memory.
Software Dependencies | No | The paper mentions various base RL algorithms (PPO, C2F-ARM, Rainbow DQN) and architectures (IMPALA encoder) along with their corresponding research papers, but it does not specify software package names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x, or specific library versions) that are critical for reproducibility.
Experiment Setup | Yes | For each iteration in the inner loop, the environment is fixed to one randomly sampled task level, where rollouts are collected and used to perform k iterations of batched gradient updates (we use k = 3 in our experiments). 100 million environment steps are used for all pretraining runs, which amounts to 1500 PPO iterations. Then the final model checkpoint is used to test adaptation. The environment is run in parallel on 256 threads and each PPO iteration uses 256 steps. For Reptile-PPO and MT-PPO, a trained agent is finetuned with vanilla PPO for 2 million environment steps on each test level. (A quick step-budget check also follows the table.)
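
The Pseudocode row notes that the paper includes Algorithm 1 (Training On-policy Meta-RL). The snippet below is a minimal, informal sketch of such a loop, written as Reptile-style PPO meta-training based on the details quoted in the Experiment Setup row; it is not the authors' code, and `collect_rollouts`, `ppo_update`, and the toy parameter vector are placeholder stand-ins assumed purely for illustration.

```python
# Minimal, informal sketch of an on-policy meta-RL training loop in the spirit of
# the paper's Algorithm 1 (Reptile-style PPO meta-training). This is NOT the
# authors' implementation; collect_rollouts and ppo_update are placeholder
# stand-ins, and the toy "policy" is just a small parameter vector.
import random

import numpy as np


def collect_rollouts(params, level, num_envs=256, steps_per_env=256):
    """Placeholder: a real run would step the sampled task level in parallel envs."""
    rng = np.random.default_rng(level)
    return rng.standard_normal((num_envs * steps_per_env, params.size))


def ppo_update(params, rollouts, lr=3e-4):
    """Placeholder for one batched PPO gradient update computed from the rollouts."""
    fake_gradient = rollouts.mean(axis=0)  # stands in for the clipped PPO gradient
    return params - lr * fake_gradient


def train_reptile_ppo(num_train_levels=10_000, outer_iters=1500, k=3, meta_lr=0.1):
    theta = np.zeros(8)  # meta-parameters (tiny stand-in for network weights)
    for _ in range(outer_iters):
        level = random.randrange(num_train_levels)  # fix one randomly sampled task level
        phi = theta.copy()
        for _ in range(k):  # k = 3 inner-loop gradient updates, as quoted above
            rollouts = collect_rollouts(phi, level)
            phi = ppo_update(phi, rollouts)
        theta = theta + meta_lr * (phi - theta)  # Reptile-style outer (meta) update
    return theta


if __name__ == "__main__":
    train_reptile_ppo(outer_iters=2)  # tiny smoke test; full-scale run uses 1500 iterations
```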
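
As a quick arithmetic check on the numbers quoted in the Experiment Setup row (a back-of-the-envelope calculation, not taken from the paper): 256 parallel environments times 256 steps per PPO iteration is 65,536 environment steps per iteration, so roughly 1,500 iterations indeed consume about 100 million steps. The per-level fine-tuning figure below assumes the same batch size at test time, which the quoted text does not state.

```python
# Back-of-the-envelope check of the quoted training budget (illustrative only).
num_envs = 256          # parallel environment threads
steps_per_env = 256     # env steps per thread per PPO iteration
steps_per_iter = num_envs * steps_per_env       # 65,536 env steps per PPO iteration

pretrain_iters = 100_000_000 / steps_per_iter   # ~1526, consistent with "1500 PPO iterations"
finetune_iters = 2_000_000 / steps_per_iter     # ~30.5 (assumes the same batch size at test time)

print(f"{steps_per_iter=}, pretrain_iters~{pretrain_iters:.0f}, finetune_iters~{finetune_iters:.1f}")
```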