On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning

Authors: Mandi Zhao, Pieter Abbeel, Stephen James

NeurIPS 2022 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We therefore investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks.
Researcher Affiliation | Academia | Zhao Mandi, Pieter Abbeel, Stephen James, {mandi.zhao, pabbeel, stepjam}@berkeley.edu, University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Training On-policy Meta-RL (an informal sketch follows the table)
Open Source Code | No | The paper provides a project website link (https://sites.google.com/berkeley.edu/finetune-vs-metarl) but does not explicitly state that the source code for the methodology is available there, nor does it provide a direct link to a code repository within the paper itself.
Open Datasets | Yes | We use 3 existing RL benchmarks: Procgen [13], RLBench [8], and Atari Learning Environment (ALE) [14, 15]; each offers a diverse set of distinct tasks that we use for train and test.
Dataset Splits | No | The paper describes train and test sets (e.g., '10,000 levels for training, and held-out 20 levels for testing' for Procgen, and '10 tasks' for training and '5 unseen tasks' for testing in RLBench/Atari) but does not explicitly mention a separate validation set for hyperparameter tuning or model selection.
Hardware Specification | Yes | All experiments in the study are trained and evaluated on a maximum of 8 RTX A5000 GPUs, each with 24GB of memory.
Software Dependencies | No | The paper mentions various base RL algorithms (PPO, C2F-ARM, Rainbow DQN) and architectures (IMPALA encoder) along with their corresponding research papers, but it does not specify software package names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x, or specific library versions) that are critical for reproducibility.
Experiment Setup | Yes | For each iteration in the inner loop, the environment is fixed to one randomly sampled task level, where rollouts are collected and used to perform k iterations of batched gradient updates (we use k = 3 in our experiments). 100 million environment steps are used for all pretraining runs, which amounts to 1500 PPO iterations. Then the final model checkpoint is used to test adaptation. The environment is run in parallel on 256 threads and each PPO iteration uses 256 steps. For Reptile-PPO and MT-PPO, a trained agent is finetuned with vanilla PPO for 2 million environment steps on each test level. (A quick step-budget check also follows the table.)
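
The Pseudocode row notes that the paper includes Algorithm 1 (Training On-policy Meta-RL). The snippet below is a minimal, informal sketch of such a loop, written as Reptile-style PPO meta-training based on the details quoted in the Experiment Setup row; it is not the authors' code, and `collect_rollouts`, `ppo_update`, and the toy parameter vector are placeholder stand-ins assumed purely for illustration.

```python
# Minimal, informal sketch of an on-policy meta-RL training loop in the spirit of
# the paper's Algorithm 1 (Reptile-style PPO meta-training). This is NOT the
# authors' implementation; collect_rollouts and ppo_update are placeholder
# stand-ins, and the toy "policy" is just a small parameter vector.
import random

import numpy as np


def collect_rollouts(params, level, num_envs=256, steps_per_env=256):
    """Placeholder: a real run would step the sampled task level in parallel envs."""
    rng = np.random.default_rng(level)
    return rng.standard_normal((num_envs * steps_per_env, params.size))


def ppo_update(params, rollouts, lr=3e-4):
    """Placeholder for one batched PPO gradient update computed from the rollouts."""
    fake_gradient = rollouts.mean(axis=0)  # stands in for the clipped PPO gradient
    return params - lr * fake_gradient


def train_reptile_ppo(num_train_levels=10_000, outer_iters=1500, k=3, meta_lr=0.1):
    theta = np.zeros(8)  # meta-parameters (tiny stand-in for network weights)
    for _ in range(outer_iters):
        level = random.randrange(num_train_levels)  # fix one randomly sampled task level
        phi = theta.copy()
        for _ in range(k):  # k = 3 inner-loop gradient updates, as quoted above
            rollouts = collect_rollouts(phi, level)
            phi = ppo_update(phi, rollouts)
        theta = theta + meta_lr * (phi - theta)  # Reptile-style outer (meta) update
    return theta


if __name__ == "__main__":
    train_reptile_ppo(outer_iters=2)  # tiny smoke test; full-scale run uses 1500 iterations
```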
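
As a quick arithmetic check on the numbers quoted in the Experiment Setup row (a back-of-the-envelope calculation, not taken from the paper): 256 parallel environments times 256 steps per PPO iteration is 65,536 environment steps per iteration, so roughly 1,500 iterations indeed consume about 100 million steps. The per-level fine-tuning figure below assumes the same batch size at test time, which the quoted text does not state.

```python
# Back-of-the-envelope check of the quoted training budget (illustrative only).
num_envs = 256          # parallel environment threads
steps_per_env = 256     # env steps per thread per PPO iteration
steps_per_iter = num_envs * steps_per_env       # 65,536 env steps per PPO iteration

pretrain_iters = 100_000_000 / steps_per_iter   # ~1526, consistent with "1500 PPO iterations"
finetune_iters = 2_000_000 / steps_per_iter     # ~30.5 (assumes the same batch size at test time)

print(f"{steps_per_iter=}, pretrain_iters~{pretrain_iters:.0f}, finetune_iters~{finetune_iters:.1f}")
```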