On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning
Authors: Mandi Zhao, Pieter Abbeel, Stephen James
Venue: NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We therefore investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks. |
| Researcher Affiliation | Academia | Zhao Mandi, Pieter Abbeel, Stephen James; {mandi.zhao, pabbeel, stepjam}@berkeley.edu; University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1 Training On-policy Meta-RL |
| Open Source Code | No | The paper provides a project website link (https://sites.google.com/berkeley.edu/finetune-vs-metarl) but does not explicitly state that the source code for the methodology is available there, nor does it provide a direct link to a code repository within the paper itself. |
| Open Datasets | Yes | We use 3 existing RL benchmarks: Procgen [13], RLBench [8], and the Atari Learning Environment (ALE) [14, 15]; each offers a diverse set of distinct tasks that we use for train and test. |
| Dataset Splits | No | The paper describes train and test sets (e.g., '10,000 levels for training, and held-out 20 levels for testing' for Procgen, and '10 tasks' for training and '5 unseen tasks' for testing in RLBench/Atari) but does not explicitly mention a separate validation set for hyperparameter tuning or model selection. A minimal sketch of such a Procgen level split is given after the table. |
| Hardware Specification | Yes | All experiments in the study are trained and evaluated on a maximum of 8 RTX A5000 GPUs, each with 24GB of memory. |
| Software Dependencies | No | The paper mentions various base RL algorithms (PPO, C2F-ARM, Rainbow DQN) and architectures (IMPALA encoder) along with their corresponding research papers, but it does not specify software package names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x, or specific library versions) that are critical for reproducibility. |
| Experiment Setup | Yes | For each iteration in the inner loop, the environment is fixed to one randomly sampled task level, where rollouts are collected and used to perform k iterations of batched gradient updates (we use k = 3 in our experiments). 100 million environment steps are used for all pretraining runs, which amounts to 1500 PPO iterations. Then the final model checkpoint is used to test adaptation. The environment is run in parallel on 256 threads and each PPO iteration uses 256 steps. For Reptile-PPO and MT-PPO, a trained agent is finetuned with vanilla PPO for 2 million environment steps on each test level. A hedged sketch of this training and fine-tuning loop follows the table. |
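
The Dataset Splits row quotes a Procgen protocol of 10,000 training levels and 20 held-out test levels. The snippet below is a minimal sketch of how such a split can be expressed through Procgen's public Gym interface; the environment name, the `start_level` offset for the test set, and `distribution_mode` are illustrative assumptions, not values confirmed by the paper.

```python
import gym

# Minimal sketch of the quoted Procgen split: 10,000 levels for training and
# 20 held-out levels for testing. Environment name and distribution_mode are
# assumptions, not taken from the paper.
def make_train_env(env_name="coinrun"):
    # Procedurally generated levels 0 .. 9,999 are reused throughout training.
    return gym.make(
        f"procgen:procgen-{env_name}-v0",
        num_levels=10_000,
        start_level=0,
        distribution_mode="easy",
    )

def make_test_env(env_name="coinrun"):
    # 20 levels starting at an offset outside the training range, so they are
    # never seen during (meta-)training.
    return gym.make(
        f"procgen:procgen-{env_name}-v0",
        num_levels=20,
        start_level=10_000,
        distribution_mode="easy",
    )
```

Placing the held-out block at a `start_level` beyond the training range guarantees the two sets of procedurally generated levels never overlap.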
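
The Experiment Setup row describes a Reptile-style outer loop around PPO: fix the environment to one randomly sampled training level, collect rollouts with 256 parallel environments for 256 steps, apply k = 3 batched gradient updates, move the meta-parameters toward the adapted weights, and at test time fine-tune the final checkpoint with vanilla PPO for 2 million steps per level. The sketch below illustrates that structure only; `collect_rollouts`, `ppo_update`, and the outer step size `REPTILE_LR` are hypothetical stand-ins (a PyTorch-style `parameters()` interface is assumed), not the authors' implementation.

```python
import copy
import random

# Constants mirror the quoted setup: k = 3 inner updates, 256 parallel envs,
# 256 steps per iteration, ~1500 outer iterations (~100M env steps).
K_INNER = 3
NUM_ENVS, ROLLOUT_STEPS = 256, 256
OUTER_ITERS = 1500
REPTILE_LR = 0.1  # outer-loop step size: an assumption, not reported in the table


def sample_task_level(num_levels=10_000):
    # Pick one of the procedurally generated training levels at random.
    return random.randrange(num_levels)


def reptile_ppo_pretrain(policy, make_env, collect_rollouts, ppo_update):
    for _ in range(OUTER_ITERS):
        env = make_env(level=sample_task_level())   # fix env to one sampled level
        inner_policy = copy.deepcopy(policy)        # inner-loop copy of meta-parameters
        rollouts = collect_rollouts(inner_policy, env, NUM_ENVS, ROLLOUT_STEPS)
        for _ in range(K_INNER):
            ppo_update(inner_policy, rollouts)      # k batched gradient updates
        # Reptile outer step: move meta-parameters toward the adapted weights
        # (assumes a PyTorch-style parameters() interface).
        for p_meta, p_inner in zip(policy.parameters(), inner_policy.parameters()):
            p_meta.data.add_(REPTILE_LR * (p_inner.data - p_meta.data))
    return policy


def finetune_on_test_level(policy, env, collect_rollouts, ppo_update,
                           total_steps=2_000_000):
    # Adaptation: vanilla PPO fine-tuning for 2M environment steps on one test level.
    steps = 0
    while steps < total_steps:
        rollouts = collect_rollouts(policy, env, NUM_ENVS, ROLLOUT_STEPS)
        ppo_update(policy, rollouts)
        steps += NUM_ENVS * ROLLOUT_STEPS
    return policy
```

At 256 threads times 256 steps, 1500 outer iterations come to roughly 98M environment steps, which matches the quoted "100 million environment steps" budget.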