Investigating Multi-task Pretraining and Generalization in Reinforcement Learning

Authors: Adrien Ali Taïga, Rishabh Agarwal, Jesse Farebrother, Aaron Courville, Marc G. Bellemare

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better. Surprisingly, we also observe that this advantage can still be present after fine-tuning for 200M environment frames, more so than when doing zero-shot transfer. This highlights the potential effect of a good learned representation. We also find that, even though small networks have remained popular to solve Atari 2600 games, increasing the capacity of the value and policy network is critical to achieve good performance as we increase the number of pretraining modes and difficulties. From Section 4 (Experiments): Using the above methodology, we now present our results on multi-task pretraining and generalization using the game variants in ALE.
Researcher Affiliation | Collaboration | Adrien Ali Taïga (Mila, Université de Montréal; Google Brain) alitaiga@google.com; Rishabh Agarwal (Mila, Université de Montréal; Google Brain) rishabhagarwal@google.com; Jesse Farebrother (Mila, McGill University; Google Brain) farebroj@mila.quebec; Aaron Courville (Mila, Université de Montréal) aaron.courville@umontreal.ca; Marc G. Bellemare (Google Brain) bellemare@google.com
Pseudocode | No | The paper describes the methods used, such as IMPALA, but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | No | Finally, in line with Agarwal et al. (2022), we would open-source our pretrained models to allow researchers to reproduce our findings as well as study zero-shot generalization and fine-tuning in RL, without the excessive cost of multi-task pretraining from scratch.
Open Datasets | Yes | Arcade Learning Environment: The Arcade Learning Environment (Bellemare et al., 2013) provides an interface to Atari 2600 games and was first proposed as a platform to evaluate the abilities of reinforcement learning agents across a wide variety of environments. (See the variant-enumeration sketch after the table.)
Dataset Splits | No | The paper mentions splitting variants into train and test sets, and fine-tuning procedures, but does not explicitly state a distinct 'validation' dataset split for hyperparameter tuning or model selection for reproduction.
Hardware Specification | Yes | The pretraining phase is carried out on 32 TPU v3 cores using 6400 actors. Fine-tuning is done on 8 TPU v2 cores using 900 actors.
Software Dependencies | No | The paper mentions using the Adam optimizer with specific parameters and an efficient implementation of IMPALA, but it does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | Table 1 (Hyperparameters for Atari experiments): Image Width: 84; Image Height: 84; Grayscaling: Yes; Action Repetitions: 4; Max-pool over last N action repeat frames: 2; Frame Stacking: 4; End of episode when life lost: No; Sticky actions: Yes; Action set: Full (18 actions); Reward Clipping: [-1, 1]; Unroll Length (n): 19; Batch size: 128; Discount (γ): 0.99; Baseline loss scaling: 0.5; Entropy Regularizer: 0.01; Adam β1: 0.9; Adam β2: 0.999; Adam ϵ: 1e-8; Learning rate: 3e-4; Clip global gradient norm: 40. (See the config sketch below.)
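
To make the Table 1 settings easier to reuse, here is a minimal sketch that collects them into a single Python dictionary. The name `ATARI_IMPALA_CONFIG` and the individual key names are illustrative assumptions, not the authors' actual configuration schema; the values are copied from Table 1.

```python
# Minimal sketch: Table 1 hyperparameters gathered into one config dict.
# Key names are illustrative; values are taken from Table 1 of the paper.
ATARI_IMPALA_CONFIG = {
    # Atari preprocessing
    "image_width": 84,
    "image_height": 84,
    "grayscale": True,
    "action_repeat": 4,
    "max_pool_frames": 2,            # max-pool over the last 2 repeated frames
    "frame_stack": 4,
    "terminal_on_life_loss": False,  # episode does not end when a life is lost
    "sticky_actions": True,
    "full_action_set": True,         # full 18-action set
    "reward_clipping": (-1.0, 1.0),
    # IMPALA / optimization
    "unroll_length": 19,
    "batch_size": 128,
    "discount": 0.99,
    "baseline_loss_scaling": 0.5,
    "entropy_regularizer": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "learning_rate": 3e-4,
    "max_global_grad_norm": 40.0,    # clip global gradient norm
}
```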
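
The game variants referenced in the Open Datasets and Dataset Splits rows are the mode and difficulty combinations exposed by the ALE. Below is a minimal sketch, not taken from the paper, of how such variants can be enumerated with the `ale-py` package; the Breakout ROM import is version-dependent and assumes the Atari ROMs are installed locally.

```python
# Minimal sketch (assumes ale-py with the Atari ROMs installed): enumerate the
# (mode, difficulty) variants of a game, which is how the ALE exposes variants.
from ale_py import ALEInterface
from ale_py.roms import Breakout  # ROM access differs across ale-py versions

ale = ALEInterface()
ale.setFloat("repeat_action_probability", 0.25)  # sticky actions, as in Table 1
ale.loadROM(Breakout)

modes = ale.getAvailableModes()
difficulties = ale.getAvailableDifficulties()
variants = [(m, d) for m in modes for d in difficulties]
print(f"Breakout exposes {len(variants)} (mode, difficulty) variants")

# Activate one variant before collecting experience for that task.
mode, difficulty = variants[0]
ale.setMode(mode)
ale.setDifficulty(difficulty)
ale.reset_game()
```

A train/test split over the variants, as discussed in the Dataset Splits row, would then simply partition this list.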