Transformers are Meta-Reinforcement Learners

Authors: Luckeciano C. Melo

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments." (Section 5, Experiments and Discussion)
Researcher Affiliation | Collaboration | Luckeciano C. Melo: Microsoft, USA; Center of Excellence in Artificial Intelligence (Deep Learning Brazil), Brazil.
Pseudocode | Yes | Appendix G (Pseudocode), Algorithm 1: TrMRL Forward Pass. A hedged sketch of such a forward pass is given after this table.
Open Source Code | Yes | "For reproducibility (source code and hyperparameters), we refer to the released source code at https://github.com/luckeciano/transformers-metarl." "To ensure the reproducibility of our research, we released all the source code associated with our models and experimental pipeline. We refer to the supplementary material of this submission. It also includes the hyperparameters and the scripts to execute all the scenarios presented in this paper."
Open Datasets | Yes | "We considered high-dimensional, continuous control tasks for locomotion (MuJoCo) and dexterous manipulation (Meta-World)." See Appendix B.1 (MuJoCo locomotion tasks) and Appendix B.2 (Meta-World benchmark; Yu et al., 2021).
Dataset Splits | No | The paper defines task splits for meta-training and meta-testing but does not specify a distinct validation split with percentages or sample counts.
Hardware Specification | No | The paper does not report the hardware (e.g., GPU models or CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions PPO, the Adam optimizer, T-Fixup, MuJoCo, and Meta-World, but does not specify version numbers for these components or for the underlying libraries and frameworks.
Experiment Setup | Yes | "During meta-training, we repeatedly sampled a batch of tasks to collect experience with the goal of learning to learn. For each task, we ran a sequence of E episodes. During the interaction, the agent conducted exploration with a Gaussian policy. During optimization, we concatenate these episodes to form a single trajectory and we maximize the discounted cumulative reward of this trajectory. For these experiments, we considered E = 2. We performed this training via Proximal Policy Optimization (PPO) (Schulman et al., 2017), and the data batches mixed different tasks. To stabilize transformer training, we used T-Fixup as the weight initialization scheme." Ablated settings: working memory sequence length N ∈ {1, 5, 10, 20, 50}; number of layers ∈ {1, 4, 8, 12}; number of attention heads ∈ {1, 2, 4}. A schematic version of this training loop is sketched below.
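Since Algorithm 1 (the TrMRL forward pass) is only referenced above, the following is a minimal, hypothetical PyTorch sketch of a forward pass of this kind: a working memory of recent (observation, action, reward) transitions is embedded, processed by a transformer encoder, and the most recent representation parameterizes a Gaussian policy. The class and parameter names (WorkingMemoryPolicy, d_model, max_len, and so on) are illustrative assumptions, not the paper's Algorithm 1 or the released implementation.

import torch
import torch.nn as nn


class WorkingMemoryPolicy(nn.Module):
    """Transformer over a working memory of recent transitions -> Gaussian policy (illustrative)."""

    def __init__(self, obs_dim, act_dim, d_model=128, n_layers=4, n_heads=2, max_len=5):
        super().__init__()
        # Each working-memory element is an (observation, action, reward) triple.
        self.embed = nn.Linear(obs_dim + act_dim + 1, d_model)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # learned positional encoding
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mean_head = nn.Linear(d_model, act_dim)        # Gaussian mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std

    def forward(self, memory):
        # memory: (batch, N, obs_dim + act_dim + 1), the N most recent transitions.
        h = self.embed(memory) + self.pos[: memory.size(1)]
        h = self.encoder(h)
        context = h[:, -1]  # representation of the most recent timestep
        return torch.distributions.Normal(self.mean_head(context), self.log_std.exp())


# Example: sample an action for a single environment step.
policy = WorkingMemoryPolicy(obs_dim=20, act_dim=6)
memory = torch.zeros(1, 5, 20 + 6 + 1)   # batch of 1, working memory of length N = 5
action = policy(memory).sample()          # shape: (1, 6)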
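The experiment-setup row also describes a complete meta-training iteration, which the following schematic Python outline restates under stated assumptions: the helper callables (sample_tasks, collect_episode, concat_episodes, ppo_update) and the tasks_per_batch default are hypothetical placeholders, not the released pipeline's API.

E = 2  # episodes collected per sampled task, as reported in the setup


def meta_train_iteration(policy, sample_tasks, collect_episode,
                         concat_episodes, ppo_update, tasks_per_batch=16):
    """One meta-training iteration: collect E episodes per task, then one PPO update."""
    trajectories = []
    for task in sample_tasks(tasks_per_batch):
        # Exploration uses a Gaussian policy during interaction.
        episodes = [collect_episode(policy, task) for _ in range(E)]
        # The E episodes are concatenated into a single trajectory, so the
        # objective is the discounted return of the combined sequence and
        # adaptation happens within the working memory.
        trajectories.append(concat_episodes(episodes))
    # PPO (Schulman et al., 2017) optimizes over batches that mix tasks;
    # T-Fixup initialization of the transformer stabilizes this training.
    ppo_update(policy, trajectories)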