Transformers are Meta-Reinforcement Learners
Authors: Luckeciano C. Melo
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments. (Section 5, Experiments and Discussion) |
| Researcher Affiliation | Collaboration | Luckeciano C. Melo (1, 2); 1: Microsoft, USA; 2: Center of Excellence in Artificial Intelligence (Deep Learning Brazil), Brazil. |
| Pseudocode | Yes | Appendix G, Pseudocode: Algorithm 1, TrMRL Forward Pass (an illustrative forward-pass sketch follows the table). |
| Open Source Code | Yes | For reproducibility (source code and hyperparameters), we refer to the released source code at https://github.com/luckeciano/transformers-metarl. To ensure the reproducibility of our research, we released all the source code associated with our models and experimental pipeline. We refer to the supplementary material of this submission. It also includes the hyperparameters and the scripts to execute all the scenarios presented in this paper. |
| Open Datasets | Yes | We considered high-dimensional, continuous control tasks for locomotion (MuJoCo) and dexterous manipulation (Meta-World). Appendix B.1: MuJoCo Locomotion Tasks. Appendix B.2: Meta-World (Yu et al., 2021) benchmark. |
| Dataset Splits | No | The paper defines task splits for meta-training and meta-testing but does not specify a distinct validation dataset split with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as PPO, the Adam optimizer, T-Fixup, MuJoCo, and Meta-World but does not specify version numbers for any of these components or for the underlying libraries/frameworks. |
| Experiment Setup | Yes | During meta-training, we repeatedly sampled a batch of tasks to collect experience with the goal of learning to learn. For each task, we ran a sequence of E episodes, during which the agent explored with a Gaussian policy. During optimization, we concatenated these episodes to form a single trajectory and maximized its discounted cumulative reward. For these experiments, we considered E = 2. We performed this training via Proximal Policy Optimization (PPO) (Schulman et al., 2017), and the data batches mixed different tasks (see the rollout sketch after the table). To stabilize transformer training, we used T-Fixup as the weight initialization scheme. Ablations covered working memory sequence length (N = 1, 5, 10, 20, 50), number of layers (1, 4, 8, 12), and number of attention heads (1, 2, 4). |
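
The Pseudocode row points to Algorithm 1 (TrMRL Forward Pass) in Appendix G. As a rough illustration of that idea, the sketch below encodes a "working memory" of the last N transitions with a transformer encoder and outputs a Gaussian policy. This is a minimal assumption-laden sketch, not the released code: the dimensions, embedding of (obs, prev_action, prev_reward), and the class name `WorkingMemoryPolicy` are illustrative choices; only the layer/head/memory-length values echo the ablation settings quoted above.

```python
# Hypothetical sketch of a working-memory transformer policy (not the paper's code).
import torch
import torch.nn as nn


class WorkingMemoryPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, d_model=128, n_layers=4, n_heads=2, memory_len=5):
        super().__init__()
        self.memory_len = memory_len
        # Each memory entry is assumed to be a transition: (obs, prev_action, prev_reward).
        self.embed = nn.Linear(obs_dim + act_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mean_head = nn.Linear(d_model, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, memory):
        # memory: (batch, N, obs_dim + act_dim + 1) -- the last N transitions.
        tokens = self.embed(memory)
        encoded = self.encoder(tokens)
        # Act from the representation of the most recent transition.
        mean = self.mean_head(encoded[:, -1])
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)


# Usage: sample an action from a memory of N = 5 transitions (dims are placeholders).
policy = WorkingMemoryPolicy(obs_dim=17, act_dim=6)
memory = torch.zeros(1, 5, 17 + 6 + 1)
action = policy(memory).sample()
```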
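The Experiment Setup row describes sampling a batch of tasks, running E = 2 episodes per task, and concatenating those episodes into a single trajectory before the PPO update. The sketch below shows only that data-collection structure under stated assumptions: the environment interface, `DummyEnv`, `sample_tasks`, and the random Gaussian policy are stand-ins, and the PPO optimization step itself is omitted.

```python
# Hypothetical sketch of the rollout scheme: E episodes per task, concatenated
# into one trajectory, with the resulting batch mixing different tasks.
import numpy as np


def run_episode(env, policy, horizon=200):
    obs, transitions = env.reset(), []
    for _ in range(horizon):
        action = policy(obs)                        # exploration via a Gaussian policy
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return transitions


def collect_meta_batch(task_sampler, make_env, policy, batch_size=20, episodes_per_task=2):
    """Return one data batch for PPO that mixes trajectories from different tasks."""
    batch = []
    for task in task_sampler(batch_size):
        env = make_env(task)
        # Concatenate the E episodes into a single long trajectory for this task.
        trajectory = []
        for _ in range(episodes_per_task):
            trajectory.extend(run_episode(env, policy))
        batch.append(trajectory)
    return batch


# Minimal stand-ins so the sketch runs end to end (not MuJoCo / Meta-World).
class DummyEnv:
    def __init__(self, task):
        self.task = task

    def reset(self):
        return np.zeros(17)

    def step(self, action):
        return np.zeros(17), float(-np.sum(action ** 2)), False


def sample_tasks(n):
    return list(range(n))


def random_gaussian_policy(obs):
    return np.random.normal(size=6)


batch = collect_meta_batch(sample_tasks, DummyEnv, random_gaussian_policy, batch_size=4)
```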