RTFM: Generalising to New Environment Dynamics via Reading
Authors: Victor Zhong, Tim Rocktäschel, Edward Grefenstette
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. (A FiLM-style conditioning sketch appears after this table.) |
| Researcher Affiliation | Collaboration | Victor Zhong, Paul G. Allen School of Computer Science & Engineering, University of Washington, vzhong@cs.washington.edu; Tim Rocktäschel, Facebook AI Research & University College London, rockt@fb.com; Edward Grefenstette, Facebook AI Research & University College London, egrefen@fb.com |
| Pseudocode | No | The paper describes the model architecture with mathematical equations and diagrams (Figure 2, Figure 3), but it does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: "We train using TorchBeast (Küttler et al., 2019), an implementation of IMPALA (Espeholt et al., 2018). Please refer to appendix D for details." and later cites: "TorchBeast: A PyTorch Platform for Distributed RL. arXiv preprint arXiv:1910.03552, 2019. URL https://github.com/facebookresearch/torchbeast." This refers to the platform the authors used, not their specific implementation of txt2π. A link to the authors' own implementation is not provided. |
| Open Datasets | No | To necessitate reading comprehension, we expose the agent to ever changing environment dynamics and corresponding language descriptions such that it cannot avoid reading by memorising any particular environment dynamics. We procedurally generate environment dynamics and natural language templated descriptions of dynamics and goals to produce a combinatorially large number of environment dynamics to train and evaluate RTFM. |
| Dataset Splits | Yes | We split environments into train and eval sets. No assignments of monster-team-modifier-element are shared between train and eval to test whether the agent is able to generalise to new environments with dynamics not seen during training via reading. There are more than 2 million train or eval environments without considering the natural language templates, and 200 million otherwise. With random ordering of templates, the number of unique documents exceeds 15 billion. ... We train on one set of dynamics (e.g. group assignments of monsters and modifiers) and evaluate on a held-out set of dynamics. ... Table 5: Statistics of the three variations of the Rock-paper-scissors task [showing train, dev, unseen splits] (A disjoint-split sketch appears after this table.) |
| Hardware Specification | No | The paper mentions using "TorchBeast" as the training implementation but does not provide any specific hardware details such as GPU or CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | We train using TorchBeast (Küttler et al., 2019), an implementation of IMPALA (Espeholt et al., 2018). |
| Experiment Setup | Yes | We train using an implementation of IMPALA (Espeholt et al., 2018). In particular, we use 20 actors and a batch size of 24. When unrolling actors, we use a maximum unroll length of 80 frames. Each episode lasts for a maximum of 1000 frames. We optimise using RMSProp (Tieleman & Hinton, 2012) with a learning rate of 0.005, which is annealed linearly for 100 million frames. We set α = 0.99 and ϵ = 0.01. During training, we apply a small negative reward of 0.02 for each time step and a discount factor of 0.99 to facilitate convergence. We additionally include an entropy cost to encourage exploration. ... In addition to policy gradient, we add in the entropy loss with a weight of 0.005 and the baseline loss with a weight of 0.5. ... When tuning models, we perform a grid search using the training environments to select hyperparameters for each model. We train 5 runs for each configuration in order to report the mean and standard deviation. (A hyperparameter configuration sketch appears after this table.) |
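
For context on the FiLM baseline named in the Research Type row, the sketch below shows a generic FiLM-style conditioning layer, in which a text vector produces per-channel scale and shift parameters that modulate visual features. This is a minimal PyTorch illustration, not the authors' txt2π or FiLM² implementation; the class name `FiLMLayer` and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Minimal FiLM-style block (illustrative, not the paper's code):
    a text vector produces per-channel scale (gamma) and shift (beta)
    that modulate convolutional features of the grid observation."""

    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # One linear map emits both gamma and beta for every channel.
        self.film = nn.Linear(text_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, height, width); text: (batch, text_dim)
        gamma, beta = self.film(text).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # broadcast over H, W
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return torch.relu(gamma * self.conv(feats) + beta)

# Example: condition a 6x6 grid observation on a pooled document encoding.
layer = FiLMLayer(text_dim=128, channels=32)
obs_feats = torch.randn(4, 32, 6, 6)
doc_vec = torch.randn(4, 128)
out = layer(obs_feats, doc_vec)  # (4, 32, 6, 6)
```

Per the abstract quoted above, txt2π goes beyond this one-directional conditioning and captures three-way interactions between the goal, document, and observations.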
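The Dataset Splits row states that no monster-team-modifier-element assignment is shared between train and eval. The sketch below shows one way to build such a disjoint split; the vocabularies, the function name `split_assignments`, and the 80/20 ratio are illustrative assumptions, not the paper's actual generation procedure.

```python
import itertools
import random

# Illustrative vocabularies; the real RTFM lexicon and grouping differ.
monsters  = ["goblin", "jaguar", "lynx", "panther", "wolf", "bat"]
teams     = ["order of the forest", "rebel enclave", "star alliance"]
modifiers = ["grandmasters", "blessed", "shimmering", "gleaming"]
elements  = ["cold", "fire", "lightning", "poison"]

def split_assignments(seed: int = 0, eval_fraction: float = 0.2):
    """Partition all monster-team-modifier-element assignments into
    disjoint train and eval pools, so eval dynamics never appear in training."""
    rng = random.Random(seed)
    all_assignments = list(itertools.product(monsters, teams, modifiers, elements))
    rng.shuffle(all_assignments)
    n_eval = int(len(all_assignments) * eval_fraction)
    eval_set = set(all_assignments[:n_eval])
    train_set = set(all_assignments[n_eval:])
    assert train_set.isdisjoint(eval_set)
    return train_set, eval_set

train_set, eval_set = split_assignments()
```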
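The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The sketch below reflects those reported values in PyTorch; the `config` container, the placeholder network, and the scheduler wiring are assumptions rather than the authors' TorchBeast setup.

```python
import torch

# Hyperparameter values as reported in the paper; the container itself is illustrative.
config = dict(
    num_actors=20,
    batch_size=24,
    unroll_length=80,          # maximum frames per actor unroll
    max_episode_frames=1000,
    learning_rate=5e-3,        # annealed linearly over 100M frames
    total_frames=100_000_000,
    rmsprop_alpha=0.99,
    rmsprop_eps=0.01,
    step_penalty=-0.02,        # small negative reward per time step
    discounting=0.99,
    entropy_cost=0.005,
    baseline_cost=0.5,
)

model = torch.nn.Linear(10, 5)  # placeholder policy network, not txt2π
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=config["learning_rate"],
    alpha=config["rmsprop_alpha"],
    eps=config["rmsprop_eps"],
)

# Linear learning-rate anneal to zero over the full frame budget:
# each optimizer update consumes batch_size * unroll_length frames.
frames_per_update = config["batch_size"] * config["unroll_length"]
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda update: max(0.0, 1.0 - update * frames_per_update / config["total_frames"]),
)
```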