Improving Policy Learning via Language Dynamics Distillation

Authors: Victor Zhong, Jesse Mu, Luke Zettlemoyer, Edward Grefenstette, Tim Rocktäschel

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LDD on the recent SILG benchmark [Zhong et al., 2021], which consists of five diverse environments with language descriptions: NetHack [Küttler et al., 2020], ALFWorld [Shridhar et al., 2021], RTFM [Zhong et al., 2020], Messenger [Hanjie et al., 2021], and Touchdown [Chen et al., 2018]. By learning a dynamics model from cheaply obtained unlabeled demonstrations, LDD consistently outperforms reinforcement learning with language descriptions in terms of both sample efficiency and generalization performance. Moreover, we compare LDD to other techniques that inject prior knowledge: VAE pretraining [Kingma and Welling, 2013], inverse reinforcement learning [Hanna and Stone, 2017, Torabi et al., 2018], and reward shaping with a pretrained expert [Merel et al., 2017]. LDD achieves top performance on all environments in terms of task completion and reward. In addition to comparing LDD to other methods, we ablate LDD to quantify the effect of language observations in dynamics modeling and the importance of dynamics modeling with expert demonstrations. (A hedged sketch of this two-stage setup appears after the table.)
Researcher Affiliation | Collaboration | Victor Zhong1,2, Jesse Mu3, Luke Zettlemoyer1,2, Edward Grefenstette4,5 and Tim Rocktäschel4; 1University of Washington, 2Meta AI Research, 3Stanford University, 4University College London
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | For the anonymity of the review process, we will release the code to reproduce the experiments on acceptance.
Open Datasets | Yes | We evaluate Language Dynamics Distillation on the Situated Interactive Language Grounding benchmark (SILG) [Zhong et al., 2021]. SILG consists of five different language grounding environments with diverse challenges in terms of complexity of observation space, action space, language, and reasoning procedure... For NetHack, we use 100k screen recordings (where actions are not annotated and cannot be trivially reverse engineered due to ambiguity in observations) of human playthroughs from the alt.org NetHack public server.
Dataset Splits | No | The paper refers to "appendix G" for details on environment splits, but does not provide specific train/validation/test dataset split information within the main text.
Hardware Specification | Yes | All experiments were run on internal Meta AI computing clusters and utilized NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software such as moolib and TorchBeast, and the PPO algorithm, but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | No | The paper refers to "appendix G" for details on hyperparameters and training details, but these are not explicitly provided within the main text.
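
As described in the Research Type row, LDD pretrains a dynamics model on cheaply obtained, action-free demonstrations with language descriptions and then fine-tunes the policy with reinforcement learning. The following is a minimal, hypothetical PyTorch-style sketch of that two-stage idea, not the paper's implementation: the module names (Encoder, DynamicsModel, Policy), the flat observation/text features, the placeholder RL loss, and the distill_coef weight are all illustrative assumptions; the exact objectives and hyperparameters are described in the paper and its appendix G.

```python
# Hypothetical sketch of a Language Dynamics Distillation style setup.
# Assumptions (not from the paper): flat observation vectors, a toy
# next-observation objective, and an illustrative distillation weight.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Shared encoder over observation and language-description features."""
    def __init__(self, obs_dim, text_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, obs, text):
        return self.net(torch.cat([obs, text], dim=-1))


class DynamicsModel(nn.Module):
    """Stage 1 model: predict the next observation from unlabeled demos."""
    def __init__(self, obs_dim, text_dim, hidden=128):
        super().__init__()
        self.encoder = Encoder(obs_dim, text_dim, hidden)
        self.head = nn.Linear(hidden, obs_dim)

    def forward(self, obs, text):
        return self.head(self.encoder(obs, text))


class Policy(nn.Module):
    """Stage 2 policy whose encoder is initialized from the dynamics model."""
    def __init__(self, obs_dim, text_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = Encoder(obs_dim, text_dim, hidden)
        self.pi = nn.Linear(hidden, n_actions)
        self.v = nn.Linear(hidden, 1)

    def forward(self, obs, text):
        h = self.encoder(obs, text)
        return self.pi(h), self.v(h), h


obs_dim, text_dim, n_actions = 32, 16, 6
dynamics = DynamicsModel(obs_dim, text_dim)

# Stage 1: dynamics pretraining on action-free demonstration transitions
# (random tensors stand in for demonstration data here).
demo_obs = torch.randn(64, obs_dim)
demo_text = torch.randn(64, text_dim)
demo_next_obs = torch.randn(64, obs_dim)
pretrain_opt = torch.optim.Adam(dynamics.parameters(), lr=1e-3)
pretrain_loss = F.mse_loss(dynamics(demo_obs, demo_text), demo_next_obs)
pretrain_opt.zero_grad()
pretrain_loss.backward()
pretrain_opt.step()

# Stage 2: RL fine-tuning. The policy encoder starts from the pretrained
# weights, and an auxiliary term keeps its representation close to the
# frozen dynamics encoder (the coefficient and loss form are illustrative).
policy = Policy(obs_dim, text_dim, n_actions)
policy.encoder.load_state_dict(dynamics.encoder.state_dict())
for p in dynamics.parameters():
    p.requires_grad_(False)

policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
distill_coef = 1.0
obs, text = torch.randn(8, obs_dim), torch.randn(8, text_dim)
logits, value, h = policy(obs, text)
rl_loss = -(F.log_softmax(logits, dim=-1).mean() + value.mean())  # placeholder for the actual PPO loss
distill_loss = F.mse_loss(h, dynamics.encoder(obs, text))
total_loss = rl_loss + distill_coef * distill_loss
policy_opt.zero_grad()
total_loss.backward()
policy_opt.step()
```

The two pieces this sketch is meant to highlight are (a) initializing the policy encoder from the pretrained dynamics encoder and (b) keeping the frozen dynamics model as an auxiliary target during RL fine-tuning; how the paper weights and formulates that auxiliary objective should be taken from the paper itself rather than this sketch.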