Gated-Attention Architectures for Task-Oriented Language Grounding
Authors: Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states. For all the models described in section , the performance on both Multitask and Zero-shot Generalization is shown in Table 1. The performance of A3C models on Multitask Generalization during training is plotted in Figure 6. |
| Researcher Affiliation | Academia | Devendra Singh Chaplot chaplot@cs.cmu.edu School of Computer Science Carnegie Mellon University Kanthashree Mysore Sathyendra ksathyen@cs.cmu.edu School of Computer Science Carnegie Mellon University Rama Kumar Pasumarthi rpasumar@cs.cmu.edu School of Computer Science Carnegie Mellon University Dheeraj Rajagopal dheeraj@cs.cmu.edu School of Computer Science Carnegie Mellon University Ruslan Salakhutdinov rsalakhu@cs.cmu.edu School of Computer Science Carnegie Mellon University |
| Pseudocode | No | The paper describes the architecture and algorithms using text and diagrams (Figure 2, 3, 4) but does not provide formal pseudocode blocks. |
| Open Source Code | Yes | The code for the environment and the proposed model is available at https://github.com/devendrachaplot/DeepRL-Grounding |
| Open Datasets | No | The paper introduces a new environment built over ViZDoom and states, "The customizable nature of the environment enables us to create scenarios with varying levels of difficulty..." and "We provide a set of 70 manually generated instructions.", but it does not provide access information (URL, DOI, or full citation with author/year) for the dataset of instructions or environments themselves. It refers to "See the list of objects and instructions at https://goo.gl/rPWlMy" in a footnote, but this URL leads to a
| Dataset Splits | No | The paper mentions using a 'test set' for zero-shot evaluation and states that 55 instructions were used for training and 15 for testing, but it does not specify a validation split, how the data for each instruction was divided, or whether any cross-validation was performed. |
| Hardware Specification | No | The paper does not provide specific hardware details like GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions building its environment over the ViZDoom game engine, but it does not list specific software dependencies or version numbers. |
| Experiment Setup | Yes | The input to the neural network is the instruction and an RGB image of size 3x300x168. The first layer convolves the image with 128 filters of 8x8 kernel size with stride 4, followed by 64 filters of 4x4 kernel size with stride 2 and another 64 filters of 4x4 kernel size with stride 2. The architecture of the convolutional layers is adapted from previous work on playing deathmatches in Doom (Chaplot and Lample 2017). The input instruction is encoded through a Gated Recurrent Unit (GRU) (Chung et al. 2014) of size 256. For the imitation learning approach, we run experiments with Behavioral Cloning (BC) and DAgger algorithms in an online fashion, which have a data-generation and a policy-update step per outer iteration. The policy learner for imitation learning comprises a linear layer of size 512 which is fully-connected to 3 neurons to predict the policy function (i.e. probability of each action). In each data generation step, we sample state trajectories based on the oracle's policy in BC and based on a mixture of the oracle's policy and the currently learned policy in DAgger. The mixing of the policies is governed by an exploration coefficient, which has a linear decay from 1 to 0. For each state, we collect the optimal action given by the policy oracle. Then the policy is updated for 10 epochs over all the state-action pairs collected so far, using the RMSProp optimizer (Tieleman and Hinton 2012). Both methods use the Huber loss (Huber 1964) between the estimated policy and the optimal policy given by the policy oracle. For reinforcement learning, we run experiments with the A3C algorithm. The policy learning module has a linear layer of size 256 followed by an LSTM layer of size 256 which encodes the history of state observations. The LSTM layer's output is fully-connected to a single neuron to predict the value function as well as three other neurons to predict the policy function. All the convolutional layers and fully-connected linear layers have ReLU activations (Nair and Hinton 2010). The A3C model was trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.001. We used a discount factor of 0.99 for calculating expected rewards and run 16 parallel threads for each experiment. |
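The layer sizes quoted in the Experiment Setup row are enough to sketch the A3C variant of the model. Below is a minimal PyTorch sketch, assuming an image tensor of shape (3, 168, 300) and a tokenized instruction; the `A3CGroundingNet` class name, the 32-dimensional word embedding, and the sigmoid-gated fusion of the GRU embedding with the convolutional feature maps are illustrative assumptions rather than the authors' released code (linked in the Open Source Code row). The 512-unit imitation-learning head and the BC/DAgger data-collection loop are not shown.

```python
import torch
import torch.nn as nn


class A3CGroundingNet(nn.Module):
    """Hypothetical sketch of the state encoder and A3C policy module described above."""

    def __init__(self, vocab_size, num_actions=3, embed_dim=32):
        super().__init__()
        # Image encoder: 128 filters 8x8 stride 4, then two blocks of 64 filters 4x4 stride 2.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        )
        # Instruction encoder: GRU of size 256 over word embeddings (embed_dim is an assumption).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, 256, batch_first=True)
        # Gated-attention fusion (assumed form): one sigmoid gate per convolutional feature map.
        self.attn = nn.Linear(256, 64)
        # Policy module: linear layer of size 256, LSTM of size 256,
        # then one value neuron and num_actions policy neurons.
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, 3, 168, 300)).numel()
        self.fc = nn.Linear(n_flat, 256)
        self.lstm = nn.LSTMCell(256, 256)
        self.value_head = nn.Linear(256, 1)
        self.policy_head = nn.Linear(256, num_actions)

    def forward(self, image, instruction, hidden):
        # image: (B, 3, 168, 300); instruction: (B, T) token ids; hidden: (hx, cx), each (B, 256).
        feats = self.conv(image)                           # (B, 64, H', W')
        _, h = self.gru(self.embed(instruction))           # h: (1, B, 256)
        gate = torch.sigmoid(self.attn(h[-1]))             # (B, 64): one gate per feature map
        feats = feats * gate.unsqueeze(-1).unsqueeze(-1)   # elementwise gated fusion
        x = torch.relu(self.fc(feats.flatten(1)))          # linear layer of size 256
        hx, cx = self.lstm(x, hidden)                      # LSTM of size 256
        return self.policy_head(hx), self.value_head(hx), (hx, cx)
```

The training details quoted above (SGD with learning rate 0.001, discount factor 0.99, 16 asynchronous workers, and the Huber-loss/RMSProp setup for the imitation-learning baselines) belong to a separate training loop and are intentionally left out of this sketch.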