Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Authors: Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, Caiming Xiong
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our self-monitoring agent on a standard benchmark and analyze our proposed approach through a series of ablation studies that elucidate the contributions of the primary components. Using our proposed method, we set the new state of the art by a significant margin (8% absolute increase in success rate on the unseen test set). Code is available at https://github.com/chihyaoma/selfmonitoring-agent. |
| Researcher Affiliation | Collaboration | Georgia Institute of Technology: {cyma,jiasenlu,alregib,zkira}@gatech.edu; University of Maryland, College Park: {zxwu}@cs.umd.edu; Salesforce Research: {rsocher,cxiong}@salesforce.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes the model architecture and training process in text and equations. |
| Open Source Code | Yes | Code is available at https://github.com/chihyaoma/selfmonitoring-agent. |
| Open Datasets | Yes | R2R Dataset. We use the Room-to-Room (R2R) dataset (Anderson et al., 2018b) for evaluating our proposed approach. The R2R dataset is built upon the Matterport3D dataset (Chang et al., 2017) and has 7,189 paths sampled from its navigation graphs. Each path has three ground-truth navigation instructions written by humans. The whole dataset is divided into 4 sets: training, validation seen, validation unseen, and test (unseen). |
| Dataset Splits | Yes | The whole dataset is divided into 4 sets: training, validation seen, validation unseen, and test (unseen). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'ADAM as the optimizer' and components like 'LSTM' and 'MLP' but does not specify software versions for libraries (e.g., PyTorch, TensorFlow), programming languages (e.g., Python), or other ancillary software with version numbers. |
| Experiment Setup | Yes | Network architecture. The embedding dimension for encoding the navigation instruction is 256. We use a dropout layer with ratio 0.5 after the embedding layer. We then encode the instruction using a regular LSTM, and the hidden state is 512-dimensional. The MLP g used for projecting the raw image feature is BN → FC → BN → Dropout → ReLU. The FC layer projects the 2176-d input vector to a 1024-d vector, and the dropout ratio is set to be 0.5. The hidden state of the LSTM used for carrying the textual and visual information through time in Eq. 1 is 512. We set the maximum length of instruction to be 80, thus the dimension of the attention weights of textual grounding α_t is also 80. The dimensions of the learnable matrices from Eq. 2 to 5 are: W_x ∈ R^(512×512), W_v ∈ R^(512×1024), W_a ∈ R^(1024×1024), W_h ∈ R^(1536×512), and W_pm ∈ R^(592×1). Training. We use ADAM as the optimizer. The learning rate is 1e-4 with batch size of 64 consistently throughout all experiments. When using beam search, we set the beam size to be 15. We perform categorical sampling during training for action selection. |
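
The architecture and training details quoted above are concrete enough to sketch. The block below is a minimal illustration, assuming PyTorch as the framework (the paper does not name one); the class names `ImageFeatureMLP` and `InstructionEncoder`, the `vocab_size`, and the dummy batch are hypothetical stand-ins, not the authors' implementation. It only mirrors the quoted layer order (BN → FC → BN → Dropout → ReLU), the 2176→1024 projection, the 256-d embedding with 0.5 dropout, the 512-d LSTM, and the ADAM / 1e-4 / batch-size-64 training settings.

```python
import torch
import torch.nn as nn

# Hedged sketch of the image-feature projection g and the instruction encoder
# described in the "Experiment Setup" row. Framework (PyTorch), class names,
# and vocab_size are assumptions for illustration only.

class ImageFeatureMLP(nn.Module):
    """MLP g: BN -> FC -> BN -> Dropout -> ReLU, projecting 2176-d to 1024-d."""
    def __init__(self, in_dim=2176, out_dim=1024, dropout=0.5):
        super().__init__()
        self.proj = nn.Sequential(
            nn.BatchNorm1d(in_dim),      # BN on the raw 2176-d image feature
            nn.Linear(in_dim, out_dim),  # FC: 2176 -> 1024
            nn.BatchNorm1d(out_dim),     # BN on the projected feature
            nn.Dropout(dropout),         # dropout ratio 0.5
            nn.ReLU(),
        )

    def forward(self, x):
        return self.proj(x)


class InstructionEncoder(nn.Module):
    """256-d word embedding with 0.5 dropout, followed by a 512-d LSTM."""
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) word indices; max instruction length is 80
        embedded = self.dropout(self.embedding(tokens))
        outputs, (h, c) = self.lstm(embedded)
        return outputs, h


if __name__ == "__main__":
    # Training hyperparameters quoted in the table: ADAM, lr 1e-4, batch size 64.
    mlp = ImageFeatureMLP()
    encoder = InstructionEncoder()
    optimizer = torch.optim.Adam(
        list(mlp.parameters()) + list(encoder.parameters()), lr=1e-4)

    features = torch.randn(64, 2176)           # one batch of raw image features
    tokens = torch.randint(0, 1000, (64, 80))  # padded instruction token ids
    projected = mlp(features)                  # (64, 1024)
    context, _ = encoder(tokens)               # (64, 80, 512)
    print(projected.shape, context.shape)
```

The panoramic action space, the textual and visual grounding attention, and the progress monitor that define the self-monitoring agent are not reproduced here; for those, the linked repository is the authoritative reference.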