Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks

Authors: Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The current version achieves the unseen environment's success rate of 4.45% with a single view, which is further improved to 8.37% with multiple views. ... We follow the standard procedure of ALFRED; 25,743 language directives over 8,055 expert demonstration episodes are split into the training, validation, and test sets. ... Table 1 shows the results. It is seen that our method shows significant improvement over the previous methods ... We conduct an ablation test to validate the effectiveness of the components by incrementally adding each component to the proposed model. The results are shown in Table 3.
Researcher Affiliation | Academia | Van-Quang Nguyen (1), Masanori Suganuma (2,1), Takayuki Okatani (1,2); (1) Graduate School of Information Sciences, Tohoku University; (2) RIKEN Center for AIP; {quang,suganuma,okatani}@vision.is.tohoku.ac.jp
Pseudocode | No | The paper describes the components and their functions but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions 'Our Arxiv version is available: https://arxiv.org/abs/2106.00596' and 'The ALFRED Challenge 2020 https://askforalfred.com/EVAL', but neither provides concrete access to the source code for the methodology described in this paper.
Open Datasets | Yes | To consider more complex tasks, a benchmark named ALFRED was developed recently [Shridhar et al., 2020]. It requires an agent to accomplish a household task in interactive environments following given language directives. ... ALFRED is built upon AI2Thor [Kolve et al., 2017], a simulation environment for embodied AI. ... Dataset. We follow the standard procedure of ALFRED; 25,743 language directives over 8,055 expert demonstration episodes are split into the training, validation, and test sets. The latter two are further divided into two splits, called seen and unseen, depending on whether the scenes are included in the training set.
Dataset Splits | Yes | We follow the standard procedure of ALFRED; 25,743 language directives over 8,055 expert demonstration episodes are split into the training, validation, and test sets. The latter two are further divided into two splits, called seen and unseen, depending on whether the scenes are included in the training set. (A split-counting sketch appears after this table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions software components like 'Mask R-CNN' and 'ResNet-50 backbone' but does not provide specific version numbers for any software dependencies. (A detector-instantiation sketch appears after this table.)
Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 10^-3, which is halved at epochs 5, 8, and 10, and a batch size of 32 for 15 epochs in total. We use dropout with probability 0.2 for both the visual features and the LSTM decoder hidden states. (A training-configuration sketch appears after this table.)
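The Open Datasets and Dataset Splits rows describe ALFRED's standard partitioning: 25,743 language directives over 8,055 expert demonstration episodes (roughly 3.2 directives per episode), with the validation and test sets each divided into seen and unseen scenes. Below is a minimal sketch of counting episodes per partition from a split file; the file name and key names are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch: count episodes per partition in an ALFRED-style split file.
# The file name "splits.json" and the partition keys are illustrative assumptions.
import json
from collections import Counter

with open("splits.json") as f:
    splits = json.load(f)  # assumed layout: {partition_name: [episode records, ...]}

counts = Counter({name: len(episodes) for name, episodes in splits.items()})
for name, n in counts.most_common():
    print(f"{name}: {n} episodes")

# With ALFRED's convention, one would expect separate partitions such as
# train, valid_seen / valid_unseen, and tests_seen / tests_unseen.
```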
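The Software Dependencies row notes that the paper names Mask R-CNN with a ResNet-50 backbone but no library or version. The sketch below shows how such a detector could be instantiated, assuming the torchvision implementation; the paper does not state which implementation or pretrained weights were used, which is exactly the detail a version pin would resolve.

```python
# Hypothetical sketch: Mask R-CNN with a ResNet-50 FPN backbone via torchvision.
# The paper does not confirm torchvision; this is an assumption for illustration.
import torch
import torchvision

# torchvision >= 0.13 uses the `weights` argument; older releases used `pretrained`.
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

# A single 300x300 RGB frame stands in for an AI2-THOR egocentric observation.
frame = torch.rand(3, 300, 300)
with torch.no_grad():
    predictions = detector([frame])  # list with one dict: boxes, labels, scores, masks
print(predictions[0]["boxes"].shape)
```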
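The Experiment Setup row gives the optimizer, learning-rate schedule, dropout, batch size, and epoch count. The following is a minimal PyTorch sketch reproducing just those settings; the model, data, and loss are placeholders, not the paper's architecture. Halving at epochs 5, 8, and 10 takes the learning rate from 1e-3 to 5e-4, 2.5e-4, and finally 1.25e-4.

```python
# Minimal sketch of the reported training configuration; model/data/loss are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(512, 256), nn.Dropout(p=0.2), nn.Linear(256, 12))  # dropout 0.2 as reported
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 12, (1024,)))       # placeholder data
loader = DataLoader(dataset, batch_size=32, shuffle=True)                             # batch size 32

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                             # Adam, initial lr 10^-3
# "Halved at epochs 5, 8, and 10" maps onto MultiStepLR with gamma=0.5.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 8, 10], gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(15):                                                                # 15 epochs in total
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```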