An Attention-based Regression Model for Grounding Textual Phrases in Images

Authors: Ko Endo, Masaki Aono, Eric Nichols, Kotaro Funakoshi

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Despite the challenging nature of this task and sparsity of available data, in evaluation on the ReferIt dataset, our proposed method achieves a new state-of-the-art in performance of 37.26% accuracy, surpassing the previously reported best by over 5 percentage points. We performed hyper-parameter optimization using random search. The hyper-parameters and final settings are shown in Table 1. In this section we present comparative evaluation against SCRC [Hu et al., 2016b] and the current state-of-the-art method, GroundeR [Rohrbach et al., 2016].
Researcher Affiliation | Collaboration | Ko Endo and Masaki Aono, Toyohashi University of Technology {k-endo@kde.cs.tut.ac.jp, aono@tut.jp}; Eric Nichols and Kotaro Funakoshi, Honda Research Institute Japan {e.nichols,funakoshi}@jp.honda-ri.com
Pseudocode | No | The paper includes Figure 1, which is an "Overview of our proposed method" presented as a block diagram, not structured pseudocode or an algorithm block.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described, nor does it state that the code will be made available.
Open Datasets | Yes | We used the ReferIt dataset [Kazemzadeh et al., 2014] for both training and evaluation. This dataset consists of three parts: images, regions inside each image, and captions for each region. There are 20,000 total images, taken from the IAPR TC-12 dataset [Grubinger et al., 2006]. The regions come from the SAIAPR TC-12 dataset [Escalante et al., 2010].
Dataset Splits | Yes | We employed the same data splits as Hu et al. [2016b]: 9,000 images for training, 1,000 for validation, and 10,000 for testing to facilitate comparisons with prior approaches. (A hedged sketch of applying this split appears after the table.)
Hardware Specification | No | The paper mentions using the VGG 16-layer model but does not provide specific details about the hardware (e.g., GPU models, CPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions using LSTMs, Adam for SGD, and GloVe embeddings, but it does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | We performed hyper-parameter optimization using random search. The hyper-parameters and final settings are shown in Table 1. The size of each hidden layer is sampled in increments of 50. However, we selected the loss function weights from preliminary experiments because they have more influence over the training than other hyper-parameters. We fix the word embedding size to facilitate comparison between different embeddings and use the same 8,800-word vocabulary as [Hu et al., 2016b]. We train the model with back propagation using Adam [Kingma and Ba, 2014] for SGD, with the authors' recommended values of hyper-parameters. (A hedged sketch of this setup appears after the table.)
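
The Dataset Splits row above reports a fixed 9,000/1,000/10,000 image split reused from Hu et al. [2016b]. The following is a minimal sketch of loading and sanity-checking such a split; the split-list file names and directory layout are assumptions, not something stated in the paper.

```python
# Hedged sketch: load a fixed ReferIt train/val/test image-id split and verify
# the 9,000 / 1,000 / 10,000 sizes reported in the paper. File names below are
# assumptions; the actual split lists come from Hu et al. [2016b].
from pathlib import Path

def load_referit_splits(split_dir):
    expected_sizes = {"train": 9000, "val": 1000, "test": 10000}
    splits = {}
    for name, expected in expected_sizes.items():
        # Hypothetical layout: one image id per line, e.g. referit_train_imlist.txt
        ids = Path(split_dir, f"referit_{name}_imlist.txt").read_text().split()
        assert len(ids) == expected, f"{name} split has {len(ids)} ids, expected {expected}"
        splits[name] = ids
    return splits
```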
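
The Experiment Setup row describes random search over hyper-parameters (hidden-layer sizes sampled in increments of 50, loss-function weights fixed from preliminary experiments) and training with Adam at its recommended defaults. Below is a minimal sketch of such a search loop; the search ranges, parameter names, and trial count are assumptions, and only the increment-of-50 sampling and the Adam defaults follow the paper.

```python
# Hedged sketch of random-search hyper-parameter sampling and Adam defaults.
# Ranges, parameter names, and the number of trials are assumptions; the paper
# only states that hidden sizes are sampled in increments of 50, loss weights
# are fixed from preliminary experiments, and Adam uses its recommended values.
import random

def sample_hyperparameters(rng):
    return {
        "lstm_hidden_size": rng.randrange(50, 1050, 50),       # multiples of 50 (assumed range)
        "attention_hidden_size": rng.randrange(50, 1050, 50),  # multiples of 50 (assumed range)
        "dropout": rng.choice([0.0, 0.25, 0.5]),               # assumed candidate values
        # Loss-function weights are not searched: fixed from preliminary experiments.
    }

ADAM_DEFAULTS = {"lr": 1e-3, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8}  # Kingma & Ba [2014]

rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(20)]  # assumed number of trials
```

Each sampled configuration would then be trained with back propagation using Adam and compared on the 1,000-image validation split before reporting test accuracy.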