Using Syntax to Ground Referring Expressions in Natural Images

Authors: Volkan Cirik, Taylor Berg-Kirkpatrick, Louis-Philippe Morency

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that GroundNet achieves state-of-the-art accuracy in identifying supporting objects, while maintaining comparable performance in the localization of target objects. Using these additional annotations, our empirical evaluations demonstrate that GroundNet substantially outperforms the state-of-the-art at intermediate predictions of the supporting objects, yet maintains comparable accuracy at target object localization.
Researcher Affiliation | Academia | Volkan Cirik, Taylor Berg-Kirkpatrick, Louis-Philippe Morency; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; {vcirik,tberg,morency}@cs.cmu.edu
Pseudocode | Yes | Algorithm 1: Generate Computation Graph (a hedged sketch of this idea appears after the table)
Open Source Code | Yes | Our annotations for supporting objects and implementations are available for public use: https://github.com/volkancirik/groundnet
Open Datasets | Yes | We use the standard Google-Ref (Mao et al. 2016) benchmark for our experiments. We additionally present a new set of annotations on the Google-Ref dataset. Our annotations for supporting objects and implementations are available for public use.
Dataset Splits | Yes | best validation split, which is 2.5% of the training data separated from the training split (a split sketch appears after the table)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) were mentioned for running experiments.
Software Dependencies | No | The paper mentions software components like GloVe, Faster-RCNN, the VGG-16 network, the Stanford Parser, LSTMs, and Xavier initialization, but no specific version numbers are provided for any of these dependencies.
Experiment Setup | Yes | We trained GroundNet with backpropagation. We used stochastic gradient descent for 6 epochs with an initial learning rate of 0.01, multiplied by 0.4 after each epoch. The hidden layer size of the LSTM networks was searched over the range {64, 128, ..., 1024} and picked based on the best validation split, which is 2.5% of the training data separated from the training split. We initialized all parameters of the model with Xavier initialization (Glorot and Bengio 2010) and used a weight decay rate of 0.0005 as regularization. (A code sketch of this schedule follows the table.)
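
The Pseudocode row refers to Algorithm 1 (Generate Computation Graph), which maps the constituency parse of a referring expression to a computation graph of neural modules. Since the algorithm itself is not quoted above, here is a minimal, hypothetical sketch of that idea; the module names Locate and Combine, the flat node list, and the nltk-based tree handling are all assumptions, not the paper's algorithm verbatim.

```python
# Hypothetical sketch of generating a computation graph from a
# constituency parse; module names and graph layout are assumptions,
# not the paper's Algorithm 1 verbatim.
from nltk.tree import Tree

def generate_computation_graph(node, graph=None):
    """Recursively map a parse tree to a flat list of (module, arg, children)."""
    if graph is None:
        graph = []
    if isinstance(node, str):  # leaf token -> a grounding module over the word
        graph.append(("Locate", node, []))
        return len(graph) - 1, graph
    child_ids = []
    for child in node:  # recurse into the constituents of this node
        cid, graph = generate_computation_graph(child, graph)
        child_ids.append(cid)
    # internal constituent -> a composition module over its children
    graph.append(("Combine", node.label(), child_ids))
    return len(graph) - 1, graph

tree = Tree.fromstring(
    "(NP (NP (DT the) (NN man)) (PP (IN on) (NP (DT the) (NN bench))))"
)
root, graph = generate_computation_graph(tree)
for i, (module, arg, children) in enumerate(graph):
    print(i, module, arg, children)
```

The graph is emitted bottom-up, so each module's inputs already exist in the list by the time the module itself is appended.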
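
The 2.5% validation split in the Dataset Splits row could be carved out as follows. This is a sketch under the assumption of a simple random split; the placeholder list stands in for the actual Google-Ref training data.

```python
# Sketch of separating a 2.5% validation split from the training data.
# The random split and the placeholder data are assumptions.
import random

train_examples = ["example"] * 1000  # placeholder for Google-Ref training examples

random.seed(0)
indices = list(range(len(train_examples)))
random.shuffle(indices)
n_val = int(0.025 * len(indices))  # 2.5% held out for validation
val_set = [train_examples[i] for i in indices[:n_val]]
train_set = [train_examples[i] for i in indices[n_val:]]
```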
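
Finally, for the Experiment Setup row, the following is a minimal sketch of the reported optimization schedule, assuming a PyTorch implementation. The stand-in LSTM and the input size of 300 (a common GloVe dimensionality) are assumptions; the learning rate, decay factor, epoch count, weight decay, and Xavier initialization are taken from the quoted setup.

```python
# Minimal sketch of the reported training schedule, assuming PyTorch.
# The stand-in LSTM and input size are assumptions; the hyperparameters
# below are the ones quoted in the Experiment Setup row.
import torch
import torch.nn as nn

# Hidden size was searched over {64, 128, ..., 1024} on the 2.5% validation split.
model = nn.LSTM(input_size=300, hidden_size=512)

# Xavier initialization for all weight matrices (Glorot and Bengio 2010).
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)

# SGD with an initial learning rate of 0.01 and weight decay 0.0005.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0005)
# Multiply the learning rate by 0.4 after each epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.4)

for epoch in range(6):
    # ... iterate over training batches here: forward pass, loss.backward(),
    # optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()
```

Calling `scheduler.step()` once per epoch applies the 0.4 multiplier at epoch boundaries, matching the described schedule.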