Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

Authors: Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval on a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators. We demonstrate the effectiveness of our approach on the Visual Genome dataset [20] in two scenarios: automatic image retrieval using region captions as queries, and interactive image retrieval with real queries from human evaluators. In both cases, our experimental results show that the proposed model outperforms existing methods, such as a hierarchical recurrent encoder model [29], while using less computational budget.
Researcher Affiliation | Collaboration | Fuwen Tan (University of Virginia, fuwen.tan@virginia.edu); Paola Cascante-Bonilla (University of Virginia, pc9za@virginia.edu); Xiaoxiao Guo (IBM Research AI, xiaoxiao.guo@ibm.com); Hui Wu (IBM Research AI, wuhu@us.ibm.com); Song Feng (IBM Research AI, sfeng@us.ibm.com); Vicente Ordonez (University of Virginia, vicente@virginia.edu)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at https://github.com/uvavision/DrillDown
Open Datasets | Yes | We evaluate the performance of our method on the Visual Genome dataset [20].
Dataset Splits | Yes | This preprocessing results in 105,414 image samples, which are further split into 92,105/5,000/9,896 for training/validation/testing.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions software like 'Faster RCNN detector', 'ResNet152', and 'Adam' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | In particular, all the models are trained with 10-turn queries (T = 10). For each image, we extract the top 36 regions (N = 36) detected by a pretrained Faster RCNN model, following [1]. Each embedded word vector has a dimension of 300 (E = 300). In all our experiments, we set the temperature parameter σ to 9, the margin parameter α to 0.2, the discount factor γ to 1.0, and the trade-off factor µ to 0.1. For optimization, we use Adam [16] with an initial learning rate of 2e-4 and a batch size of 128. We clip the gradients in the back-propagation such that the norm of the gradients is not larger than 10. All models are trained with at most 300 epochs, validated after each epoch.
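
For concreteness, below is a minimal sketch of how the quoted training hyperparameters could be wired into a single PyTorch optimization step. The stand-in encoder, the triplet margin loss, and the random tensors are illustrative assumptions for this reproducibility summary, not the authors' architecture or objective; only the constants and the Adam/gradient-clipping settings come from the quoted setup.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the "Experiment Setup" row above.
T = 10        # query turns per training sample
N = 36        # top Faster R-CNN regions kept per image
E = 300       # word-embedding dimension
SIGMA = 9.0   # temperature parameter (used by the full model, not this stand-in)
ALPHA = 0.2   # margin parameter
GAMMA = 1.0   # discount factor (full model only)
MU = 0.1      # trade-off factor (full model only)

# Hypothetical stand-in encoder, only so the optimization step below runs end to end.
model = nn.Sequential(nn.Linear(E, 256), nn.ReLU(), nn.Linear(256, 128))

# Adam with the reported initial learning rate of 2e-4; the paper uses a batch size of 128.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# A margin-based ranking loss as a placeholder for the paper's retrieval objective.
criterion = nn.TripletMarginLoss(margin=ALPHA)

# One illustrative training step on random tensors standing in for query/image features.
anchor, positive, negative = (torch.randn(128, E) for _ in range(3))
loss = criterion(model(anchor), model(positive), model(negative))

optimizer.zero_grad()
loss.backward()
# Clip gradients so their overall norm never exceeds 10, as stated in the setup.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```

In the reported protocol this step would be repeated over the training split for at most 300 epochs, with validation after each epoch.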