Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries
Authors: Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach on the Visual Genome dataset [20] in two scenarios: automatic image retrieval using region captions as queries, and interactive image retrieval with real queries from human evaluators. In both cases, our experimental results show that the proposed model outperforms existing methods, such as a hierarchical recurrent encoder model [29], while using less computational budget. |
| Researcher Affiliation | Collaboration | Fuwen Tan, University of Virginia, fuwen.tan@virginia.edu; Paola Cascante-Bonilla, University of Virginia, pc9za@virginia.edu; Xiaoxiao Guo, IBM Research AI, xiaoxiao.guo@ibm.com; Hui Wu, IBM Research AI, wuhu@us.ibm.com; Song Feng, IBM Research AI, sfeng@us.ibm.com; Vicente Ordonez, University of Virginia, vicente@virginia.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/uvavision/DrillDown |
| Open Datasets | Yes | We evaluate the performance of our method on the Visual Genome dataset [20]. |
| Dataset Splits | Yes | This preprocessing results in 105,414 image samples, which are further split into 92,105/5,000/9,896 for training/validation/testing. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as a Faster R-CNN detector, ResNet152, and Adam, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In particular, all the models are trained with 10-turn queries (T = 10). For each image, we extract the top 36 regions (N = 36) detected by a pretrained Faster R-CNN model, following [1]. Each embedded word vector has a dimension of 300 (E = 300). In all our experiments, we set the temperature parameter σ to 9, the margin parameter α to 0.2, the discount factor γ to 1.0, and the trade-off factor µ to 0.1. For optimization, we use Adam [16] with an initial learning rate of 2e-4 and a batch size of 128. We clip the gradients in the back-propagation such that the norm of the gradients is not larger than 10. All models are trained for at most 300 epochs, validated after each epoch. (A hedged configuration sketch based on these reported values follows the table.) |
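For concreteness, below is a minimal sketch, assuming PyTorch, of how the hyperparameters reported in the Experiment Setup row might be collected and applied. Only the numeric values (T, N, E, σ, α, γ, µ, the learning rate, batch size, gradient-clipping norm, epoch budget, and the dataset split sizes) come from the paper; the `DrillDownConfig` class and the helper functions are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical configuration/training sketch. Only the numeric values below
# are taken from the paper; all class and function names are placeholders.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class DrillDownConfig:
    # Values reported in the paper's experiment setup.
    num_turns: int = 10          # T: query turns per retrieval episode
    num_regions: int = 36        # N: top regions from a pretrained Faster R-CNN
    embed_dim: int = 300         # E: word-embedding dimension
    temperature: float = 9.0     # sigma: temperature parameter
    margin: float = 0.2          # alpha: margin parameter
    discount: float = 1.0        # gamma: discount factor
    trade_off: float = 0.1       # mu: trade-off factor
    lr: float = 2e-4             # Adam initial learning rate
    batch_size: int = 128
    grad_clip_norm: float = 10.0 # max gradient norm in back-propagation
    max_epochs: int = 300        # validated after each epoch
    # Visual Genome preprocessing yields 105,414 images,
    # split 92,105 / 5,000 / 9,896 for train / val / test.


def make_optimizer(model: nn.Module, cfg: DrillDownConfig) -> torch.optim.Adam:
    """Adam optimizer with the paper's reported initial learning rate."""
    return torch.optim.Adam(model.parameters(), lr=cfg.lr)


def training_step(model: nn.Module,
                  optimizer: torch.optim.Optimizer,
                  loss: torch.Tensor,
                  cfg: DrillDownConfig) -> None:
    """One optimization step with gradient-norm clipping at 10."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.grad_clip_norm)
    optimizer.step()
```

This sketch only encodes the settings the paper states explicitly; the model architecture, the loss that combines the margin and trade-off terms, and the data pipeline would come from the authors' repository linked above.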