Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
Authors: Raymond Yeh, Jinjun Xiong, Wen-Mei Hwu, Minh Do, Alexander Schwing
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | At the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the Refer It Game dataset by 3.08% and 7.77% respectively. |
| Researcher Affiliation | Collaboration | Raymond A. Yeh, Jinjun Xiong, Wen-mei W. Hwu, Minh N. Do, Alexander G. Schwing Department of Electrical Engineering, University of Illinois at Urbana-Champaign IBM Thomas J. Watson Research Center yeh17@illinois.edu, jinjun@us.ibm.com, w-hwu@illinois.edu, minhdo@illinois.edu, aschwing@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Branch and bound inference for grounding |
| Open Source Code | No | The paper mentions 'Our C++ implementation' but does not provide any specific link or explicit statement about releasing the code for the methodology described. |
| Open Datasets | Yes | We evaluate our proposed approach on the challenging Refer It Game [20] and the Flickr 30k Entities dataset [35] |
| Dataset Splits | Yes | For the Refer It Game, we use the same bounding boxes as [38] and the same training/test set split, i.e., 10,000 images for testing, 9,000 images for training and 1,000 images for validation. For the Flickr 30k Entities, we use the same training, validation and testing split as in [35]. |
| Hardware Specification | No | The paper mentions 'on a CPU' and 'on a GPU' but does not specify any particular models or detailed hardware specifications. |
| Software Dependencies | No | The paper mentions software systems like 'Deep Lab system [4]' and 'YOLO object detection system [37]' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For segmentation, we use the Deep Lab system [4], trained on the PASCAL VOC-2012 [8] semantic image segmentation task, to extract the probability maps for 21 categories. For detection, we use the YOLO object detection system [37] to extract 101 categories, 21 trained on PASCAL VOC-2012, and 80 trained on MSCOCO [28]. The feature maps are resized to a dimension of 64 x 64 for efficient computation, and the predicted box is scaled back to the original image dimension during evaluation. We re-center the prediction box by a constant amount determined using the validation set, as resizing truncates box coordinates to integers. |
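The paper's Algorithm 1 is a branch-and-bound search over bounding boxes. The sketch below illustrates the general technique in the style of efficient subwindow search: candidate box sets are described by coordinate intervals, an admissible upper bound is computed from integral images of the positive and negative parts of a score map, and a best-first priority queue refines sets until a single, globally optimal box remains. This is a generic illustration, not the paper's C++ implementation; the function names and the use of a raw score map are our assumptions.

```python
import heapq
import numpy as np

def _integral(a):
    # zero-padded 2-D integral image, so box sums need no boundary checks
    return np.pad(a.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

def _box_sum(I, t, b, l, r):
    # sum over the inclusive box [t..b] x [l..r] via the integral image
    return I[b + 1, r + 1] - I[t, r + 1] - I[b + 1, l] + I[t, l]

def ess_argmax(score):
    """Branch-and-bound search for the box maximizing the sum of
    `score` inside it (illustrative sketch, not the paper's code).

    A box set is (t_lo, t_hi, b_lo, b_hi, l_lo, l_hi, r_lo, r_hi):
    intervals for the top, bottom, left and right coordinates.
    """
    H, W = score.shape
    Ip = _integral(np.maximum(score, 0.0))  # positive part
    In = _integral(np.minimum(score, 0.0))  # negative part

    def bound(s):
        t_lo, t_hi, b_lo, b_hi, l_lo, l_hi, r_lo, r_hi = s
        # the largest box in the set can collect all positive mass ...
        ub = _box_sum(Ip, t_lo, b_hi, l_lo, r_hi)
        # ... while every box must keep the smallest box's negative mass
        if t_hi <= b_lo and l_hi <= r_lo:
            ub += _box_sum(In, t_hi, b_lo, l_hi, r_lo)
        return ub

    start = (0, H - 1, 0, H - 1, 0, W - 1, 0, W - 1)
    heap = [(-bound(start), start)]
    while heap:
        neg_ub, s = heapq.heappop(heap)
        widths = [s[1] - s[0], s[3] - s[2], s[5] - s[4], s[7] - s[6]]
        i = int(np.argmax(widths))
        if widths[i] == 0:
            # all intervals are singletons: this box is globally optimal,
            # because its exact score beats every remaining upper bound
            return (s[0], s[2], s[4], s[6]), -neg_ub
        # branch: split the widest interval in half and re-bound both halves
        lo, hi = s[2 * i], s[2 * i + 1]
        mid = (lo + hi) // 2
        for new_lo, new_hi in ((lo, mid), (mid + 1, hi)):
            c = list(s)
            c[2 * i], c[2 * i + 1] = new_lo, new_hi
            c = tuple(c)
            # discard sets that contain no valid box (top > bottom, etc.)
            if c[0] <= c[3] and c[4] <= c[7]:
                heapq.heappush(heap, (-bound(c), c))
```

Because the bound never underestimates the best box in a set, the first singleton popped from the queue is a certified global optimum, which is what makes the prediction "globally optimal" rather than greedy.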
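The experiment setup describes predicting on 64 x 64 feature maps and scaling boxes back to the original image dimension, with a constant re-centering offset tuned on the validation set to compensate for truncation. A minimal sketch of that mapping, where the function name and the `shift` default are our placeholders (the paper does not report the tuned constant):

```python
def rescale_box(box, img_w, img_h, feat=64, shift=0.0):
    """Map a box predicted on a `feat` x `feat` grid back to image pixels.

    `shift` stands in for the constant re-centering offset the paper
    determines on the validation set; its value here is a placeholder.
    """
    t, b, l, r = box  # inclusive cell indices on the feature grid
    sy, sx = img_h / feat, img_w / feat
    # shift by a constant number of cells, then scale to pixel coordinates
    return ((t + shift) * sy, (b + 1 + shift) * sy,
            (l + shift) * sx, (r + 1 + shift) * sx)
```

For example, the full-grid box `(0, 63, 0, 63)` with `shift=0.0` maps back to the full 640 x 480 image.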