Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
Authors: Raymond Yeh, Jinjun Xiong, Wen-Mei Hwu, Minh Do, Alexander Schwing
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | At the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the Refer It Game dataset by 3.08% and 7.77% respectively. |
| Researcher Affiliation | Collaboration | Raymond A. Yeh, Jinjun Xiong, Wen-mei W. Hwu, Minh N. Do, Alexander G. Schwing Department of Electrical Engineering, University of Illinois at Urbana-Champaign IBM Thomas J. Watson Research Center yeh17@illinois.edu, jinjun@us.ibm.com, w-hwu@illinois.edu, minhdo@illinois.edu, aschwing@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Branch and bound inference for grounding |
| Open Source Code | No | The paper mentions 'Our C++ implementation' but does not provide any specific link or explicit statement about releasing the code for the methodology described. |
| Open Datasets | Yes | We evaluate our proposed approach on the challenging Refer It Game [20] and the Flickr 30k Entities dataset [35] |
| Dataset Splits | Yes | For the Refer It Game, we use the same bounding boxes as [38] and the same training/test set split, i.e., 10,000 images for testing, 9,000 images for training and 1,000 images for validation. For the Flickr 30k Entities, we use the same training, validation and testing split as in [35]. |
| Hardware Specification | No | The paper mentions 'on a CPU' and 'on a GPU' but does not specify any particular models or detailed hardware specifications. |
| Software Dependencies | No | The paper mentions software systems like 'Deep Lab system [4]' and 'YOLO object detection system [37]' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For segmentation, we use the Deep Lab system [4], trained on the PASCAL VOC-2012 [8] semantic image segmentation task, to extract the probability maps for 21 categories. For detection, we use the YOLO object detection system [37] to extract 101 categories, 21 trained on PASCAL VOC-2012, and 80 trained on MSCOCO [28]. The feature maps are resized to a dimension of 64 x 64 for efficient computation, and the predicted box is scaled back to the original image dimension during evaluation. We re-center the prediction box by a constant amount determined using the validation set, as resizing truncates box coordinates to integers. |
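The paper's Algorithm 1 is a branch-and-bound search over bounding boxes. The sketch below illustrates the general technique in the style of efficient subwindow search: candidate box sets are described by coordinate intervals, an admissible upper bound is computed from integral images of the positive and negative parts of a score map, and a best-first priority queue refines sets until a single, globally optimal box remains. This is a generic illustration, not the paper's C++ implementation; the function names and the use of a raw score map are our assumptions.

```python
import heapq
import numpy as np

def _integral(a):
    # zero-padded 2-D integral image, so box sums need no boundary checks
    return np.pad(a.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

def _box_sum(I, t, b, l, r):
    # sum over the inclusive box [t..b] x [l..r] via the integral image
    return I[b + 1, r + 1] - I[t, r + 1] - I[b + 1, l] + I[t, l]

def ess_argmax(score):
    """Branch-and-bound search for the box maximizing the sum of
    `score` inside it (illustrative sketch, not the paper's code).

    A box set is (t_lo, t_hi, b_lo, b_hi, l_lo, l_hi, r_lo, r_hi):
    intervals for the top, bottom, left and right coordinates.
    """
    H, W = score.shape
    Ip = _integral(np.maximum(score, 0.0))  # positive part
    In = _integral(np.minimum(score, 0.0))  # negative part

    def bound(s):
        t_lo, t_hi, b_lo, b_hi, l_lo, l_hi, r_lo, r_hi = s
        # the largest box in the set can collect all positive mass ...
        ub = _box_sum(Ip, t_lo, b_hi, l_lo, r_hi)
        # ... while every box must keep the smallest box's negative mass
        if t_hi <= b_lo and l_hi <= r_lo:
            ub += _box_sum(In, t_hi, b_lo, l_hi, r_lo)
        return ub

    start = (0, H - 1, 0, H - 1, 0, W - 1, 0, W - 1)
    heap = [(-bound(start), start)]
    while heap:
        neg_ub, s = heapq.heappop(heap)
        widths = [s[1] - s[0], s[3] - s[2], s[5] - s[4], s[7] - s[6]]
        i = int(np.argmax(widths))
        if widths[i] == 0:
            # all intervals are singletons: this box is globally optimal,
            # because its exact score beats every remaining upper bound
            return (s[0], s[2], s[4], s[6]), -neg_ub
        # branch: split the widest interval in half and re-bound both halves
        lo, hi = s[2 * i], s[2 * i + 1]
        mid = (lo + hi) // 2
        for new_lo, new_hi in ((lo, mid), (mid + 1, hi)):
            c = list(s)
            c[2 * i], c[2 * i + 1] = new_lo, new_hi
            c = tuple(c)
            # discard sets that contain no valid box (top > bottom, etc.)
            if c[0] <= c[3] and c[4] <= c[7]:
                heapq.heappush(heap, (-bound(c), c))
```

Because the bound never underestimates the best box in a set, the first singleton popped from the queue is a certified global optimum, which is what makes the prediction "globally optimal" rather than greedy.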
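The experiment setup describes predicting on 64 x 64 feature maps and scaling boxes back to the original image dimension, with a constant re-centering offset tuned on the validation set to compensate for truncation. A minimal sketch of that mapping, where the function name and the `shift` default are our placeholders (the paper does not report the tuned constant):

```python
def rescale_box(box, img_w, img_h, feat=64, shift=0.0):
    """Map a box predicted on a `feat` x `feat` grid back to image pixels.

    `shift` stands in for the constant re-centering offset the paper
    determines on the validation set; its value here is a placeholder.
    """
    t, b, l, r = box  # inclusive cell indices on the feature grid
    sy, sx = img_h / feat, img_w / feat
    # shift by a constant number of cells, then scale to pixel coordinates
    return ((t + shift) * sy, (b + 1 + shift) * sy,
            (l + shift) * sx, (r + 1 + shift) * sx)
```

For example, the full-grid box `(0, 63, 0, 63)` with `shift=0.0` maps back to the full 640 x 480 image.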