High-Order Attention Models for Visual Question Answering

Authors: Idan Schwartz, Alexander Schwing, Tamir Hazan

Venue: NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset. Tab. 1 shows the performance of our model and the baselines on the test-dev and the test-standard datasets for multiple choice (MC) questions."
Researcher Affiliation | Academia | "Idan Schwartz, Department of Computer Science, Technion, idansc@cs.technion.ac.il; Alexander G. Schwing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, aschwing@illinois.edu; Tamir Hazan, Department of Industrial Engineering & Management, Technion, tamir.hazan@gmail.com"
Pseudocode | No | "The paper does not contain structured pseudocode or algorithm blocks."
Open Source Code | Yes | "We implemented our models using the Torch framework [5]." Code: https://github.com/idansc/HighOrderAtten
Open Datasets | Yes | "We evaluate our attention modules on the VQA real-image test-dev and test-std datasets [2]. The dataset consists of 123,287 training images and 81,434 test set images."
Dataset Splits | No | "Evaluating on the val dataset while training on the train part using the VGG features, the MCT setup yields 63.82% whereas 2-layer MCB yields 64.57%. The dataset consists of 123,287 training images and 81,434 test set images."
Hardware Specification | Yes | "Our approach (Fig. 2) for the multiple choice answering task achieved the reported result after 180,000 iterations, which requires about 40 hours of training on the train+val dataset using a Titan X GPU."
Software Dependencies | No | "We implemented our models using the Torch framework [5]."
Experiment Setup | Yes | "We use the RMSProp optimizer with a base learning rate of 4e-4 and α = 0.99 as well as ϵ = 1e-8. The batch size is set to 300. The dimension d of all hidden layers is set to 512. The MCB unit feature dimension was set to d = 8192. We apply dropout with a rate of 0.5 after the word embeddings, the LSTM layer, and the first conv layer in the unary potential units. Additionally, for the last fully connected layer we use a dropout rate of 0.3." (see the configuration sketches below)
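
The reported hyperparameters translate directly into a training configuration. The sketch below is a minimal PyTorch rendering of those settings; the paper's implementation used (Lua) Torch, so this is an illustrative equivalent, and the QuestionEncoder module, its vocabulary size, and the output head are hypothetical stand-ins rather than the paper's architecture.

import torch
import torch.nn as nn

d = 512  # dimension of all hidden layers, as reported

class QuestionEncoder(nn.Module):
    """Hypothetical encoder illustrating the reported dropout placement:
    rate 0.5 after the word embedding and after the LSTM layer,
    rate 0.3 before the last fully connected layer."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.drop_embed = nn.Dropout(0.5)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.drop_lstm = nn.Dropout(0.5)
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(d, 1000),  # output size is illustrative
        )

    def forward(self, tokens):
        x = self.drop_embed(self.embed(tokens))
        out, _ = self.lstm(x)
        return self.classifier(self.drop_lstm(out[:, -1]))  # last hidden state

model = QuestionEncoder(vocab_size=12000)  # vocabulary size is illustrative

# RMSProp with the reported settings: base lr 4e-4, alpha = 0.99, eps = 1e-8.
optimizer = torch.optim.RMSprop(model.parameters(), lr=4e-4, alpha=0.99, eps=1e-8)
batch_size = 300  # reported batch size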
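
The 8192-dimensional MCB unit quoted above refers to Multimodal Compact Bilinear pooling (Fukui et al., 2016), which approximates the outer product of two feature vectors via count sketches combined with FFT-based convolution. A minimal sketch follows, assuming PyTorch; the class name, interface, and input dimensions are my own illustration, not the authors' code.

import torch
import torch.nn as nn

class CompactBilinearPooling(nn.Module):
    """Sketch of MCB pooling: the FFT convolution of two count sketches
    equals, in expectation, the count sketch of the outer product."""
    def __init__(self, in_v, in_q, d=8192):
        super().__init__()
        self.d = d
        # Fixed random hash indices and signs define the two count sketches.
        self.register_buffer("h_v", torch.randint(d, (in_v,)))
        self.register_buffer("h_q", torch.randint(d, (in_q,)))
        self.register_buffer("s_v", torch.randint(0, 2, (in_v,)).float() * 2 - 1)
        self.register_buffer("s_q", torch.randint(0, 2, (in_q,)).float() * 2 - 1)

    def _sketch(self, x, h, s):
        out = x.new_zeros(x.size(0), self.d)
        return out.index_add_(1, h, x * s)  # scatter signed features into d bins

    def forward(self, v, q):
        # O(d log d) via FFT instead of forming the O(d^2) outer product.
        fv = torch.fft.rfft(self._sketch(v, self.h_v, self.s_v))
        fq = torch.fft.rfft(self._sketch(q, self.h_q, self.s_q))
        return torch.fft.irfft(fv * fq, n=self.d)

For example, CompactBilinearPooling(2048, 512)(v, q) maps a 2048-d image feature and a 512-d question feature to a single 8192-d joint feature, matching the MCB feature dimension reported in the setup.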