High-Order Attention Models for Visual Question Answering

Authors: Idan Schwartz, Alexander Schwing, Tamir Hazan

Venue: NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset. Tab. 1 shows the performance of our model and the baselines on the test-dev and the test-standard datasets for multiple choice (MC) questions."
Researcher Affiliation | Academia | "Idan Schwartz, Department of Computer Science, Technion, idansc@cs.technion.ac.il; Alexander G. Schwing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, aschwing@illinois.edu; Tamir Hazan, Department of Industrial Engineering & Management, Technion, tamir.hazan@gmail.com"
Pseudocode | No | "The paper does not contain structured pseudocode or algorithm blocks."
Open Source Code | Yes | "We implemented our models using the Torch framework [5]." Code: https://github.com/idansc/HighOrderAtten
Open Datasets | Yes | "We evaluate our attention modules on the VQA real-image test-dev and test-std datasets [2]. The dataset consists of 123,287 training images and 81,434 test set images."
Dataset Splits | No | "Evaluating on the val dataset while training on the train part using the VGG features, the MCT setup yields 63.82% whereas 2-layer MCB yields 64.57%. The dataset consists of 123,287 training images and 81,434 test set images."
Hardware Specification | Yes | "Our approach (Fig. 2) for the multiple choice answering task achieved the reported result after 180,000 iterations, which requires about 40 hours of training on the train+val dataset using a Titan X GPU."
Software Dependencies | No | "We implemented our models using the Torch framework [5]."
Experiment Setup | Yes | "We use the RMSProp optimizer with a base learning rate of 4e-4 and α = 0.99 as well as ϵ = 1e-8. The batch size is set to 300. The dimension d of all hidden layers is set to 512. The MCB unit feature dimension was set to d = 8192. We apply dropout with a rate of 0.5 after the word embeddings, the LSTM layer, and the first conv layer in the unary potential units. Additionally, for the last fully connected layer we use a dropout rate of 0.3." (see the configuration sketches below)
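
The reported hyperparameters translate directly into a training configuration. The sketch below is a minimal PyTorch rendering of those settings; the paper's implementation used (Lua) Torch, so this is an illustrative equivalent, and the QuestionEncoder module, its vocabulary size, and the output head are hypothetical stand-ins rather than the paper's architecture.

import torch
import torch.nn as nn

d = 512  # dimension of all hidden layers, as reported

class QuestionEncoder(nn.Module):
    """Hypothetical encoder illustrating the reported dropout placement:
    rate 0.5 after the word embedding and after the LSTM layer,
    rate 0.3 before the last fully connected layer."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.drop_embed = nn.Dropout(0.5)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.drop_lstm = nn.Dropout(0.5)
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(d, 1000),  # output size is illustrative
        )

    def forward(self, tokens):
        x = self.drop_embed(self.embed(tokens))
        out, _ = self.lstm(x)
        return self.classifier(self.drop_lstm(out[:, -1]))  # last hidden state

model = QuestionEncoder(vocab_size=12000)  # vocabulary size is illustrative

# RMSProp with the reported settings: base lr 4e-4, alpha = 0.99, eps = 1e-8.
optimizer = torch.optim.RMSprop(model.parameters(), lr=4e-4, alpha=0.99, eps=1e-8)
batch_size = 300  # reported batch size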
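
The 8192-dimensional MCB unit quoted above refers to Multimodal Compact Bilinear pooling (Fukui et al., 2016), which approximates the outer product of two feature vectors via count sketches combined with FFT-based convolution. A minimal sketch follows, assuming PyTorch; the class name, interface, and input dimensions are my own illustration, not the authors' code.

import torch
import torch.nn as nn

class CompactBilinearPooling(nn.Module):
    """Sketch of MCB pooling: the FFT convolution of two count sketches
    equals, in expectation, the count sketch of the outer product."""
    def __init__(self, in_v, in_q, d=8192):
        super().__init__()
        self.d = d
        # Fixed random hash indices and signs define the two count sketches.
        self.register_buffer("h_v", torch.randint(d, (in_v,)))
        self.register_buffer("h_q", torch.randint(d, (in_q,)))
        self.register_buffer("s_v", torch.randint(0, 2, (in_v,)).float() * 2 - 1)
        self.register_buffer("s_q", torch.randint(0, 2, (in_q,)).float() * 2 - 1)

    def _sketch(self, x, h, s):
        out = x.new_zeros(x.size(0), self.d)
        return out.index_add_(1, h, x * s)  # scatter signed features into d bins

    def forward(self, v, q):
        # O(d log d) via FFT instead of forming the O(d^2) outer product.
        fv = torch.fft.rfft(self._sketch(v, self.h_v, self.s_v))
        fq = torch.fft.rfft(self._sketch(q, self.h_q, self.s_q))
        return torch.fft.irfft(fv * fq, n=self.d)

For example, CompactBilinearPooling(2048, 512)(v, q) maps a 2048-d image feature and a 512-d question feature to a single 8192-d joint feature, matching the MCB feature dimension reported in the setup.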