High-Order Attention Models for Visual Question Answering
Authors: Idan Schwartz, Alexander Schwing, Tamir Hazan
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset. Tab. 1 shows the performance of our model and the baselines on the test-dev and the test-standard datasets for multiple choice (MC) questions. |
| Researcher Affiliation | Academia | Idan Schwartz Department of Computer Science Technion idansc@cs.technion.ac.il Alexander G. Schwing Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign aschwing@illinois.edu Tamir Hazan Department of Industrial Engineering & Management Technion tamir.hazan@gmail.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We implemented our models using the Torch framework [5]. https://github.com/idansc/HighOrderAtten |
| Open Datasets | Yes | We evaluate our attention modules on the VQA real-image test-dev and test-std datasets [2]. The dataset consists of 123,287 training images and 81,434 test set images. |
| Dataset Splits | No | Evaluating on the val dataset while training on the train part using the VGG features, the MCT setup yields 63.82% where 2-layer MCB yields 64.57%. The dataset consists of 123,287 training images and 81,434 test set images. |
| Hardware Specification | Yes | Our approach (Fig. 2) for the multiple choice answering task achieved the reported result after 180,000 iterations, which requires about 40 hours of training on the train+val dataset using a Titan X GPU. |
| Software Dependencies | No | We implemented our models using the Torch framework [5]. |
| Experiment Setup | Yes | We use the RMSProp optimizer with a base learning rate of 4e-4 and α = 0.99 as well as ϵ = 1e-8. The batch size is set to 300. The dimension d of all hidden layers is set to 512. The MCB unit feature dimension was set to d = 8192. We apply dropout with a rate of 0.5 after the word embeddings, the LSTM layer, and the first conv layer in the unary potential units. Additionally, for the last fully connected layer we use a dropout rate of 0.3. |
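The optimizer settings quoted in the Experiment Setup row can be made concrete with a minimal sketch of a single RMSProp update using those exact hyperparameters (lr = 4e-4, α = 0.99, ϵ = 1e-8). This is an illustrative stdlib-only re-derivation of the update rule, not the authors' Torch code; the function name `rmsprop_step` is ours.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
LR, ALPHA, EPS = 4e-4, 0.99, 1e-8

def rmsprop_step(w, g, s):
    """One RMSProp update for weights w with gradients g.

    s is the running average of squared gradients:
        s <- alpha * s + (1 - alpha) * g^2
        w <- w - lr * g / (sqrt(s) + eps)
    Returns the updated weights and running average."""
    new_s = [ALPHA * si + (1.0 - ALPHA) * gi * gi for si, gi in zip(s, g)]
    new_w = [wi - LR * gi / (math.sqrt(si) + EPS)
             for wi, gi, si in zip(w, g, new_s)]
    return new_w, new_s

# Example: first step from s = 0 with gradient 1.0.
w, s = rmsprop_step([0.5], [1.0], [0.0])
# s[0] = 0.01, so the effective step is 4e-4 / 0.1 ≈ 4e-3.
```

Batch size (300), hidden dimension (512), MCB dimension (8192), and the two dropout rates (0.5 / 0.3) are architectural settings and do not enter the update rule above.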