Hierarchical Question-Image Co-Attention for Visual Question Answering

Authors: Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA. We evaluate our proposed model on two large datasets, VQA [2] and COCO-QA [15]. We also perform ablation studies to quantify the roles of different components in our model.
Researcher Affiliation | Academia | Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh. Virginia Tech, Georgia Institute of Technology. {jiasenlu, jw2yang, dbatra, parikh}@vt.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code can be downloaded from https://github.com/jiasenlu/HieCoAttenVQA
Open Datasets | Yes | The VQA dataset [2] is the largest dataset for this problem, containing human-annotated questions and answers on the Microsoft COCO dataset [12]. The COCO-QA dataset [15] is automatically generated from captions in the Microsoft COCO dataset [12].
Dataset Splits | Yes | The VQA dataset contains 248,349 training questions, 121,512 validation questions, 244,302 testing questions, and a total of 6,141,630 question-answer pairs. For testing, we train our model on VQA train+val and report the test-dev and test-standard results from the VQA evaluation server (see the split-size sketch below the table).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | Yes | We use Torch [4] to develop our model. [4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
Experiment Setup | Yes | We use the RMSprop optimizer with a base learning rate of 4e-4, momentum 0.99, and weight decay 1e-8. We set the batch size to 300 and train for up to 256 epochs, with early stopping if the validation accuracy has not improved in the last 5 epochs. For COCO-QA, the size of the hidden layer Ws is set to 512; for VQA it is set to 1024, since VQA is a much larger dataset. All other word embedding and hidden layers were vectors of size 512. We apply dropout with probability 0.5 on each layer. (See the training-configuration sketch below the table.)
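
For quick reference, the VQA split sizes quoted in the Dataset Splits row, together with the train+val combination used for test-server evaluation, as a minimal Python sketch (the constant names are illustrative, not taken from the paper's code):

```python
# VQA v1 question counts as quoted in the paper; names are illustrative.
VQA_QUESTIONS = {"train": 248_349, "val": 121_512, "test": 244_302}

# For test-dev / test-standard results, the paper trains on train+val:
train_plus_val = VQA_QUESTIONS["train"] + VQA_QUESTIONS["val"]  # 369,861 questions
```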
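
The experiment setup maps naturally onto an optimizer configuration plus an early-stopping loop. Below is a minimal sketch in PyTorch; the paper's own implementation is in Torch7 (Lua), and `model`, `train_one_epoch`, and `evaluate` are hypothetical placeholders, not names from the released code.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the hierarchical co-attention model.
model = nn.Linear(512, 1000)

# Optimizer settings quoted from the paper's experiment setup.
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=4e-4,            # base learning rate
    momentum=0.99,      # momentum
    weight_decay=1e-8,  # weight decay
)

dropout = nn.Dropout(p=0.5)  # dropout with probability 0.5 on each layer

def evaluate(model: nn.Module) -> float:
    """Hypothetical stand-in for computing validation accuracy."""
    return 0.0

best_acc = 0.0
epochs_without_improvement = 0
for epoch in range(256):  # train for up to 256 epochs
    # train_one_epoch(model, optimizer, batch_size=300)  # hypothetical
    val_acc = evaluate(model)
    if val_acc > best_acc:
        best_acc, epochs_without_improvement = val_acc, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= 5:  # stop after 5 epochs without improvement
        break
```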