Hierarchical Question-Image Co-Attention for Visual Question Answering
Authors: Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA. We evaluate our proposed model on two large datasets, VQA [2] and COCO-QA [15]. We also perform ablation studies to quantify the roles of different components in our model. |
| Researcher Affiliation | Academia | Jiasen Lu , Jianwei Yang , Dhruv Batra , Devi Parikh Virginia Tech, Georgia Institute of Technology {jiasenlu, jw2yang, dbatra, parikh}@vt.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code can be downloaded from https://github.com/jiasenlu/HieCoAttenVQA |
| Open Datasets | Yes | VQA dataset [2] is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset [12]. COCO-QA dataset [15] is automatically generated from captions in the Microsoft COCO dataset [12]. |
| Dataset Splits | Yes | VQA dataset [2] is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset [12]. The dataset contains 248,349 training questions, 121,512 validation questions, 244,302 testing questions, and a total of 6,141,630 question-answers pairs. For testing, we train our model on VQA train+val and report the test-dev and test-standard results from the VQA evaluation server. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | Yes | We use Torch [4] to develop our model. [4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011. |
| Experiment Setup | Yes | We use the RMSprop optimizer with a base learning rate of 4e-4, momentum 0.99 and weight-decay 1e-8. We set batch size to be 300 and train for up to 256 epochs with early stopping if the validation accuracy has not improved in the last 5 epochs. For COCO-QA, the size of hidden layer Ws is set to 512 and 1024 for VQA since it is a much larger dataset. All the other word embedding and hidden layers were vectors of size 512. We apply dropout with probability 0.5 on each layer. |
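To make the reported optimizer settings concrete, here is a minimal, framework-free sketch of an RMSprop update using the hyperparameters quoted above (learning rate 4e-4, decay 0.99, weight decay 1e-8). The paper used Torch; the function and variable names below are illustrative, not from the released code, and "momentum 0.99" is read here as RMSprop's squared-gradient decay rate, which is an assumption.

```python
# Sketch of an RMSprop step with the paper's reported hyperparameters.
# Names and the epsilon constant are assumptions, not from the original code.

LR = 4e-4            # base learning rate (from the paper)
DECAY = 0.99         # the paper's "momentum 0.99", read as the RMSprop decay rate
WEIGHT_DECAY = 1e-8  # from the paper
EPS = 1e-8           # standard numerical-stability constant (assumed)

def rmsprop_step(params, grads, cache):
    """One RMSprop update over flat lists of parameters and gradients."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        g = g + WEIGHT_DECAY * p               # fold L2 weight decay into the gradient
        c = DECAY * c + (1.0 - DECAY) * g * g  # running average of squared gradients
        p = p - LR * g / (c ** 0.5 + EPS)      # update scaled by RMS of gradients
        new_params.append(p)
        new_cache.append(c)
    return new_params, new_cache

# Toy usage: one step on a single scalar parameter with gradient 0.5.
params, cache = [1.0], [0.0]
params, cache = rmsprop_step(params, [0.5], cache)
```

In a full training loop this step would run per mini-batch (batch size 300 in the paper) for up to 256 epochs, with early stopping after 5 epochs without validation improvement.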