Exploring Human-Like Attention Supervision in Visual Question Answering

Authors: Tingting Qiao, Jianfeng Dong, Duanqing Xu

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that adding human-like supervision yields more accurate attention together with better performance, showing a promising future for human-like attention supervision in VQA. The overall accuracy of the attention-based model improves by 0.15% on the VQA v2.0 test-dev set when human-like attention supervision is added, showing the effectiveness of explicit human-like attention supervision in VQA (a hedged sketch of such a supervision term appears after this table).
Researcher Affiliation | Academia | Tingting Qiao, Jianfeng Dong, Duanqing Xu; Zhejiang University, China; {qiaott, danieljf24, xdq}@zju.edu.cn
Pseudocode | No | The paper describes the model architecture and mathematical equations but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | No | The paper states 'The dataset is made available to the public.' in reference to the HLAT dataset, but it does not provide a link or state that the source code for the methodology is publicly available.
Open Datasets | Yes | The HAN is evaluated on the recently released VQA-HAT dataset (Das et al. 2016)... The unsupervised attention model and the supervised attention model are evaluated on the more recent VQA v2.0 dataset, which contains 443,757 image-question pairs in the training set, 214,354 in the validation set, and 447,793 in the testing set.
Dataset Splits | Yes | The unsupervised attention model and the supervised attention model are evaluated on the more recent VQA v2.0 dataset, which contains 443,757 image-question pairs in the training set, 214,354 in the validation set, and 447,793 in the testing set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory capacities) used for running its experiments.
Software Dependencies | No | The network is implemented using Torch7 (Collobert, Kavukcuoglu, and Farabet 2011); however, no version number is given for Torch7, and versions of other key libraries are not provided.
Experiment Setup | Yes | The number of glimpses in the HAN is set to 3. The size of the hidden state in the GRU for refining the attention maps is set to 512. The joint embedding size d used for embedding images and questions is set to 1200. The Adam optimizer (Kingma and Ba 2014) is used with a base learning rate of 3e-4. The batch size is set to 64 and the number of iterations to 300k. Dropout is used with ratio 0.5. For the VQA experiments, the number of glimpses is set to 1 and 2, and the number of iterations is fixed at 300k (see the configuration sketch after this table).
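
The Experiment Setup row above collects the hyperparameters reported in the paper. The snippet below is a minimal configuration sketch that restates those values in Python; the key names and the config-dict structure are illustrative assumptions, not the paper's actual code, and only the values come from the reported setup.

```python
# Hyperparameters as reported for the HAN experiments.
# Key names are illustrative only; the values are the ones stated in the paper.
han_config = {
    "num_glimpses": 3,             # glimpses in the HAN
    "gru_hidden_size": 512,        # hidden state of the GRU refining the attention maps
    "joint_embedding_size": 1200,  # d, for jointly embedding images and questions
    "optimizer": "Adam",           # Kingma and Ba 2014
    "base_learning_rate": 3e-4,
    "batch_size": 64,
    "iterations": 300_000,         # 300k iterations
    "dropout": 0.5,
}

# For the VQA experiments the number of glimpses is varied (1 and 2)
# while the iteration budget stays fixed at 300k.
vqa_configs = [dict(han_config, num_glimpses=g) for g in (1, 2)]
```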
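
The Research Type row notes the paper's central finding: explicitly supervising the model's attention with human-like attention maps improves both attention quality and answer accuracy. The snippet below is a minimal sketch of what such a supervision term can look like, written in PyTorch rather than the paper's Torch7; the KL-divergence penalty, the weight lambda_att, and all function and tensor names are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def supervised_attention_loss(answer_logits, answer_targets,
                              predicted_attention, human_attention,
                              lambda_att=1.0):
    """Answer-classification loss plus an attention-supervision penalty.

    predicted_attention / human_attention: (batch, num_regions) unnormalized
    attention maps. The KL formulation and lambda_att are illustrative
    assumptions, not the paper's exact formulation.
    """
    # Standard VQA answer loss over the answer vocabulary.
    answer_loss = F.cross_entropy(answer_logits, answer_targets)

    # Normalize both maps into distributions over image regions.
    log_pred = F.log_softmax(predicted_attention, dim=1)
    human = F.softmax(human_attention, dim=1)

    # Penalize divergence between model attention and human-like attention.
    attention_loss = F.kl_div(log_pred, human, reduction="batchmean")

    return answer_loss + lambda_att * attention_loss


if __name__ == "__main__":
    batch, num_regions, num_answers = 4, 196, 3000
    logits = torch.randn(batch, num_answers)
    targets = torch.randint(0, num_answers, (batch,))
    pred_att = torch.randn(batch, num_regions)
    human_att = torch.randn(batch, num_regions)
    print(supervised_attention_loss(logits, targets, pred_att, human_att))
```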