Overcoming Language Priors with Self-supervised Learning for Visual Question Answering

Authors: Xi Zhu, Zhendong Mao, Chunxiao Liu, Peng Zhang, Bin Wang, Yongdong Zhang

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method achieves state-of-the-art performance, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2.
Researcher Affiliation | Collaboration | 1) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3) University of Science and Technology of China, Hefei, China; 4) Xiaomi AI Lab, Xiaomi Inc., Beijing, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available on GitHub: https://github.com/CrossmodalGroup/SSL-VQA
Open Datasets | Yes | Our approach is evaluated on the most commonly used benchmark VQA-CP v2 [Agrawal et al., 2018] with the standard evaluation metric [Antol et al., 2015].
Dataset Splits | Yes | The VQA-CP v2 dataset is derived from VQA v2 [Goyal et al., 2017] by reorganizing the train and validation splits, so that the Q-A pairs in the training set and the test set have different distributions. ... We also evaluate our model on the VQA v2 dataset, which contains strong biases, and report the results on its validation split.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software components such as Faster R-CNN, GloVe embeddings, GRU, and the Adam optimizer, but does not specify their version numbers or the versions of the underlying frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We pre-train the model with the VQA loss for 12 epochs and fine-tune it with the self-supervised loss for 20 epochs. The batch size is 256, and the irrelevant images are randomly selected from mini-batches. The Adam optimizer is adopted with an initial learning rate of 0.001, which is halved every 5 epochs after 10 epochs. We evaluate our approach with different VQA losses in our main experiment, setting α = 3 for the multi-label VQA loss and α = 1.2 for the cross-entropy VQA loss.
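
For reference, the schedule described in the Experiment Setup row can be sketched as a short training loop. The snippet below is a minimal illustration under assumed interfaces: the `model`, `loader`, `vqa_loss_fn`, and `ssl_loss_fn` names and signatures are hypothetical, and weighting the self-supervised term by α is an assumption about how the two losses are combined. It is not the authors' released implementation, which is available at the GitHub link above.

```python
# Minimal sketch of the reported training schedule (hypothetical model/loss/loader
# names; NOT the authors' code, only an illustration of the reported hyperparameters).
import torch

BATCH_SIZE = 256
PRETRAIN_EPOCHS = 12   # pre-training with the VQA loss only
FINETUNE_EPOCHS = 20   # fine-tuning with the added self-supervised loss
INIT_LR = 1e-3
ALPHA = 3.0            # 3 for the multi-label VQA loss, 1.2 for the cross-entropy variant

def train(model, loader, vqa_loss_fn, ssl_loss_fn, alpha=ALPHA):
    optimizer = torch.optim.Adam(model.parameters(), lr=INIT_LR)
    # "Halved every 5 epochs after 10 epochs": factor 1.0 for epochs 0-9,
    # then 0.5, 0.25, ... for each subsequent 5-epoch block.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lambda epoch: (0.5 ** ((epoch - 10) // 5 + 1)) if epoch >= 10 else 1.0,
    )

    for epoch in range(PRETRAIN_EPOCHS + FINETUNE_EPOCHS):
        for images, questions, answers in loader:   # assumed (image, question, answer) batches
            logits = model(images, questions)
            loss = vqa_loss_fn(logits, answers)      # VQA loss used in both stages
            if epoch >= PRETRAIN_EPOCHS:
                # Irrelevant images are obtained by shuffling the current mini-batch;
                # the resulting self-supervised term is weighted by alpha.
                shuffled = images[torch.randperm(images.size(0))]
                loss = loss + alpha * ssl_loss_fn(model(shuffled, questions))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```

One judgment call in this sketch is whether the learning-rate schedule counts epochs across both stages or restarts when fine-tuning begins; the quoted setup does not say, so the sketch simply counts total epochs.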