Overcoming Language Priors with Self-supervised Learning for Visual Question Answering
Authors: Xi Zhu, Zhendong Mao, Chunxiao Liu, Peng Zhang, Bin Wang, Yongdong Zhang
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our method achieves state-of-the-art performance, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2. |
| Researcher Affiliation | Collaboration | 1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 3University of Science and Technology of China, Hefei, China 4Xiaomi AI Lab, Xiaomi Inc., Beijing, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available on GitHub (https://github.com/CrossmodalGroup/SSL-VQA). |
| Open Datasets | Yes | Our approach is evaluated on the most commonly used benchmark VQA-CP v2 [Agrawal et al., 2018] with the standard evaluation metric [Antol et al., 2015]. |
| Dataset Splits | Yes | The VQA-CP v2 dataset is derived from VQA v2 [Goyal et al., 2017] by reorganizing the train and validation splits, and the Q-A pairs in the training set and test set have different distributions. ... We also evaluate our model on the VQA v2 dataset containing strong biases and report the results on its validation split. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as 'Faster R-CNN', 'GloVe embeddings', 'GRU', and the 'Adam optimizer' but does not specify their version numbers or the versions of underlying frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We pre-train the model with the VQA loss for 12 epochs and fine-tune it with the self-supervised loss for 20 epochs. The batch size is 256, and the irrelevant images are randomly selected from mini-batches. The Adam optimizer is adopted with the initial learning rate of 0.001 which is halved every 5 epochs after 10 epochs. We evaluate our approach with different VQA losses in our main experiment, setting α = 3 for multi-label VQA loss and α = 1.2 for cross-entropy VQA loss. |
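The "standard evaluation metric [Antol et al., 2015]" cited in the Open Datasets row scores a predicted answer against the ten human annotations for each question. In its commonly quoted form, an answer counts as fully correct when at least three annotators gave it. A minimal sketch of that form (function name and example values are illustrative, not from the paper):

```python
# Commonly quoted form of the VQA accuracy metric [Antol et al., 2015]:
# an answer is fully correct if at least 3 of the 10 annotators gave it.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red" -> accuracy 1.0
print(vqa_accuracy("red", ["red"] * 4 + ["blue"] * 6))
```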
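The schedule quoted in the Experiment Setup row can be sketched as a PyTorch-style training loop. Everything below is a hypothetical reconstruction from the reported hyperparameters, not the authors' released code: `model`, `train_loader`, and the two loss functions are placeholders, and reading "halved every 5 epochs after 10 epochs" as learning-rate steps at epochs 10, 15, 20, ... is an assumption.

```python
import torch

# Hypothetical sketch of the quoted schedule; model, train_loader, and the
# loss functions are placeholders, not the authors' released implementation.
def train(model, train_loader, vqa_loss_fn, self_sup_loss_fn, alpha=3.0):
    # alpha = 3 with the multi-label VQA loss, 1.2 with cross-entropy (quoted)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # "halved every 5 epochs after 10 epochs" read as steps at 10, 15, 20, ...
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[10, 15, 20, 25, 30], gamma=0.5)

    for epoch in range(12 + 20):  # 12 pre-training + 20 fine-tuning epochs
        for images, questions, answers in train_loader:  # batch size 256
            optimizer.zero_grad()
            loss = vqa_loss_fn(model(images, questions), answers)
            if epoch >= 12:
                # irrelevant images are drawn by shuffling the mini-batch
                shuffled = images[torch.randperm(images.size(0))]
                loss = loss + alpha * self_sup_loss_fn(model, shuffled, questions)
            loss.backward()
            optimizer.step()
        scheduler.step()
```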