Debiased Visual Question Answering from Feature and Sample Perspectives

Authors: Zhiquan Wen, Guanghui Xu, Mingkui Tan, Qingyao Wu, Qi Wu

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our D-VQA method. |
| Researcher Affiliation | Academia | Zhiquan Wen1,2, Guanghui Xu1, Mingkui Tan1,3, Qingyao Wu1, Qi Wu4. 1School of Software Engineering, South China University of Technology, China; 2Peng Cheng Laboratory, China; 3Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education; 4School of Computer Science, University of Adelaide. {sewenzhiquan, sexuguanghui}@mail.scut.edu.cn, {mingkuitan, qyw}@scut.edu.cn, qi.wu01@adelaide.edu.au |
| Pseudocode | No | The paper describes its method in detail using prose and mathematical formulations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and the pre-trained models are available at https://github.com/Zhiquan-Wen/D-VQA. |
| Open Datasets | Yes | To demonstrate the effectiveness of our D-VQA method, we evaluate it on the out-of-distribution benchmark dataset VQA-CP (Visual Question Answering under Changing Priors) v2 [3] and the IID dataset VQA v2 [18] validation set, based on the standard evaluation metric [6]. |
| Dataset Splits | Yes | The training set of VQA-CP v2 contains approximately 121k images and 245k questions, while the test set contains approximately 98k images and 220k questions. For the IID setting, the VQA v2 [18] validation set is used. |
| Hardware Specification | Yes | The model is trained with one Titan Xp GPU. |
| Software Dependencies | No | The paper mentions software such as PyTorch [30] and LXMERT [37] but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Specifically, we train all the branches with the binary cross-entropy loss and contrastive loss over the training process, and the sample perspective loss is introduced at the 13th epoch. We adopt the Adam optimiser with an initial learning rate of 1e-3, and the learning rate decreases by half every 5 epochs after 10 epochs. The batch size is set to 256. For the backbone of LXMERT [37], ... we train LXMERT + D-VQA for 10 epochs, and the sample perspective loss is introduced at the 7th epoch. The batch size is 32, and the learning rate is 1e-5. |
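
To make the quoted training schedule concrete, here is a minimal PyTorch sketch of the optimisation loop for the non-LXMERT backbone. The network, feature and answer-vocabulary dimensions, total epoch count, and the contrastive and sample-perspective loss values are placeholders and not the authors' implementation; only the Adam optimiser, the 1e-3 learning rate halved every 5 epochs after epoch 10, the batch size of 256, and the introduction of the sample-perspective loss at the 13th epoch follow the quoted setup.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder network standing in for the D-VQA model (not the authors' architecture).
model = nn.Linear(2048, 3129)  # e.g. fused features -> answer logits (dimensions assumed)

NUM_EPOCHS = 30           # assumption: the paper does not state the total epoch count here
SAMPLE_LOSS_EPOCH = 13    # sample-perspective loss is introduced at the 13th epoch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# "the learning rate decreases by half every 5 epochs after 10 epochs"
scheduler = MultiStepLR(optimizer, milestones=list(range(10, NUM_EPOCHS, 5)), gamma=0.5)

bce = nn.BCEWithLogitsLoss()

for epoch in range(NUM_EPOCHS):
    # Dummy batch (batch size 256) standing in for VQA-CP v2 features and answer targets.
    features = torch.randn(256, 2048)
    targets = torch.randint(0, 2, (256, 3129)).float()

    logits = model(features)
    loss = bce(logits, targets)

    # The contrastive loss is applied throughout training; the sample-perspective loss
    # is added from the 13th epoch onward. Both are zero-valued placeholders here.
    contrastive_loss = torch.tensor(0.0)
    loss = loss + contrastive_loss
    if epoch + 1 >= SAMPLE_LOSS_EPOCH:
        sample_loss = torch.tensor(0.0)
        loss = loss + sample_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

For the LXMERT backbone, the quoted setup implies the same pattern with 10 training epochs, the sample-perspective loss introduced at the 7th epoch, a batch size of 32, and a learning rate of 1e-5.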