Free VQA Models from Knowledge Inertia by Pairwise Inconformity Learning

Authors: Yiyi Zhou, Rongrong Ji, Jinsong Su, Xiangming Li, Xiaoshuai Sun (pp. 9316-9323)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify the proposed PIL, we plug it on a baseline VQA model as well as a set of recent VQA models, and conduct extensive experiments on two benchmark datasets, i.e., VQA1.0 and VQA2.0.
Researcher Affiliation | Academia | 1 Fujian Key Laboratory of Sensing and Computing for Smart City, Department of Cognitive Science, School of Information Science and Engineering, Xiamen University, China; 2 School of Software Engineering, Xiamen University, China; 3 Peng Cheng Laboratory, China
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | https://github.com/xiangmingLi/PIL
Open Datasets | Yes | VQA1.0 dataset contains 200,000 natural images from MS-COCO (Chen et al. 2015) with 614,153 human annotated questions in total. [...] VQA2.0 is developed based on VQA1.0, and has about 1,105,904 image-question pairs, of which 443,757 examples are for training, 214,254 for validation, and 447,793 for testing.
Dataset Splits | Yes | The whole dataset is divided into three splits, in which there are 248,349 examples for training, 121,512 for validation, and 244,302 for testing. [...] VQA2.0 is developed based on VQA1.0, and has about 1,105,904 image-question pairs, of which 443,757 examples are for training, 214,254 for validation, and 447,793 for testing. (The split counts are summarized in the sketch after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments.
Software Dependencies | No | The paper mentions software components such as GloVe embeddings, the Adam optimizer, and an LSTM network, but does not provide version numbers for any libraries or frameworks used.
Experiment Setup | Yes | The dimension of the LSTM module is 2048, while the k and o in MFB fusion (Yu et al. 2017) are set to 5 and 1000, respectively. The dimensions of the last forward layer and the projections are set to 2048 and 300. The two hyper-parameters, α and β, are set to 0.25 and 0.01 after tuning. The initial learning rate is 7e-4, which is halved after every 25,000 steps. The batch size is 64 and the maximum training step is 150,000. The optimizer we used is Adam (Kingma and Ba 2014).
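For quick reference, the split sizes quoted in the Dataset Splits row can be collected into a small lookup table. The snippet below is only an illustrative summary: the constant name DATASET_SPLITS and the dictionary layout are assumptions, not part of the authors' code; only the numbers come from the paper.

```python
# Per-split example counts reported in the paper for the two benchmarks.
# The constant name and dict layout are illustrative, not from the authors' code.
DATASET_SPLITS = {
    "VQA1.0": {"train": 248_349, "val": 121_512, "test": 244_302},
    "VQA2.0": {"train": 443_757, "val": 214_254, "test": 447_793},
}
```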
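The configuration in the Experiment Setup row can likewise be gathered into a single object and paired with the reported step-wise learning-rate schedule. The sketch below is a minimal illustration, assuming a PyTorch-style setup; the class name PILTrainConfig, its field names, and the helper lr_at_step are hypothetical, and only the numeric values are taken from the paper.

```python
from dataclasses import dataclass

import torch


@dataclass
class PILTrainConfig:
    """Hyper-parameters transcribed from the Experiment Setup row above.

    The class and field names are illustrative assumptions, not the authors'
    code; only the numeric values come from the paper.
    """
    lstm_dim: int = 2048        # dimension of the LSTM module
    mfb_k: int = 5              # k in MFB fusion (Yu et al. 2017)
    mfb_o: int = 1000           # o in MFB fusion
    forward_dim: int = 2048     # dimension of the last forward layer
    proj_dim: int = 300         # dimension of the projections
    alpha: float = 0.25         # loss weight alpha (after tuning)
    beta: float = 0.01          # loss weight beta (after tuning)
    base_lr: float = 7e-4       # initial learning rate
    halve_every: int = 25_000   # learning rate is halved after every 25,000 steps
    batch_size: int = 64
    max_steps: int = 150_000


def lr_at_step(cfg: PILTrainConfig, step: int) -> float:
    """Learning rate under the reported schedule: halved after every 25,000 steps."""
    return cfg.base_lr * (0.5 ** (step // cfg.halve_every))


if __name__ == "__main__":
    cfg = PILTrainConfig()
    # Placeholder parameter list, only to show the reported optimizer choice (Adam).
    params = [torch.nn.Parameter(torch.zeros(cfg.proj_dim))]
    optimizer = torch.optim.Adam(params, lr=cfg.base_lr)
    for step in (0, 24_999, 25_000, 75_000, cfg.max_steps):
        print(f"step {step}: lr = {lr_at_step(cfg, step):.2e}")
```

In an actual reproduction, the Adam optimizer would be attached to the model's parameters and the learning rate updated from lr_at_step at each training step until max_steps is reached.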