Dynamic Capsule Attention for Visual Question Answering

Authors: Yiyi Zhou, Rongrong Ji, Jinsong Su, Xiaoshuai Sun, Weiqiu Chen

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the merits of the proposed CapsAtt, we first conduct extensive experiments on three VQA datasets, i.e., COCO-QA (Ren, Kiros, and Zemel 2015), VQA1.0 (Antol et al. 2015) and VQA2.0 (Goyal et al. 2017). The experimental results show that CapsAtt can obtain significant improvements on all three datasets compared with the classic multi-step attention model.
Researcher Affiliation | Academia | Yiyi Zhou (1), Rongrong Ji (1), Jinsong Su (2), Xiaoshuai Sun (1), Weiqiu Chen (1); (1) Fujian Key Laboratory of Sensing and Computing for Smart City, Department of Cognitive Science, School of Information Science and Engineering, Xiamen University, China; (2) School of Software Engineering, Xiamen University, China; Peng Cheng Laboratory, China
Pseudocode | Yes | Algorithm 1 Dynamic Capsule Attention. Input: feature matrix F and reference vector h. Output: coupling coefficients c and the output capsule s. 1: Initialize projection matrices W_f and W_h; 2: Project F and h, and obtain f_p^i ∈ F_p and h_p; 3: Initialize s_0 with h_p; 4: for t in N iterations do; 5: Obtain coupling coefficients: c_i = softmax(b_i); 6: Obtain weighted-sum feature: f_p^a = Σ_i^K c_i f_p^i; 7: Update output capsule: s_t = s_{t-1} + f_p^a; 8: Update agreements: b_i = f_p^i · s_t + b_i; 9: end for; 10: return c, s_N. (A runnable sketch of this routing loop is given after the table.)
Open Source Code | Yes | https://github.com/XMUVQA/CapsAtt
Open Datasets | Yes | To validate the merits of the proposed CapsAtt, we first conduct extensive experiments on three VQA datasets, i.e., COCO-QA (Ren, Kiros, and Zemel 2015), VQA1.0 (Antol et al. 2015) and VQA2.0 (Goyal et al. 2017).
Dataset Splits | Yes | VQA1.0 dataset contains... 248,349 examples for training, 121,512 for validation, and 244,302 for testing. VQA2.0 is developed based on VQA1.0... 443,757 examples are for training, 214,254 are for validation, and 447,793 are for testing. COCO-captions... 82,783, 40,504 and 40,775 images for training, validation and test, respectively.
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running the experiments are mentioned.
Software Dependencies | No | The paper mentions the neural network components and optimizer used (e.g., CNN, GRU, LSTM, Adam) but does not provide specific version numbers for any software, libraries, or solvers used in the experiments.
Experiment Setup | Yes | On COCO-QA, the visual feature used is the convolution feature map before the first fully-connected layer of VGG16 (Simonyan and Zisserman 2014) with a size of 14 × 14 × 512. The dimensions of the forward and backward GRU units are set to 256, so the overall dimension of the output question feature is 512. The attention dimension in CapsAtt is set to 512, and the dimensions of the two FC layers are both 512 as well. The number of answer categories is set to 434. The initial learning rate is 7e-4, which is halved after every 10,000 training steps. On VQA1.0 and VQA2.0, the visual feature input is the feature map before the last pooling layer of ResNet-152 (He et al. 2016) with a size of 14 × 14 × 2,048. We also use the Faster R-CNN features from (Anderson et al. 2018) with a size of 36 × 2,048, labeled FRCNN to distinguish them. The dimensions of the GRU units, CapsAtt, and FC layers are set to 512, 1,024, and 2,048, respectively. The answer dimensions on these two datasets are both set to 3,000. The initial learning rate is set to 7e-4 with a decay step of 25,000 and a decay rate of 0.5, and the batch size is 124. The maximum number of training steps is 200,000, and validation is run every 5,000 steps. Early stopping is applied when performance does not improve after 5 validations.
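For reference, the following is a minimal NumPy sketch of the routing loop quoted in the Pseudocode row (Algorithm 1). The function and variable names (caps_att, W_f, W_h, n_iters), the tensor shapes, and the toy usage are assumptions made for illustration; they are not taken from the paper or the authors' released code at the repository above.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array of routing logits
    e = np.exp(x - x.max())
    return e / e.sum()

def caps_att(F, h, W_f, W_h, n_iters=3):
    """Dynamic capsule attention routing (sketch of Algorithm 1).

    F   : (K, d_v) matrix of K visual region features
    h   : (d_q,)   question (reference) vector
    W_f : (d_v, d_p) projection matrix for the visual features
    W_h : (d_q, d_p) projection matrix for the reference vector
    Returns the coupling coefficients c (K,) and the output capsule s (d_p,).
    """
    F_p = F @ W_f                 # step 2: projected feature capsules f_p^i
    h_p = h @ W_h                 # step 2: projected reference vector h_p
    s = h_p.copy()                # step 3: initialize output capsule s_0 with h_p
    b = np.zeros(F_p.shape[0])    # routing agreements b_i, start at zero
    for _ in range(n_iters):      # step 4: N routing iterations
        c = softmax(b)            # step 5: coupling coefficients c_i = softmax(b_i)
        f_a = c @ F_p             # step 6: weighted-sum feature f_p^a = sum_i c_i f_p^i
        s = s + f_a               # step 7: update output capsule s_t = s_{t-1} + f_p^a
        b = b + F_p @ s           # step 8: update agreements b_i = f_p^i . s_t + b_i
    return c, s                   # step 10: return c and s_N

# Toy usage with random inputs (shapes chosen arbitrarily for illustration)
rng = np.random.default_rng(0)
K, d_v, d_q, d_p = 36, 2048, 512, 512
c, s = caps_att(rng.normal(size=(K, d_v)), rng.normal(size=d_q),
                0.01 * rng.normal(size=(d_v, d_p)), 0.01 * rng.normal(size=(d_q, d_p)))
```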
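The VQA1.0/VQA2.0 hyperparameters quoted in the Experiment Setup row are collected in the small configuration sketch below, which also assumes the quoted decay step and decay rate describe a staircase learning-rate schedule. The dictionary keys and the learning_rate helper are hypothetical names, not taken from the paper or its repository.

```python
# Hyperparameters for VQA1.0/VQA2.0 as quoted in the Experiment Setup row.
# Key names are illustrative; the original code may organize these differently.
VQA_CONFIG = {
    "gru_dim": 512,            # GRU units for the question encoder
    "capsatt_dim": 1024,       # attention (capsule) dimension
    "fc_dim": 2048,            # dimension of the FC layers
    "num_answers": 3000,       # answer vocabulary size
    "batch_size": 124,
    "init_lr": 7e-4,
    "lr_decay_rate": 0.5,      # learning rate halved ...
    "lr_decay_steps": 25_000,  # ... every 25,000 steps (assumed staircase schedule)
    "max_steps": 200_000,
    "val_every": 5_000,
    "early_stop_patience": 5,  # stop after 5 validations without improvement
}

def learning_rate(step, cfg=VQA_CONFIG):
    """Stepwise-decayed learning rate implied by the quoted schedule."""
    return cfg["init_lr"] * cfg["lr_decay_rate"] ** (step // cfg["lr_decay_steps"])

assert abs(learning_rate(24_999) - 7e-4) < 1e-12   # before the first decay step
assert abs(learning_rate(25_000) - 3.5e-4) < 1e-12  # halved at 25,000 steps
```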