Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Authors: Chenchen Jing, Yuwei Wu, Xiaoxun Zhang, Yunde Jia, Qi Wu (pp. 11181-11188)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the VQA-CP dataset demonstrate the effectiveness of our method. [...] Results and Analysis. Comparison with the state-of-the-art. The results of our method and state-of-the-art VQA models on the VQA-CP v2 dataset are listed in Table 1.
Researcher Affiliation | Collaboration | Chenchen Jing (1), Yuwei Wu (1), Xiaoxun Zhang (2), Yunde Jia (1), Qi Wu (3). (1) Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, China; (2) Alibaba Group; (3) Australian Centre for Robotic Vision, University of Adelaide, Australia
Pseudocode | No | The paper describes algorithms and modules in detail using mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We evaluate the effectiveness of the proposed method in the VQA-CP v2 dataset (Agrawal et al. 2018) using standard VQA evaluation metric (Antol et al. 2015). [...] We also report the results on the validation split of the original VQA v2 dataset for completeness.
Dataset Splits | Yes | The train split and test split of VQA-CP v2 is created by re-organizing the train split and validation split of the VQA v2 (Goyal et al. 2017). [...] For each image, the UpDn generates no more than 100 proposals with its 2048-d feature. The questions are preprocessed to a maximum of 14 words. [...] In the VQA-CP, we set the number of training epochs as 30 and the final model is used for evaluation without early-stopping because there is no validation set.
Hardware Specification | No | The paper mentions that the model is built on the 'bottom-up and top-down attention (UpDn) method' but does not specify any hardware details like GPU models, CPU, or memory used for training or inference.
Software Dependencies | No | The paper mentions 'pre-trained GloVe (Pennington, Socher, and Manning 2014)' and 'Gated Recurrent Unit (GRU) (Cho et al. 2014)' but does not provide specific version numbers for these or other software libraries/frameworks used.
Experiment Setup | Yes | Implementation Detail. We build our model on the bottom-up and top-down attention (UpDn) method (Anderson et al. 2018) as (Ramakrishnan, Agrawal, and Lee 2018) and (Selvaraju et al. 2019). [...] The questions are preprocessed to a maximum of 14 words. [...] The pre-trained GloVe is used to initialize the word embeddings with the dimension of 300 and then the GRU is used to obtain sentence-level question embeddings with the dimension of 512. In our implementation, we set K as 36 for each image, thus the dimension of features of an image is 36 × 2048. [...] For the language attention module and the question identification module, we set the threshold β in the language attention module as 0.1, which is a little bigger than average attention weight, i.e., 0.07. In the VQA-CP, we set the number of training epochs as 30.
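For readers who want to mirror the reported configuration, below is a minimal PyTorch-style sketch of the dimensions and hyperparameters quoted in the Dataset Splits and Experiment Setup rows. It is not the authors' code (none was released, per the Open Source Code row); the vocabulary size, class and function names, and everything beyond the quoted numbers are illustrative assumptions.

import torch
import torch.nn as nn

VOCAB_SIZE = 20000         # placeholder; the actual vocabulary size is not reported
GLOVE_DIM = 300            # "word embeddings with the dimension of 300"
QUESTION_DIM = 512         # "sentence-level question embeddings with the dimension of 512"
NUM_REGIONS = 36           # "we set K as 36 for each image"
REGION_DIM = 2048          # 36 × 2048 bottom-up (UpDn) region features
MAX_QUESTION_LEN = 14      # "preprocessed to a maximum of 14 words"
ATTENTION_THRESHOLD = 0.1  # threshold β in the language attention module
NUM_EPOCHS = 30            # 30 epochs, no early stopping (no validation set in VQA-CP)

class QuestionEncoder(nn.Module):
    """GloVe-initialised embedding followed by a GRU, matching the quoted dimensions."""
    def __init__(self, glove_weights=None):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB_SIZE, GLOVE_DIM)
        if glove_weights is not None:
            # Initialise from pre-trained GloVe vectors when they are available.
            self.embedding.weight.data.copy_(glove_weights)
        self.gru = nn.GRU(GLOVE_DIM, QUESTION_DIM, batch_first=True)

    def forward(self, question_tokens):
        # question_tokens: (batch, MAX_QUESTION_LEN) integer word ids
        embedded = self.embedding(question_tokens)        # (batch, 14, 300)
        _, hidden = self.gru(embedded)                     # hidden: (1, batch, 512)
        return hidden.squeeze(0)                           # (batch, 512) sentence embedding

def select_content_words(attention_weights):
    """Keep words whose attention weight exceeds β = 0.1, the selection rule quoted
    for the language attention and question identification modules."""
    return attention_weights > ATTENTION_THRESHOLD

The image side would pair each question with a (NUM_REGIONS, REGION_DIM) = (36, 2048) feature matrix from the UpDn detector; that extraction step is outside the scope of this sketch.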