Learning to Answer Questions from Image Using Convolutional Neural Network

Authors: Lin Ma, Zhengdong Lu, Hang Li

AAAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our proposed model on the DAQUAR and COCO-QA datasets, which are two benchmark datasets for image QA, with the performances significantly outperforming the state-of-the-art. Experimental results on public image QA datasets show that our proposed CNN model surpasses the state-of-the-art.
Researcher Affiliation | Industry | Lin Ma, Zhengdong Lu, Hang Li; Noah's Ark Lab, Huawei Technologies
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website link (http://conviqa.noahlab.com.hk/project.html), but does not state that the source code for the described methodology is openly available, nor does it link to a code repository.
Open Datasets | Yes | We test and compare our proposed CNN model on the public image QA databases, specifically the DAQUAR (Malinowski and Fritz 2014a) and COCO-QA (Ren, Kiros, and Zemel 2015) datasets. DAQUAR-All consists of 6,795 training and 5,673 testing samples, generated from 795 and 654 images, respectively. COCO-QA consists of 79,100 training and 39,171 testing samples, generated from about 8,000 and 4,000 images, respectively.
Dataset Splits | No | The paper specifies training and testing sample counts for the DAQUAR-All, DAQUAR-Reduced, and COCO-QA datasets, but does not mention a distinct validation split with sample counts or percentages.
Hardware Specification | No | The paper describes CNN models such as VGG and general training parameters, but does not give the specific hardware (e.g., GPU model, CPU type, memory) used to run the experiments.
Software Dependencies | No | The paper describes the model components and training procedure, but does not list software dependencies with version numbers (e.g., Python or deep learning framework versions).
Experiment Setup | Yes | Three layers of convolution and max-pooling are employed for the sentence CNN, with 300, 400, and 400 feature maps for the three convolution layers, respectively. The maximum question length is 38. Word embeddings are obtained with the skip-gram model (Mikolov et al. 2013) with dimension 50. The VGG network (Simonyan and Zisserman 2014) is used as the image CNN, and the dimension of ν_im is set to 400. The multimodal CNN takes the image and sentence representations as input and generates the joint representation with 400 feature maps. The model is trained with stochastic gradient descent with mini-batches of 100, using negative log likelihood as the loss; dropout (with probability 0.1) is used to prevent overfitting.
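To make the reported setup concrete, below is a minimal sketch in PyTorch (an assumption; the paper names no framework). Only the quoted hyperparameters come from the paper: feature-map counts 300/400/400, question length 38, 50-d skip-gram embeddings, 400-d image and joint representations, dropout 0.1, and SGD with mini-batches of 100 under a negative-log-likelihood loss. The kernel sizes, the VGG feature dimension, the vocabulary and answer-set sizes, and the fusion by concatenation plus a fully connected layer are placeholders; the paper's multimodal CNN instead convolves over the joint representation.

```python
# Hypothetical sketch of the paper's CNN image-QA setup; not the authors' code.
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    """Three convolution + max-pooling layers over word embeddings (as quoted)."""
    def __init__(self, vocab_size, embed_dim=50, max_len=38):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Feature-map counts 300, 400, 400 are from the paper; kernel size 3
        # and the padding are assumptions.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 300, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(300, 400, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(400, 400, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse to a single 400-d question vector
        )

    def forward(self, tokens):                    # tokens: (batch, 38)
        x = self.embed(tokens).transpose(1, 2)    # (batch, 50, 38)
        return self.conv(x).squeeze(-1)           # (batch, 400)

class ImageQACNN(nn.Module):
    def __init__(self, vocab_size, num_answers, img_feat_dim=4096):
        super().__init__()
        self.sentence_cnn = SentenceCNN(vocab_size)
        # The paper maps VGG image features to a 400-d vector (its ν_im);
        # img_feat_dim=4096 (VGG fc7) is an assumption.
        self.img_proj = nn.Linear(img_feat_dim, 400)
        # Simplified fusion with 400 output units; the paper's multimodal CNN
        # applies convolution over the joint representation instead.
        self.multimodal = nn.Sequential(
            nn.Linear(400 + 400, 400), nn.ReLU(), nn.Dropout(p=0.1),
        )
        self.classifier = nn.Linear(400, num_answers)

    def forward(self, tokens, vgg_features):
        q = self.sentence_cnn(tokens)
        v = torch.relu(self.img_proj(vgg_features))
        joint = self.multimodal(torch.cat([q, v], dim=1))
        return self.classifier(joint)             # answer logits

# Training as quoted: SGD, mini-batches of 100, negative log likelihood.
# vocab_size and num_answers are placeholders; the learning rate is not given.
model = ImageQACNN(vocab_size=10000, num_answers=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()  # log-softmax + NLL

def train_step(tokens, feats, answers):
    # tokens: LongTensor (100, 38); feats: (100, 4096); answers: (100,)
    optimizer.zero_grad()
    loss = criterion(model(tokens, feats), answers)
    loss.backward()
    optimizer.step()
    return loss.item()
```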