Learning to Answer Questions from Image Using Convolutional Neural Network

Authors: Lin Ma, Zhengdong Lu, Hang Li

AAAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our proposed model on the DAQUAR and COCO-QA datasets, which are two benchmark datasets for image QA, with the performances significantly outperforming the state-of-the-art. Experimental results on public image QA datasets show that our proposed CNN model surpasses the state-of-the-art.
Researcher Affiliation | Industry | Lin Ma, Zhengdong Lu, Hang Li; Noah's Ark Lab, Huawei Technologies
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website link (http://conviqa.noahlab.com.hk/project.html), but does not state that the source code for the described methodology is openly available, nor does it link to a code repository.
Open Datasets | Yes | We test and compare our proposed CNN model on the public image QA databases, specifically the DAQUAR (Malinowski and Fritz 2014a) and COCO-QA (Ren, Kiros, and Zemel 2015) datasets. DAQUAR-All consists of 6,795 training and 5,673 testing samples, generated from 795 and 654 images, respectively. COCO-QA consists of 79,100 training and 39,171 testing samples, generated from about 8,000 and 4,000 images, respectively.
Dataset Splits | No | The paper specifies training and testing sample counts for the DAQUAR-All, DAQUAR-Reduced, and COCO-QA datasets, but does not mention a distinct validation split with sample counts or percentages.
Hardware Specification | No | The paper describes CNN models such as VGG and general training parameters, but does not give the specific hardware (e.g., GPU model, CPU type, memory) used to run the experiments.
Software Dependencies | No | The paper describes the model components and training procedure, but does not list software dependencies with version numbers (e.g., Python or deep learning framework versions).
Experiment Setup | Yes | Three layers of convolution and max-pooling are employed for the sentence CNN, with 300, 400, and 400 feature maps for the three convolution layers, respectively. The maximum question length is 38. Word embeddings are obtained with the skip-gram model (Mikolov et al. 2013) with dimension 50. The VGG network (Simonyan and Zisserman 2014) is used as the image CNN, and the dimension of ν_im is set to 400. The multimodal CNN takes the image and sentence representations as input and generates the joint representation with 400 feature maps. The model is trained with stochastic gradient descent with mini-batches of 100, using negative log likelihood as the loss; dropout (with probability 0.1) is used to prevent overfitting.
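To make the reported setup concrete, below is a minimal sketch in PyTorch (an assumption; the paper names no framework). Only the quoted hyperparameters come from the paper: feature-map counts 300/400/400, question length 38, 50-d skip-gram embeddings, 400-d image and joint representations, dropout 0.1, and SGD with mini-batches of 100 under a negative-log-likelihood loss. The kernel sizes, the VGG feature dimension, the vocabulary and answer-set sizes, and the fusion by concatenation plus a fully connected layer are placeholders; the paper's multimodal CNN instead convolves over the joint representation.

```python
# Hypothetical sketch of the paper's CNN image-QA setup; not the authors' code.
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    """Three convolution + max-pooling layers over word embeddings (as quoted)."""
    def __init__(self, vocab_size, embed_dim=50, max_len=38):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Feature-map counts 300, 400, 400 are from the paper; kernel size 3
        # and the padding are assumptions.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 300, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(300, 400, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(400, 400, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # collapse to a single 400-d question vector
        )

    def forward(self, tokens):                    # tokens: (batch, 38)
        x = self.embed(tokens).transpose(1, 2)    # (batch, 50, 38)
        return self.conv(x).squeeze(-1)           # (batch, 400)

class ImageQACNN(nn.Module):
    def __init__(self, vocab_size, num_answers, img_feat_dim=4096):
        super().__init__()
        self.sentence_cnn = SentenceCNN(vocab_size)
        # The paper maps VGG image features to a 400-d vector (its ν_im);
        # img_feat_dim=4096 (VGG fc7) is an assumption.
        self.img_proj = nn.Linear(img_feat_dim, 400)
        # Simplified fusion with 400 output units; the paper's multimodal CNN
        # applies convolution over the joint representation instead.
        self.multimodal = nn.Sequential(
            nn.Linear(400 + 400, 400), nn.ReLU(), nn.Dropout(p=0.1),
        )
        self.classifier = nn.Linear(400, num_answers)

    def forward(self, tokens, vgg_features):
        q = self.sentence_cnn(tokens)
        v = torch.relu(self.img_proj(vgg_features))
        joint = self.multimodal(torch.cat([q, v], dim=1))
        return self.classifier(joint)             # answer logits

# Training as quoted: SGD, mini-batches of 100, negative log likelihood.
# vocab_size and num_answers are placeholders; the learning rate is not given.
model = ImageQACNN(vocab_size=10000, num_answers=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()  # log-softmax + NLL

def train_step(tokens, feats, answers):
    # tokens: LongTensor (100, 38); feats: (100, 4096); answers: (100,)
    optimizer.zero_grad()
    loss = criterion(model(tokens, feats), answers)
    loss.backward()
    optimizer.step()
    return loss.item()
```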