Multimodal Learning and Reasoning for Visual Question Answering
Authors: Ilija Ilievski, Jiashi Feng
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an extensive evaluation and achieve new state-of-the-art performance on the two VQA benchmark datasets. |
| Researcher Affiliation | Academia | Ilija Ilievski Integrative Sciences and Engineering National University of Singapore ilija.ilievski@u.nus.edu Jiashi Feng Electrical and Computer Engineering National University of Singapore elefjia@nus.edu.sg |
| Pseudocode | No | The paper describes its model architecture and components in detail using mathematical formulas and text, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to external open-source code used for components or baselines (e.g., "https://github.com/facebookresearch/deepmask", "https://github.com/kpzhang93/MTCNN_face_detection_alignment"), but it does not provide concrete access to the source code for the proposed ReasonNet model itself. |
| Open Datasets | Yes | We evaluate our model on the two benchmark VQA datasets, VQA v1.0 [3] and VQA v2.0 [7]. |
| Dataset Splits | Yes | We train the models on the VQA v2.0 train set and evaluate them on the validation set. |
| Hardware Specification | No | The paper describes model architectures and training procedures but does not explicitly detail the specific hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and references other methods or pre-trained models, but it does not provide specific software dependencies with version numbers (e.g., Python version, specific library versions like PyTorch 1.9). |
| Experiment Setup | Yes | Following [18, 6] the images are scaled and center-cropped to a dimensionality of 3 × 448 × 448, then fed through a ResNet-152 [8] pretrained on ImageNet [36]. ... The lookup table matrix uses 300-dimensional vectors, initialized with word2vec [31] vectors. ... The encoder units encode the module outputs to 500-dimensional vectors, with a hidden layer of 1,500 dimensions. Each bilinear interaction model outputs a 500-dimensional interaction vector, i.e. 500 × 500 → 500. The classification network classifies the reasoning vector g using one hidden layer of 2,500 dimensions to one of the 4,096 most common answers in the training set. We jointly optimize the parameters of the encoder units, the bilinear models, and the answer classification network using Adam [19] with a learning rate of 0.0007, without learning rate decay. We apply a gradient clipping threshold of 5 and use dropout [41] (with p(keep) = 0.5) layers before and batch normalization [13] after each fully-connected layer as regularization. |
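The classifier and optimizer details quoted above (500-d reasoning vector, one 2,500-d hidden layer, 4,096 answer classes, Adam at lr 0.0007, gradient clipping at 5, dropout before and batch normalization after each fully connected layer) can be sketched in PyTorch. This is a hypothetical reconstruction from the quoted hyperparameters, not the authors' released code; the class and variable names are assumptions.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """Sketch of the answer classification network described in the paper:
    a 500-d reasoning vector g -> 2,500-d hidden layer -> 4,096 answers,
    with dropout (p(keep) = 0.5) before and batch norm after each FC layer."""

    def __init__(self, in_dim=500, hidden_dim=2500, n_answers=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p=0.5),               # p(keep) = 0.5 -> drop prob 0.5
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),      # batch norm after each FC layer
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, n_answers),
            nn.BatchNorm1d(n_answers),
        )

    def forward(self, g):
        return self.net(g)                   # unnormalized answer logits

model = AnswerClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0007)  # no lr decay

# One illustrative training step on a dummy batch of reasoning vectors.
g = torch.randn(8, 500)
labels = torch.randint(0, 4096, (8,))
logits = model(g)                            # shape (8, 4096)
loss = nn.functional.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip at 5
optimizer.step()
```

In a full reproduction this classifier would be trained jointly with the encoder units and bilinear interaction models, as the quoted setup specifies; only the classification head is sketched here.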