Multimodal Learning and Reasoning for Visual Question Answering
Authors: Ilija Ilievski, Jiashi Feng
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an extensive evaluation and achieve new state-of-the-art performance on the two VQA benchmark datasets. |
| Researcher Affiliation | Academia | Ilija Ilievski Integrative Sciences and Engineering National University of Singapore ilija.ilievski@u.nus.edu Jiashi Feng Electrical and Computer Engineering National University of Singapore elefjia@nus.edu.sg |
| Pseudocode | No | The paper describes its model architecture and components in detail using mathematical formulas and text, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to external open-source code used for components or baselines (e.g., "https://github.com/facebookresearch/deepmask", "https://github.com/kpzhang93/MTCNN_face_detection_alignment"), but it does not provide concrete access to the source code for the proposed ReasonNet model itself. |
| Open Datasets | Yes | We evaluate our model on the two benchmark VQA datasets, VQA v1.0 [3] and VQA v2.0 [7]. |
| Dataset Splits | Yes | We train the models on the VQA v2.0 train set and evaluate them on the validation set. |
| Hardware Specification | No | The paper describes model architectures and training procedures but does not explicitly detail the specific hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and references other methods or pre-trained models, but it does not provide specific software dependencies with version numbers (e.g., Python version, specific library versions like PyTorch 1.9). |
| Experiment Setup | Yes | Following [18, 6] the images are scaled and center-cropped to a dimensionality of 3 × 448 × 448, then fed through a ResNet-152 [8] pretrained on ImageNet [36]. ... The lookup table matrix uses 300-dimensional vectors, initialized with word2vec [31] vectors. ... The encoder units encode the module outputs to 500-dimensional vectors, with a hidden layer of 1,500 dimensions. Each bilinear interaction model outputs a 500-dimensional interaction vector, i.e. 500 × 500 → 500. The classification network classifies the reasoning vector g using one hidden layer of 2,500 dimensions to one of the 4,096 most common answers in the training set. We jointly optimize the parameters of the encoder units, the bilinear models, and the answer classification network using Adam [19] with a learning rate of 0.0007, without learning rate decay. We apply a gradient clipping threshold of 5 and use dropout [41] (with p(keep) = 0.5) layers before and batch normalization [13] after each fully-connected layer as regularization. |
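The classifier and optimizer details quoted above (500-d reasoning vector, one 2,500-d hidden layer, 4,096 answer classes, Adam at lr 0.0007, gradient clipping at 5, dropout before and batch normalization after each fully connected layer) can be sketched in PyTorch. This is a hypothetical reconstruction from the quoted hyperparameters, not the authors' released code; the class and variable names are assumptions.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """Sketch of the answer classification network described in the paper:
    a 500-d reasoning vector g -> 2,500-d hidden layer -> 4,096 answers,
    with dropout (p(keep) = 0.5) before and batch norm after each FC layer."""

    def __init__(self, in_dim=500, hidden_dim=2500, n_answers=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p=0.5),               # p(keep) = 0.5 -> drop prob 0.5
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),      # batch norm after each FC layer
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, n_answers),
            nn.BatchNorm1d(n_answers),
        )

    def forward(self, g):
        return self.net(g)                   # unnormalized answer logits

model = AnswerClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0007)  # no lr decay

# One illustrative training step on a dummy batch of reasoning vectors.
g = torch.randn(8, 500)
labels = torch.randint(0, 4096, (8,))
logits = model(g)                            # shape (8, 4096)
loss = nn.functional.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip at 5
optimizer.step()
```

In a full reproduction this classifier would be trained jointly with the encoder units and bilinear interaction models, as the quoted setup specifies; only the classification head is sketched here.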