Learning Conditioned Graph Structures for Interpretable Visual Question Answering
Authors: Will Norcliffe-Brown, Stathis Vafeias, Sarah Parisot
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. |
| Researcher Affiliation | Industry | Will Norcliffe-Brown Aim Brain Ltd. will.norcliffe@aimbrain.com Efstathios Vafeias Aim Brain Ltd. stathis@aimbrain.com Sarah Parisot Aim Brain Ltd. sarah@aimbrain.com |
| Pseudocode | No | The paper describes the model architecture and components in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at github.com/aimbrain/vqa-project. |
| Open Datasets | Yes | We evaluate our model using the VQA 2.0 dataset [20] which contains a total of 1,105,904 questions and about 204,721 images from the COCO dataset. |
| Dataset Splits | Yes | The dataset is split up roughly into proportions of 40%, 20%, 40% for train, validation and test sets respectively. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper mentions using a 'dynamic Gated Recurrent Unit (GRU) [17]' and 'Adam optimizer [22]' but does not specify software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | Our question encoder is a dynamic Gated Recurrent Unit (GRU) [17] with a hidden state size of 1024 (dq = 1024). Our function F (see Eq. 1), which learns the adjacency matrix, comprises two dense linear layers of size 512 (dg = 512). We use L=2 spatial graph convolution layers of dimensions 2048 and 1024 so that (dh1 = 2048, dh2 = 1024). All dense layers and convolutional layers are activated using Rectified Linear Unit (ReLU) activation functions. During training we use dropout on the image features and on all but the final dense layer's nodes with a 0.5 probability. We train for 35 epochs using a batch size of 64 and the Adam optimizer [22] with a learning rate of 0.0001, which we halve after the 30th epoch. |
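The reported hyperparameters can be collected into a single configuration for reimplementation. The sketch below is a minimal, framework-agnostic rendering of the quoted setup; the class and function names are hypothetical (the authors' actual implementation is at github.com/aimbrain/vqa-project), and only the numeric values come from the paper.

```python
# Hypothetical config sketch for the reported training setup.
# Names (VQAGraphConfig, learning_rate) are illustrative, not from the paper's code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class VQAGraphConfig:
    d_q: int = 1024              # GRU question-encoder hidden state size
    d_g: int = 512               # each of the two dense layers in F (Eq. 1)
    graph_conv_dims: List[int] = field(
        default_factory=lambda: [2048, 1024]  # d_h1, d_h2 (L = 2 layers)
    )
    dropout: float = 0.5         # on image features and all but the final dense layer
    epochs: int = 35
    batch_size: int = 64
    base_lr: float = 1e-4        # Adam optimizer
    lr_halve_epoch: int = 30     # learning rate halved after the 30th epoch


def learning_rate(cfg: VQAGraphConfig, epoch: int) -> float:
    """Step schedule described in the paper: constant, then halved after epoch 30.

    Epochs are 1-indexed here; epochs 1-30 use the base rate, 31-35 use half.
    """
    return cfg.base_lr / 2 if epoch > cfg.lr_halve_epoch else cfg.base_lr


cfg = VQAGraphConfig()
print(learning_rate(cfg, 30))  # 0.0001
print(learning_rate(cfg, 31))  # 5e-05
```

A dataclass like this makes the schedule and layer sizes explicit and easy to audit against the quoted text, which is useful given that the paper does not pin software versions.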