Dynamic Language Binding in Relational Visual Reasoning

Authors: Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of the model is demonstrated on image question answering, with favorable performance on major VQA datasets. Both qualitative and quantitative results indicate that LOGNet has advantages over state-of-the-art methods in answering long and complex questions, and the results show superior performance even when trained on just 10% of the data. The model is evaluated on multiple datasets (CLEVR, CLEVR-Human, GQA, VQA v2), and ablation studies are conducted on a CLEVR subset of 10% training data (see Table 4).
Researcher Affiliation | Academia | Thao Minh Le, Vuong Le, Svetha Venkatesh and Truyen Tran, Applied Artificial Intelligence Institute, Deakin University, Australia. {lethao,vuong.le,svetha.venkatesh,truyen.tran}@deakin.edu.au
Pseudocode | No | The paper describes the model in detail but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code.
Open Datasets | Yes | We evaluate our model on multiple datasets including: CLEVR [Johnson et al., 2017a]: presents several reasoning tasks such as transitive relations and attribute comparison. CLEVR-Human [Johnson et al., 2017b]: composes natural language question-answer pairs on images from CLEVR. GQA [Hudson and Manning, 2019b]: the current largest visual relational reasoning dataset, providing semantic scene graphs coupled with images. VQA v2 [Goyal et al., 2017]: as a large portion of questions is short and can be answered by looking for facts in images, we design experiments with a split of only long questions (>7 words); a data-preparation sketch illustrating this split follows the table.
Dataset Splits | No | The paper mentions 'Val. Acc. (%)' in tables, indicating a validation set was used, but it does not explicitly provide split percentages or sample counts for training, validation, and test sets. The '10% of training data' and '20% and 50% splits' it mentions refer to the size of the training subset used, not a complete train/validation/test partition.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU or CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions software components such as Faster R-CNN, BiLSTM, GCN, ResNet, and pretrained GloVe vectors, but it does not specify version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | Our model is generally implemented with feature dimension d = 512, reasoning depth T = 8, GCN depth H = 8 and attention width K = 2. The number of regions is N = 14 for CLEVR and CLEVR-Human, 100 for GQA, and 36 for VQA v2 to match other related methods. We also match the word embeddings with others by using random vectors from a uniform distribution for CLEVR/CLEVR-Human and pretrained GloVe vectors for the other datasets.
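
Since no code is released, the following is a minimal sketch of how the reported hyperparameters could be gathered into a single configuration; the class, field, and dataset-key names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical configuration sketch of the hyperparameters reported above.
# LOGNetConfig and DATASET_SETTINGS are illustrative names, not the authors' code.
from dataclasses import dataclass

@dataclass
class LOGNetConfig:
    feature_dim: int = 512     # d = 512
    reasoning_depth: int = 8   # T = 8 reasoning steps
    gcn_depth: int = 8         # H = 8
    attention_width: int = 2   # K = 2

# Per-dataset number of regions N and word-embedding initialisation, as reported.
DATASET_SETTINGS = {
    "CLEVR":       {"num_regions": 14,  "word_embeddings": "uniform-random"},
    "CLEVR-Human": {"num_regions": 14,  "word_embeddings": "uniform-random"},
    "GQA":         {"num_regions": 100, "word_embeddings": "glove"},
    "VQA v2":      {"num_regions": 36,  "word_embeddings": "glove"},
}
```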
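
Similarly, the VQA v2 long-question split and the reduced training subsets used in the ablations could be approximated with simple filters such as the ones below; the function names and the seeding strategy are assumptions, since the paper does not describe the exact procedure.

```python
# Hypothetical data-preparation sketch for the VQA v2 long-question split (>7 words)
# and the reduced (10%/20%/50%) training subsets; names and seeding are assumptions.
import random

def is_long_question(question: str) -> bool:
    """True for questions longer than 7 words (the long-question split)."""
    return len(question.split()) > 7

def subsample_training_set(examples: list, fraction: float = 0.1, seed: int = 0) -> list:
    """Draw a random training subset, e.g. 10% of CLEVR for the ablation study."""
    rng = random.Random(seed)
    k = int(len(examples) * fraction)
    return rng.sample(examples, k)
```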