Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Authors: Raeid Saqur, Karthik Narasimhan

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MGN on two tasks: a binary classification task of predicting if a caption matches an image based on attribute compositions in the CLEVR dataset [28], and CLOSURE [6], a recently released challenge for testing systematic generalization in language.
Researcher Affiliation | Collaboration | 1 University of Toronto, Computer Science; 2 Princeton University, Computer Science; 3 Vector Institute for Artificial Intelligence. raeidsaqur@cs.[toronto|princeton].edu; Karthik Narasimhan, Department of Computer Science, Princeton University, karthikn@cs.princeton.edu
Pseudocode | No | The paper describes processes and architectures but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/raeidsaqur/mgn
Open Datasets | Yes | We use images from the CLEVR dataset [28] and use their template generator to produce captions that are both true and false. The original dataset contains 1M questions generated from 100k images with 90 question template families...
Dataset Splits | Yes | All models were trained using Adam with a learning rate of 5×10⁻⁴ and a batch size of 64 for a maximum of 360k iterations, with early stopping based on validation accuracy.
Hardware Specification | No | No specific hardware (e.g., GPU/CPU models, memory) used for running the experiments is mentioned in the paper.
Software Dependencies | No | The paper mentions "PyTorch Geometric [13]" and the "en_core_web_sm" language model, but does not provide specific version numbers for PyTorch, SpaCy, or PyTorch Geometric itself.
Experiment Setup | Yes | All models were trained using Adam with a learning rate of 5×10⁻⁴ and a batch size of 64 for a maximum of 360k iterations, with early stopping based on validation accuracy. ... A learning rate of 0.01 with weight decay 5×10⁻⁴ was used with the cross-entropy loss function. ... Both the encoder and decoder have hidden layers with a 256-dim hidden vector. We set the dimensions of both the encoder and decoder word vectors to be 300, and the multimodal graph vector representation to be 100. ... We use a learning rate of 1×10⁻⁵ and a batch size of 64 for a maximum of 1,000,000 iterations.
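The quoted hyperparameters can be collected into a minimal, stdlib-only sketch of the training setup. This is illustrative, not the authors' code: the `CONFIG` dict simply records the values quoted above, and the `EarlyStopping` helper (including its `patience` value) is a hypothetical reading of "early stopping based on validation accuracy".

```python
# Hyperparameters quoted from the paper for the caption-matching task.
CONFIG = {
    "optimizer": "Adam",
    "lr": 5e-4,            # "learning rate of 5×10⁻⁴"
    "batch_size": 64,
    "max_iterations": 360_000,
}


class EarlyStopping:
    """Stop training once validation accuracy stops improving.

    Hypothetical helper; the paper does not specify a patience value,
    so patience=5 here is an assumption.
    """

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_rounds = 0

    def step(self, val_acc: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience
```

In a training loop, `stopper.step(val_acc)` would be checked after each validation pass, breaking out of the loop before `CONFIG["max_iterations"]` is reached if accuracy plateaus.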