Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
Authors: Raeid Saqur, Karthik Narasimhan
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MGN on two tasks: a binary classification task of predicting if a caption matches an image based on attribute compositions in the CLEVR dataset [28], and CLOSURE [6], a recently released challenge for testing systematic generalization in language. |
| Researcher Affiliation | Collaboration | ¹University of Toronto, Computer Science; ²Princeton University, Computer Science; ³Vector Institute for Artificial Intelligence; raeidsaqur@cs.[toronto|princeton].edu. Karthik Narasimhan, Department of Computer Science, Princeton University, karthikn@cs.princeton.edu |
| Pseudocode | No | The paper describes processes and architectures but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/raeidsaqur/mgn |
| Open Datasets | Yes | We use images from the CLEVR dataset [28] and use their template generator to produce captions that are both true and false. The original dataset contains 1M questions generated from 100k images with 90 question template families... |
| Dataset Splits | Yes | All models were trained using Adam with a learning rate of 5 × 10⁻⁴, a batch size of 64 for a maximum of 360k iterations, with early stopping based on validation accuracy. |
| Hardware Specification | No | The paper does not mention the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions "PyTorch Geometric [13]" and the "en_core_web_sm" language model, but provides no version numbers for PyTorch, spaCy, or PyTorch Geometric itself (see the version-logging sketch after the table). |
| Experiment Setup | Yes | All models were trained using Adam with a learning rate of 5 × 10⁻⁴, a batch size of 64 for a maximum of 360k iterations, with early stopping based on validation accuracy. ... A learning rate of 0.01 with weight decay 5 × 10⁻⁴ was used with the cross-entropy loss function. ... Both the encoder and decoder have hidden layers with a 256-dim hidden vector. We set the dimensions of both the encoder and decoder word vectors to be 300, and the multimodal graph vector representation to be 100. ... We use a learning rate of 1 × 10⁻⁵ and a batch size of 64 for a maximum of 1,000,000 iterations. (A hedged training-loop sketch follows the table.) |
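Because the paper names its key dependencies without versions, a reproduction should pin and log them itself. Below is a minimal sketch, assuming the standard PyPI distribution names (`torch`, `torch-geometric`, `spacy`), which the paper does not state:

```python
# Minimal version-logging sketch for a reproduction attempt.
# Package names are the usual PyPI identifiers (an assumption; the paper
# names the libraries but not their distributions or versions).
from importlib.metadata import version, PackageNotFoundError

for pkg in ("torch", "torch-geometric", "spacy"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

import spacy

# The paper names the en_core_web_sm language model; its version is
# likewise unreported, so record whatever is installed locally.
nlp = spacy.load("en_core_web_sm")
print(f"en_core_web_sm=={nlp.meta['version']}")
```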
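The Experiment Setup row quotes several distinct configurations (the main training run, a second run with lr 0.01, and the encoder/decoder dimensions). A hedged sketch of just the main run's optimizer and early-stopping loop follows, with the model and data as placeholders; epoch-level validation checks and a patience of 10 are assumptions, as neither is specified in the paper:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders standing in for the MGN model and the CLEVR caption data.
model = nn.Linear(100, 2)
data = TensorDataset(torch.randn(512, 100), torch.randint(0, 2, (512,)))
loader = DataLoader(data, batch_size=64, shuffle=True)     # batch size 64 (paper)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # Adam, lr 5e-4 (paper)
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss (paper)

max_iters = 360_000   # maximum iterations (paper)
patience = 10         # early-stopping patience: an assumption, not in the paper
best_val, stale, it = float("-inf"), 0, 0

while it < max_iters and stale < patience:
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        it += 1
        if it >= max_iters:
            break
    val_acc = 0.0  # placeholder: evaluate accuracy on the validation split here
    if val_acc > best_val:
        best_val, stale = val_acc, 0  # improvement: reset the stall counter
    else:
        stale += 1                    # no improvement: count toward early stop
```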