Supervising the Transfer of Reasoning Patterns in VQA

Authors: Corentin Kervadec, Christian Wolf, Grigory Antipov, Moez Baccouche, Madiha Nadri

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We also demonstrate the effectiveness of this approach experimentally on the GQA dataset and show its complementarity to BERT-like self-supervised pre-training."; "5 Experimental results"
Researcher Affiliation | Collaboration | (1) Orange Innovation, France; (2) LIRIS, INSA-Lyon, France; (3) LAGEPP, Université de Lyon, France
Pseudocode | No | The paper describes the method's steps and components (e.g., program decoder), but it does not include any formal pseudocode blocks or algorithm listings.
Open Source Code | No | "We do not include the code, but we provide instructions needed to reproduce our experimental results in Section 3"
Open Datasets | Yes | "We also demonstrate the effectiveness of this approach experimentally on the GQA dataset"; "We use ground truth information from the GQA [15] dataset"; "Evaluation: is performed on GQA [15] and GQA-OOD [18] test sets."
Dataset Splits | Yes | "Our models are trained on the balanced GQA [15] training set (~1M question-answer pairs)."; "Hyper-parameters are selected either on the test-dev (for GQA) or validation (for GQA-OOD) sets."; "Evaluation: is performed on GQA [15] (test-dev and test-std) and GQA-OOD [18] test sets."
Hardware Specification | No | The hardware specifications are deferred to the supplementary material rather than stated in the main paper: "See supp. mat."
Software Dependencies | No | The paper mentions models and architectures (e.g., LXMERT, BERT, faster-RCNN, VinVL, GRU) that imply particular software, but it does not name any software with version numbers needed to reproduce the experiments (e.g., "PyTorch 1.9", "Python 3.8").
Experiment Setup | Yes | "Hyper-parameters are selected either on the test-dev (for GQA) or validation (for GQA-OOD) sets."; "we perform our experiments with a compact version of the Vision-Language (VL)-Transformer used in [30] (cf. Fig. 2), with a hidden embedding size of d=128 and h=4 heads per layer (only 26M trainable parameters)."; "we use faster-RCNN [25] with 36 objects per image."
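
The Experiment Setup row above quotes the only architectural hyper-parameters given in the main paper (hidden size d=128, h=4 heads per layer, 36 faster-RCNN objects per image, about 26M trainable parameters). The sketch below is a minimal, hypothetical PyTorch reconstruction at that scale: the layer count, visual feature dimension, vocabulary size, and single-stream layout are assumptions made for illustration, not the authors' released architecture.

```python
# Minimal sketch of a compact VL-Transformer at the scale reported in the
# "Experiment Setup" row (d=128, h=4 heads, 36 visual objects per image).
# Layer count, feature dimensions, vocabulary size, and the single-stream
# layout are assumptions for illustration, not the authors' released code.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class VLTransformerConfig:
    hidden_size: int = 128       # embedding dimension d reported in the paper
    num_heads: int = 4           # attention heads h per layer
    num_layers: int = 4          # assumed depth (not specified in this section)
    num_objects: int = 36        # faster-RCNN regions per image
    vision_feat_dim: int = 2048  # assumed faster-RCNN feature size
    vocab_size: int = 30522      # assumed BERT-style vocabulary


class CompactVLTransformer(nn.Module):
    """Single-stream sketch: project both modalities to d and run a joint encoder."""

    def __init__(self, cfg: VLTransformerConfig):
        super().__init__()
        self.word_emb = nn.Embedding(cfg.vocab_size, cfg.hidden_size)
        self.obj_proj = nn.Linear(cfg.vision_feat_dim, cfg.hidden_size)
        layer = nn.TransformerEncoderLayer(
            d_model=cfg.hidden_size,
            nhead=cfg.num_heads,
            dim_feedforward=4 * cfg.hidden_size,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=cfg.num_layers)

    def forward(self, token_ids: torch.Tensor, obj_feats: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); obj_feats: (batch, 36, vision_feat_dim)
        tokens = self.word_emb(token_ids)
        objects = self.obj_proj(obj_feats)
        joint = torch.cat([tokens, objects], dim=1)  # concatenate language and vision tokens
        return self.encoder(joint)


if __name__ == "__main__":
    cfg = VLTransformerConfig()
    model = CompactVLTransformer(cfg)
    out = model(
        torch.randint(0, cfg.vocab_size, (2, 20)),
        torch.randn(2, cfg.num_objects, cfg.vision_feat_dim),
    )
    print(out.shape)  # torch.Size([2, 56, 128])
```

Note that this toy model is far smaller than the reported 26M parameters; reaching that count would require the full configuration used by the authors, which is not spelled out in this section.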