Human-Adversarial Visual Question Answering
Authors: Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Magana, Tristan Thrush, Wojciech Galuba, Devi Parikh, Douwe Kiela
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a wide range of existing VQA models on AdVQA and find that their performance is significantly lower than on the commonly used VQA v2 dataset (see Table 1). Furthermore, we conduct an extensive analysis of AdVQA characteristics and contrast them with the VQA v2 dataset. |
| Researcher Affiliation | Collaboration | Facebook AI Research; Tecnológico de Monterrey; Georgia Tech |
| Pseudocode | No | The paper describes data collection steps and model evaluation but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The evaluation benchmark for AdVQA is available at https://adversarialvqa.org for the community, and we hope that AdVQA will help bridge the gap by serving as a dynamic new benchmark for visual reasoning with a large amount of headroom for further progress in the field. |
| Open Datasets | Yes | The commonly used VQA dataset [20] was collected by instructing annotators to "ask a question about this scene that [a] smart robot probably can not answer" [4]. The VQA v2 dataset is based on COCO [39] images. We collected adversarial questions on both val2017 COCO images and test-dev2015 COCO images. |
| Dataset Splits | Yes | We collected adversarial questions on both val2017 COCO images and test-dev2015 COCO images. We then randomly sampled the collected set down to 2 questions per val2017 COCO image (10,000 questions) and 1 question per test-dev2015 COCO image (36,807 questions); a minimal sampling sketch follows the table. We train all the models in Table 3 on the VQA train + val split excluding the COCO 2017 validation images. We collect our AdVQA validation set on the images from the COCO 2017 val split, ensuring there is no overlap with the training set images. We choose the best checkpoint for each model by validating on the AdVQA validation set. |
| Hardware Specification | Yes | We run most of our experiments on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch in its references but does not state version numbers for the software dependencies used in its experiments. |
| Experiment Setup | Yes | We train all the models in Table 3 on the VQA train + val split excluding the COCO 2017 validation images. We do not do any hyperparameter search for these models and use the best hyperparameters as provided by the respective authors. We finetune each model with three different seeds and report average accuracy; a seed-averaging sketch follows the table. |
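
The Dataset Splits row describes subsampling the collected adversarial questions down to a fixed number per image (2 per val2017 image, 1 per test-dev2015 image). The sketch below illustrates that kind of per-image random subsampling; the `image_id`/`question` record layout and the `seed` argument are assumptions for illustration, not the paper's released data format or code.

```python
import random
from collections import defaultdict

def subsample_questions(questions, max_per_image, seed=0):
    """Keep at most `max_per_image` randomly chosen questions per image.

    `questions` is assumed to be a list of dicts with an "image_id" key
    (hypothetical structure; the released AdVQA files may differ).
    """
    rng = random.Random(seed)
    by_image = defaultdict(list)
    for q in questions:
        by_image[q["image_id"]].append(q)

    kept = []
    for qs in by_image.values():
        rng.shuffle(qs)          # random order within each image
        kept.extend(qs[:max_per_image])
    return kept

# Per the paper's description:
# val_kept  = subsample_questions(val2017_questions, max_per_image=2)
# test_kept = subsample_questions(testdev2015_questions, max_per_image=1)
```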
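
The Experiment Setup row states that each model is finetuned with three different seeds and the average accuracy is reported. The sketch below shows that averaging step only; `finetune_and_eval` is a hypothetical callable standing in for a full training-plus-evaluation run, and the seed values are placeholders since the paper does not list the specific seeds used.

```python
import statistics

def accuracy_over_seeds(finetune_and_eval, seeds=(0, 1, 2)):
    """Finetune and evaluate once per seed, then report the mean accuracy.

    `finetune_and_eval(seed)` is assumed to return a single accuracy value
    (e.g., VQA accuracy on the AdVQA validation set).
    """
    accuracies = [finetune_and_eval(seed) for seed in seeds]
    return statistics.mean(accuracies)
```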