Human-Adversarial Visual Question Answering
Authors: Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Magana, Tristan Thrush, Wojciech Galuba, Devi Parikh, Douwe Kiela
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a wide range of existing VQA models on AdVQA and find that their performance is significantly lower than on the commonly used VQA v2 dataset (see Table 1). Furthermore, we conduct an extensive analysis of AdVQA characteristics and contrast them with the VQA v2 dataset. |
| Researcher Affiliation | Collaboration | Facebook AI Research; Tecnológico de Monterrey; Georgia Tech |
| Pseudocode | No | The paper describes data collection steps and model evaluation but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The evaluation benchmark for AdVQA is available at https://adversarialvqa.org for the community, and we hope that AdVQA will help bridge the gap by serving as a dynamic new benchmark for visual reasoning with a large amount of headroom for further progress in the field. |
| Open Datasets | Yes | The commonly used VQA dataset [20] was collected by instructing annotators to "ask a question about this scene that [a] smart robot probably can not answer" [4]. The VQA v2 dataset is based on COCO [39] images. We collected adversarial questions on both val2017 COCO images and test-dev2015 COCO images. |
| Dataset Splits | Yes | We collected adversarial questions on both val2017 COCO images and test-dev2015 COCO images. We then randomly sampled the collected set down to 2 questions per val2017 COCO image (10,000 questions) and 1 question per test-dev2015 COCO image (36,807 questions); a minimal sampling sketch follows the table. We train all the models in Table 3 on the VQA train + val split excluding the COCO 2017 validation images. We collect our AdVQA validation set on the images from the COCO 2017 val split, ensuring there is no overlap with the training set images. We choose the best checkpoint for each model by validating on the AdVQA validation set. |
| Hardware Specification | Yes | We run most of our experiments on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch in its references but does not state version numbers for the software dependencies used in its experiments. |
| Experiment Setup | Yes | We train all the models in Table 3 on the VQA train + val split excluding the COCO 2017 validation images. We do not do any hyperparameter search for these models and use the best hyperparameters as provided by the respective authors. We finetune each model with three different seeds and report average accuracy; a seed-averaging sketch follows the table. |
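
The Dataset Splits row describes subsampling the collected adversarial questions down to a fixed number per image (2 per val2017 image, 1 per test-dev2015 image). The sketch below illustrates that kind of per-image random subsampling; the `image_id`/`question` record layout and the `seed` argument are assumptions for illustration, not the paper's released data format or code.

```python
import random
from collections import defaultdict

def subsample_questions(questions, max_per_image, seed=0):
    """Keep at most `max_per_image` randomly chosen questions per image.

    `questions` is assumed to be a list of dicts with an "image_id" key
    (hypothetical structure; the released AdVQA files may differ).
    """
    rng = random.Random(seed)
    by_image = defaultdict(list)
    for q in questions:
        by_image[q["image_id"]].append(q)

    kept = []
    for qs in by_image.values():
        rng.shuffle(qs)          # random order within each image
        kept.extend(qs[:max_per_image])
    return kept

# Per the paper's description:
# val_kept  = subsample_questions(val2017_questions, max_per_image=2)
# test_kept = subsample_questions(testdev2015_questions, max_per_image=1)
```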
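
The Experiment Setup row states that each model is finetuned with three different seeds and the average accuracy is reported. The sketch below shows that averaging step only; `finetune_and_eval` is a hypothetical callable standing in for a full training-plus-evaluation run, and the seed values are placeholders since the paper does not list the specific seeds used.

```python
import statistics

def accuracy_over_seeds(finetune_and_eval, seeds=(0, 1, 2)):
    """Finetune and evaluate once per seed, then report the mean accuracy.

    `finetune_and_eval(seed)` is assumed to return a single accuracy value
    (e.g., VQA accuracy on the AdVQA validation set).
    """
    accuracies = [finetune_and_eval(seed) for seed in seeds]
    return statistics.mean(accuracies)
```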