Multi-Modal Answer Validation for Knowledge-Based VQA
Authors: Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
AAAI 2022, pp. 2712-2721 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results. |
| Researcher Affiliation | Collaboration | Jialin Wu (1), Jiasen Lu (2), Ashish Sabharwal (2), Roozbeh Mottaghi (2); (1) The University of Texas at Austin, (2) Allen Institute for AI; jialinwu@utexas.edu, {jiasenl, ashishs, roozbehm}@allenai.org |
| Pseudocode | No | The paper describes the framework steps but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/jialinwu17/MAVEX |
| Open Datasets | Yes | We evaluate MAVEx on OK-VQA (Marino et al. 2019), the largest knowledge-based VQA dataset to date. |
| Dataset Splits | Yes | We evaluate MAVEx on OK-VQA (Marino et al. 2019), the largest knowledge-based VQA dataset at present. The dataset contains 14,031 images and 14,055 questions... We use the finetuned model to extract the top 5 answers for each question in the training and test set. (A sketch of this top-5 candidate extraction step follows the table.) |
| Hardware Specification | Yes | We use PyTorch 1.4 on a single TITAN V GPU with 12 GB of memory for each run, and training a single model generally takes 22 hours. (A minimal environment-check sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'PyTorch 1.4' but does not provide version numbers for other significant software dependencies such as AllenNLP, the T5 model, Mask R-CNN, or the specific BERT/TinyBERT implementations used. |
| Experiment Setup | Yes | We finetune the ViLBERT-multi-task model on OK-VQA using the default configuration for 150 epochs for answer candidate generation... We train the system for 75 epochs using a learning rate of 2e-5 for the ViLBERT parameters and 5e-5 for the additional parameters introduced in the validation module... The number of hidden units in the multi-head attention modules is set to 512. (An optimizer sketch using these two learning rates follows the table.) |
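
The Dataset Splits row mentions extracting the top 5 answers per question with the finetuned model. The sketch below shows one way such a top-5 candidate step could look in PyTorch; `vqa_model`, `batch`, and `answer_vocab` are hypothetical placeholders and are not taken from the MAVEx codebase.

```python
import torch

# Hedged sketch of the top-5 answer-candidate step quoted in the Dataset Splits
# row: the finetuned VQA model scores every answer in its vocabulary and the
# five highest-scoring answers are kept per question. `vqa_model`, `batch`, and
# `answer_vocab` are placeholders, not names from the MAVEx code.

@torch.no_grad()
def top5_answer_candidates(vqa_model, batch, answer_vocab):
    logits = vqa_model(**batch)                 # (batch_size, vocab_size) answer scores
    scores, indices = logits.topk(k=5, dim=-1)  # top-5 answers per question
    return [[answer_vocab[i] for i in row] for row in indices.tolist()]
```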
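
The Hardware Specification and Software Dependencies rows pin only PyTorch 1.4 and a single 12 GB TITAN V. As a minimal reproduction aid, the sketch below checks those two stated facts at runtime; everything beyond the version prefix and the memory size is a generic sanity check, not part of the paper.

```python
import torch

# Sanity-check the environment quoted in the table: PyTorch 1.4 on a single
# 12 GB GPU (TITAN V in the paper). The thresholds below are assumptions chosen
# to match those figures, not values from the MAVEx repository.
assert torch.__version__.startswith("1.4"), f"expected PyTorch 1.4, got {torch.__version__}"
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, {props.total_memory / 1024 ** 3:.1f} GB")
assert props.total_memory >= 11 * 1024 ** 3, "the reported runs used a 12 GB card"
```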
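
The Experiment Setup row quotes two learning rates: 2e-5 for the ViLBERT parameters and 5e-5 for the parameters added by the validation module. A standard way to express that in PyTorch is with optimizer parameter groups, sketched below; the choice of Adam and the attribute names `model.vilbert` and `model.validation_module` are assumptions, since the table does not specify them.

```python
import torch

# Minimal sketch, assuming the MAVEx model exposes its ViLBERT backbone and the
# added validation module as `model.vilbert` and `model.validation_module`
# (hypothetical names). Each parameter group gets the learning rate quoted in
# the Experiment Setup row.

def build_optimizer(model):
    backbone_params = [p for p in model.vilbert.parameters() if p.requires_grad]
    new_params = [p for p in model.validation_module.parameters() if p.requires_grad]
    return torch.optim.Adam(
        [
            {"params": backbone_params, "lr": 2e-5},  # pretrained ViLBERT weights
            {"params": new_params, "lr": 5e-5},       # answer-validation layers
        ]
    )
```

Calling `build_optimizer(model)` once before the quoted 75-epoch training loop would give each group its own learning rate while sharing all other optimizer settings.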