REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Authors: Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, Lu Yuan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments on the standard OK-VQA dataset and achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin (+3.6%). We also conduct detailed analysis and show the necessity of regional information in different framework components for knowledge-based VQA.
Researcher Affiliation | Collaboration | University of Washington; Microsoft. Emails: yuanze@uw.edu; {yujiaxie, dochen, yicxu}@microsoft.com
Pseudocode | No | The paper describes the method using equations and text, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is publicly available at https://github.com/yzleroy/REVIVE.
Open Datasets | Yes | OK-VQA dataset [22] is selected for evaluation, which is currently the largest knowledge-based VQA dataset.
Dataset Splits | No | The paper states 'The training and testing split consist of 9009 and 5046 samples respectively' but does not explicitly mention a validation split or its size. (See the split-checking sketch below the table.)
Hardware Specification | Yes | We use 4 NVIDIA V100 32GB to train models for 10K steps, with a batch size of 8.
Software Dependencies | No | The paper mentions specific pre-trained models like 'GLIP-T', 'VinVL-Large', 'CLIP model (ViT-B/16 variant)', 'T5 model', and 'GPT-3', but does not provide specific version numbers for the underlying software libraries or environments (e.g., PyTorch version, Python version). (See the environment sketch below the table.)
Experiment Setup | Yes | We use 4 NVIDIA V100 32GB to train models for 10K steps, with a batch size of 8. The learning rate is 8e-5 and AdamW [19] is chosen as the optimizer. The warm-up steps are 1K and the trained models are evaluated every 500 steps. (See the training-configuration sketch below the table.)
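
The Dataset Splits row flags that only train/test sizes are documented. Below is a minimal sketch, assuming the annotation file names of the public OK-VQA release, of how a reproducer might verify the reported 9009/5046 counts and carve out their own validation subset; the 10% hold-out fraction and the random seed are assumptions, not values from the paper.

```python
import json
import random

# File names follow the public OK-VQA release; adjust paths as needed (assumption).
TRAIN_ANN = "OpenEnded_mscoco_train2014_questions.json"
TEST_ANN = "OpenEnded_mscoco_val2014_questions.json"

def load_questions(path):
    """Load the question list from an OK-VQA annotation JSON file."""
    with open(path) as f:
        return json.load(f)["questions"]

train_qs = load_questions(TRAIN_ANN)
test_qs = load_questions(TEST_ANN)

# Counts reported in the paper: 9009 train / 5046 test.
assert len(train_qs) == 9009, f"unexpected train size: {len(train_qs)}"
assert len(test_qs) == 5046, f"unexpected test size: {len(test_qs)}"

# The paper does not document a validation split, so a reproducer must define
# one; a 10% hold-out from the training set with seed 0 is an assumption here.
random.seed(0)
random.shuffle(train_qs)
n_val = len(train_qs) // 10
val_qs, train_qs = train_qs[:n_val], train_qs[n_val:]
print(f"train={len(train_qs)}, val={len(val_qs)}, test={len(test_qs)}")
```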
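
Because no library or environment versions are stated, a reproducer has to pin an environment themselves. The sketch below shows one possible way to load two of the named pre-trained components, CLIP ViT-B/16 and a T5 backbone, through Hugging Face transformers; the checkpoint identifiers and the choice of t5-large are assumptions, and the GLIP-T and VinVL-Large detectors are distributed through their own repositories.

```python
# Minimal environment sketch; the paper pins no versions, so the
# transformers/torch versions and checkpoint IDs below are assumptions.
import torch
from transformers import (
    CLIPModel,
    CLIPProcessor,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

# CLIP ViT-B/16 variant mentioned in the paper.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# T5 answer-generation backbone; the exact size is not restated here,
# so "t5-large" is a placeholder (assumption).
t5_tokenizer = T5Tokenizer.from_pretrained("t5-large")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-large")

# GLIP-T and VinVL-Large detector checkpoints are released in their own
# repositories rather than on the Hugging Face hub and must be set up separately.
print("torch version:", torch.__version__)
```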
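
The reported hyperparameters map directly onto a standard optimizer and scheduler configuration. The following is a minimal sketch of that setup in PyTorch, using a tiny stand-in model so it runs end to end; the stand-in model, the dummy data, and the linear decay after warm-up are assumptions, since the paper states only the warm-up length, total steps, batch size, learning rate, and evaluation interval.

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

# Hyperparameters quoted in the paper's experiment setup.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 1_000
BATCH_SIZE = 8            # per the paper, on 4x NVIDIA V100 32GB
LEARNING_RATE = 8e-5
EVAL_EVERY = 500

# Tiny stand-in model and data so the sketch runs end to end; the real model
# is the paper's T5-based answer generator (placeholders only, by assumption).
model = nn.Linear(16, 1)
dummy_inputs = torch.randn(BATCH_SIZE, 16)
dummy_targets = torch.randn(BATCH_SIZE, 1)

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
# Linear decay after warm-up is an assumption; the paper states only the
# warm-up length (1K steps) and the total number of steps (10K).
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

for step in range(1, TOTAL_STEPS + 1):
    loss = nn.functional.mse_loss(model(dummy_inputs), dummy_targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step % EVAL_EVERY == 0:
        # The paper evaluates trained models every 500 steps; a real run
        # would call its OK-VQA evaluation routine here.
        pass
```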