Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

Authors: Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that REPARE improves zero-shot VQA performance by up to 3.85%, 6.41%, and 7.94% on the VQAv2 (Goyal et al., 2017), A-OKVQA (Schwenk et al., 2022), and VizWiz (Gurari et al., 2018) datasets, respectively, using LVLMs including BLIP-2 (Li et al., 2023), MiniGPT-4 (Zhu et al., 2023b), and LLaVA-1.5 (Liu et al., 2023a) models in Sec. 4. Note that all percentages we report in this paper are absolute improvements. We further demonstrate the capabilities of REPARE in an oracle setting, establishing an upper-bound performance increase of up to 9.84%, 14.41%, and 20.09% on VQAv2, A-OKVQA, and VizWiz tasks, respectively. We extensively evaluate our design choices in Sec. 4.1 and quantitatively show the importance of incorporating visual information to address underspecification, as done in REPARE, compared to paraphrasing in Sec. 4.2. We analyze REPARE's outputs using linguistically-informed metrics like average dependency distance (Gibson et al., 2000) and idea density (Boschi et al., 2017). This reveals that the resulting questions are indeed less underspecified, i.e., more complex (see Sec. 4.3). Finally, in Sec. 4.4, we verify that questions from REPARE make better use of existing LVLMs by leveraging the strength of the LLM while still benefiting from the image. (A sketch of the average dependency distance metric is given after the table.)
Researcher Affiliation | Academia | Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal; Department of Computer Science, University of North Carolina at Chapel Hill; {archiki, esteng, mbansal}@cs.unc.edu
Pseudocode | No | The paper describes the method and its stages using textual descriptions and a schematic diagram (Figure 2), but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available: https://github.com/archiki/RepARe
Open Datasets | Yes | Empirically, we show that REPARE improves zero-shot VQA performance by up to 3.85%, 6.41%, and 7.94% on the VQAv2 (Goyal et al., 2017), A-OKVQA (Schwenk et al., 2022), and VizWiz (Gurari et al., 2018) datasets, respectively, using LVLMs including BLIP-2 (Li et al., 2023), MiniGPT-4 (Zhu et al., 2023b), and LLaVA-1.5 (Liu et al., 2023a) models in Sec. 4.
Dataset Splits | Yes | Since the test sets of these benchmarks are not publicly available, we report performance on the validation sets (unless mentioned otherwise). Lastly, we also evaluate on the challenging VizWiz benchmark (Gurari et al., 2018) consisting of real-life information-seeking questions about (often low-quality) images sourced from visually-impaired people. While developing REPARE, we sample a small set of data points from the train set of the datasets to form our dev set. In the direct answer setting, we use the standard soft VQA evaluation metric for VQAv2, VizWiz, and A-OKVQA (Antol et al., 2015). In A-OKVQA's MC setting, we use accuracy. See Appendix A.1 for further dataset details. (A sketch of the soft VQA accuracy metric is given after the table.)
Hardware Specification | No | The paper mentions the models used (BLIP-2, MiniGPT-4, LLaVA-1.5) and their LLM components (Flan-T5, Vicuna) along with parameter counts (e.g., 1B, 0.11B, 3B for BLIP-2 Flan-T5 XL), but it does not specify the actual hardware (e.g., GPU models, CPU types, memory) on which these models were trained or experiments were run.
Software Dependencies | No | The paper mentions several software components and tools such as BLIP-2, MiniGPT-4, LLaVA-1.5, Pegasus (paraphrasing model), Hugging Face (used for models), the BlaBla toolkit, Stanza, and the rake-nltk Python package. However, it does not provide specific version numbers for these software dependencies or libraries (e.g., 'PyTorch 1.x', 'transformers 4.x'). It only cites the papers or provides links to specific model checkpoints without stating the software versions used to run them.
Experiment Setup | Yes | We use n = 5 as default in all our experiments and discuss the impact of increasing n as well as using the full LVLM for sentence fusion in Appendix A.5. To ensure the paraphrasing model generates a valid question ending with '?', we employ constrained decoding by setting a positive constraint on generating the '?' token (Post & Vilar, 2018; Hu et al., 2019). To ensure diverse samples in the sentence fusion stage (which determines the diversity of question candidates), we use top-p sampling (Holtzman et al., 2019) with p = 0.95. To sample rationales, we employ beam search with 5 beams and a temperature of 0.7. (These decoding settings are illustrated in the sketch after the table.)
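
The Research Type row cites average dependency distance (Gibson et al., 2000) as one of the paper's linguistically-informed metrics. Below is a minimal sketch of how such a metric can be computed with Stanza (one of the tools named in the Software Dependencies row); the authors compute it via the BlaBla toolkit, whose exact normalization may differ, so the function below is illustrative only.

import stanza

# Average dependency distance (ADD): mean absolute distance, in token
# positions, between each word and its syntactic head, excluding the root
# arc. Assumed definition; the paper's toolkit may normalize differently.
stanza.download("en")  # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

def average_dependency_distance(text: str) -> float:
    doc = nlp(text)
    dists = [abs(int(word.id) - int(word.head))
             for sent in doc.sentences
             for word in sent.words
             if int(word.head) != 0]  # skip the root relation
    return sum(dists) / len(dists) if dists else 0.0

# Placeholder question; longer, more specified questions tend to score higher.
print(average_dependency_distance("Is the dog lying on the couch brown?"))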
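
The Dataset Splits row quotes the standard soft VQA evaluation metric (Antol et al., 2015). A minimal sketch of that metric follows; it omits the official answer normalization (articles, punctuation, number words) and the averaging over 10-choose-9 annotator subsets performed by the official evaluation script, and the function name is our own.

from collections import Counter

def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy: an answer scores 1.0 if at least 3 of the
    (typically 10) human annotators gave it, and #matches / 3 otherwise."""
    pred = prediction.strip().lower()
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[pred] / 3.0, 1.0)

# Example with made-up annotations: 4 of 10 annotators said "brown".
answers = ["brown"] * 4 + ["tan"] + ["dark brown"] * 5
print(vqa_soft_accuracy("brown", answers))  # 1.0
print(vqa_soft_accuracy("tan", answers))    # 0.333...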
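
The Experiment Setup row lists concrete decoding settings: a positive constraint on the '?' token for paraphrasing, top-p sampling with p = 0.95 for sentence fusion, and beam search with 5 beams and temperature 0.7 for rationales. The sketch below shows how such settings could be expressed with the Hugging Face generate API under our own assumptions: the checkpoint name is a stand-in public Pegasus paraphraser (not necessarily the one used in the paper, and sentence fusion/rationales actually run on the LVLM's LLM backbone), and the input question is a placeholder.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in checkpoint; the paper's exact paraphraser checkpoint may differ.
tok = AutoTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = AutoModelForSeq2SeqLM.from_pretrained("tuner007/pegasus_paraphrase")

question = "What color is the dog on the couch?"  # placeholder input
inputs = tok(question, return_tensors="pt")

# (1) Paraphrasing: constrained beam search with a positive constraint on "?"
#     so every candidate remains a well-formed question.
qmark_ids = tok("?", add_special_tokens=False).input_ids
paraphrases = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=5,
    force_words_ids=[qmark_ids],
    max_new_tokens=40,
)

# (2) Sentence fusion: nucleus (top-p) sampling with p = 0.95 for diverse
#     question candidates.
fused = model.generate(
    **inputs, do_sample=True, top_p=0.95,
    num_return_sequences=5, max_new_tokens=60,
)

# (3) Rationale sampling: 5 beams with temperature 0.7
#     (beam-multinomial sampling in Hugging Face terms).
rationales = model.generate(
    **inputs, num_beams=5, do_sample=True, temperature=0.7,
    num_return_sequences=5, max_new_tokens=60,
)

print(tok.batch_decode(paraphrases, skip_special_tokens=True))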