Improving Automatic VQA Evaluation Using Large Language Models

Authors: Oscar Mañas, Benno Krojer, Aishwarya Agrawal

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of the proposed metric, we collect human judgments on the correctness of answers generated by several state-of-the-art VQA models across three popular VQA benchmarks. Our results demonstrate that LAVE correlates better with human judgment compared to existing metrics in diverse settings (Fig. 1). We also systematically categorize the failure modes of VQA Accuracy and show that LAVE is able to recover most missed correct candidate answers. In addition, we conduct ablation studies to assess the impact of each design choice on the performance of LAVE.
Researcher Affiliation | Academia | Oscar Mañas¹,², Benno Krojer¹,³, Aishwarya Agrawal¹,² (¹Mila, ²Université de Montréal, ³McGill University)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | We plan to release the evaluation code and collected human judgments.
Open Datasets | Yes | We use these VQA models to generate answers for three VQA datasets: VQAv2 (Goyal et al. 2017), VG-QA (Krishna et al. 2017), and OK-VQA (Marino et al. 2019).
Dataset Splits | Yes | We additionally collected validation/development sets of human judgments for answers generated by BLIP-2 on VQAv2 and VG-QA, and BLIP on VQAv2 and VG-QA (1000 questions each). In total, our validation set contains 4k questions, which serve to guide our design choices.
Hardware Specification | No | The paper states: "We leverage the Hugging Face Transformers (Wolf et al. 2020) implementation of Flan-T5 and LLaMA (for Vicuna), and use GPT-3.5-Turbo through OpenAI's API." This describes the software and API used, but not the specific hardware (e.g., GPU models, CPU types) used to run these computations or to access the API.
Software Dependencies | Yes | We consider Flan-T5-XXL and Vicuna-v1.3-13B as open-source LLMs, and GPT-3.5-Turbo (gpt-3.5-turbo-0613) as a closed-source LLM. We leverage the Hugging Face Transformers (Wolf et al. 2020) implementation of Flan-T5 and LLaMA (for Vicuna), and use GPT-3.5-Turbo through OpenAI's API.
Experiment Setup | Yes | We optimize our prompt for Flan-T5 (Sec. ) and subsequently use the same prompt with the other LLMs. To make generation deterministic, we perform greedy decoding, or equivalently set the temperature to 0 in OpenAI's API.
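
The Software Dependencies and Experiment Setup rows pin down a concrete decoding configuration. As a minimal sketch of the open-source path (not the authors' released code), the following assumes the Hugging Face model id "google/flan-t5-xxl" and uses a simplified, hypothetical rating prompt; the paper's actual LAVE prompt is optimized separately and is not reproduced here:

```python
# Sketch: deterministic generation with Flan-T5-XXL via Hugging Face
# Transformers, mirroring the greedy-decoding setup described above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xxl"  # Flan-T5-XXL, as named in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical LAVE-style rating prompt (illustrative only).
prompt = (
    "Rate the correctness of the candidate answer.\n"
    "Question: What color is the bus?\n"
    "Reference answers: red, dark red\n"
    "Candidate answer: maroon\n"
    "Rating:"
)

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=False selects greedy decoding, making generation deterministic.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```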
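
For the closed-source path, setting the temperature to 0 in OpenAI's API approximates greedy decoding. A sketch using the current openai Python client follows; the paper likely used an earlier client version, and the gpt-3.5-turbo-0613 snapshot it names has since been deprecated, so this is illustrative rather than exactly reproducible:

```python
# Sketch: temperature=0 with GPT-3.5-Turbo, matching the deterministic
# generation choice in the Experiment Setup row. Uses the openai>=1.0 client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same hypothetical rating prompt as in the Flan-T5 sketch above.
prompt = (
    "Rate the correctness of the candidate answer.\n"
    "Question: What color is the bus?\n"
    "Reference answers: red, dark red\n"
    "Candidate answer: maroon\n"
    "Rating:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",  # model snapshot named in the paper
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # equivalent of greedy decoding in OpenAI's API
    max_tokens=10,
)
print(response.choices[0].message.content)
```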