Improving Automatic VQA Evaluation Using Large Language Models

Authors: Oscar Mañas, Benno Krojer, Aishwarya Agrawal

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of the proposed metric, we collect human judgments on the correctness of answers generated by several state-of-the-art VQA models across three popular VQA benchmarks. Our results demonstrate that LAVE correlates better with human judgment compared to existing metrics in diverse settings (Fig. 1). We also systematically categorize the failure modes of VQA Accuracy and show that LAVE is able to recover most missed correct candidate answers. In addition, we conduct ablation studies to assess the impact of each design choice on the performance of LAVE.
Researcher Affiliation | Academia | Oscar Mañas¹,², Benno Krojer¹,³, Aishwarya Agrawal¹,² (¹Mila, ²Université de Montréal, ³McGill University)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | We plan to release the evaluation code and collected human judgments.
Open Datasets | Yes | We use these VQA models to generate answers for three VQA datasets: VQAv2 (Goyal et al. 2017), VG-QA (Krishna et al. 2017), and OK-VQA (Marino et al. 2019).
Dataset Splits | Yes | We additionally collected validation/development sets of human judgments for answers generated by BLIP-2 on VQAv2 and VG-QA, and BLIP on VQAv2 and VG-QA (1000 questions each). In total, our validation set contains 4k questions, which serve to guide our design choices.
Hardware Specification | No | The paper states: "We leverage the Hugging Face Transformers (Wolf et al. 2020) implementation of Flan-T5 and LLaMA (for Vicuna), and use GPT-3.5-Turbo through OpenAI's API." This describes the software and API used, but not the specific hardware (e.g., GPU models, CPU types) used to run these computations or to access the API.
Software Dependencies | Yes | We consider Flan-T5-XXL and Vicuna-v1.3-13B as open-source LLMs, and GPT-3.5-Turbo (gpt-3.5-turbo-0613) as a closed-source LLM. We leverage the Hugging Face Transformers (Wolf et al. 2020) implementation of Flan-T5 and LLaMA (for Vicuna), and use GPT-3.5-Turbo through OpenAI's API.
Experiment Setup | Yes | We optimize our prompt for Flan-T5 (Sec. ) and subsequently use the same prompt with the other LLMs. To make generation deterministic, we perform greedy decoding, or equivalently set the temperature to 0 in OpenAI's API.
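
The Software Dependencies and Experiment Setup rows pin down a concrete decoding configuration. As a minimal sketch of the open-source path (not the authors' released code), the following assumes the Hugging Face model id "google/flan-t5-xxl" and uses a simplified, hypothetical rating prompt; the paper's actual LAVE prompt is optimized separately and is not reproduced here:

```python
# Sketch: deterministic generation with Flan-T5-XXL via Hugging Face
# Transformers, mirroring the greedy-decoding setup described above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xxl"  # Flan-T5-XXL, as named in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical LAVE-style rating prompt (illustrative only).
prompt = (
    "Rate the correctness of the candidate answer.\n"
    "Question: What color is the bus?\n"
    "Reference answers: red, dark red\n"
    "Candidate answer: maroon\n"
    "Rating:"
)

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=False selects greedy decoding, making generation deterministic.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```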
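
For the closed-source path, setting the temperature to 0 in OpenAI's API approximates greedy decoding. A sketch using the current openai Python client follows; the paper likely used an earlier client version, and the gpt-3.5-turbo-0613 snapshot it names has since been deprecated, so this is illustrative rather than exactly reproducible:

```python
# Sketch: temperature=0 with GPT-3.5-Turbo, matching the deterministic
# generation choice in the Experiment Setup row. Uses the openai>=1.0 client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same hypothetical rating prompt as in the Flan-T5 sketch above.
prompt = (
    "Rate the correctness of the candidate answer.\n"
    "Question: What color is the bus?\n"
    "Reference answers: red, dark red\n"
    "Candidate answer: maroon\n"
    "Rating:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",  # model snapshot named in the paper
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # equivalent of greedy decoding in OpenAI's API
    max_tokens=10,
)
print(response.choices[0].message.content)
```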