Improving Automatic VQA Evaluation Using Large Language Models
Authors: Oscar Mañas, Benno Krojer, Aishwarya Agrawal
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of the proposed metric, we collect human judgments on the correctness of answers generated by several state-of-the-art VQA models across three popular VQA benchmarks. Our results demonstrate that LAVE correlates better with human judgment compared to existing metrics in diverse settings (Fig. 1). We also systematically categorize the failure modes of VQA Accuracy and show that LAVE is able to recover most missed correct candidate answers. In addition, we conduct ablation studies to assess the impact of each design choice on the performance of LAVE. |
| Researcher Affiliation | Academia | Oscar Mañas (1,2), Benno Krojer (1,3), Aishwarya Agrawal (1,2); 1: Mila, 2: Université de Montréal, 3: McGill University |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We plan to release the evaluation code and collected human judgments. |
| Open Datasets | Yes | We use these VQA models to generate answers for three VQA datasets: VQAv2 (Goyal et al. 2017), VG-QA (Krishna et al. 2017) and OK-VQA (Marino et al. 2019). |
| Dataset Splits | Yes | We additionally collected validation/development sets of human judgments for answers generated by BLIP-2 on VQAv2 and VG-QA, and BLIP on VQAv2 and VG-QA (1000 questions each). In total, our validation set contains 4k questions, which serve to guide our design choices. |
| Hardware Specification | No | The paper states: "We leverage the Hugging Face Transformers (Wolf et al. 2020) implementation of Flan-T5 and LLaMA (for Vicuna), and use GPT-3.5-Turbo through OpenAI's API." This describes the software and API used, but not the specific hardware (e.g., GPU models, CPU types) on which the computations were run or from which the API was accessed. |
| Software Dependencies | Yes | We consider Flan-T5-XXL and Vicuna-v1.3-13B as open-source LLMs, and GPT-3.5-Turbo (gpt-3.5-turbo-0613) as a closed-source LLM. We leverage the Hugging Face Transformers (Wolf et al. 2020) implementation of Flan-T5 and LLaMA (for Vicuna), and use GPT-3.5-Turbo through OpenAI's API. |
| Experiment Setup | Yes | We optimize our prompt for Flan-T5 (Sec. ) and subsequently use the same prompt with the other LLMs. To make generation deterministic, we perform greedy decoding, or equivalently set the temperature to 0 in OpenAI's API. (A minimal sketch of this decoding setup is given below the table.) |
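
The setup quoted in the Software Dependencies and Experiment Setup rows (greedy decoding with Flan-T5 via Hugging Face Transformers, and temperature-0 calls to GPT-3.5-Turbo through OpenAI's API) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released evaluation code: the checkpoint names and client calls are standard, but the `LAVE_PROMPT` template is a hypothetical placeholder, since the paper's actual prompt is not reproduced in this report.

```python
# Minimal sketch of the decoding setup described above (not the authors' code).
# Assumes transformers, torch, accelerate, and openai (>=1.0) are installed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from openai import OpenAI

# Hypothetical placeholder; the paper's real LAVE prompt is not shown in this report.
LAVE_PROMPT = (
    "Question: {question}\n"
    "Reference answers: {references}\n"
    "Candidate answer: {candidate}\n"
    "Rate the correctness of the candidate answer:"
)


def rate_with_flan_t5(question, references, candidate,
                      model_name="google/flan-t5-xxl"):
    """Deterministic rating with an open-source LLM: greedy decoding via Transformers."""
    # Flan-T5-XXL is ~11B parameters; a smaller checkpoint can be substituted for testing.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")
    prompt = LAVE_PROMPT.format(question=question,
                                references=", ".join(references),
                                candidate=candidate)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=False => greedy decoding, so the output is deterministic.
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def rate_with_gpt35(question, references, candidate,
                    model_name="gpt-3.5-turbo-0613"):
    """Deterministic rating with a closed-source LLM: temperature set to 0 in the API call."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = LAVE_PROMPT.format(question=question,
                                references=", ".join(references),
                                candidate=candidate)
    response = client.chat.completions.create(
        model=model_name,  # snapshot used in the paper; may be deprecated by OpenAI
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

Passing `do_sample=False` to `generate` and setting `temperature=0` in the API call are the two equivalent ways of making generation deterministic that the Experiment Setup row refers to.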