Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Authors: Yue Chang, Liqiang Jing, Xiaopeng Zhang, Yue Zhang

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We complete quantitative experiments on MMBench (Liu et al., 2023e), LLaVA-QA90 (Liu et al., 2023c), CHAIR (Rohrbach et al., 2018) and POPE (Li et al., 2023c) using three models: InstructBLIP (Dai et al., 2023), VisualGLM (Ding et al., 2021; Du et al., 2022) and LLaVA (Liu et al., 2023c), respectively, to test the effectiveness of our proposed method. In this section, we conduct extensive experiments to answer the following research questions: RQ1. Could our framework improve the current LVLMs? RQ2. What is the contribution of each component of our Dentist? RQ3. What is the intuitive performance of our Dentist?
Researcher Affiliation Academia Yue Chang, Liqiang Jing, Xiaopeng Zhang, Yue Zhang — The University of Texas at Dallas. EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Dentist
Input: Original question Q, original image I, original answer Ŷ, the large vision-language model M, the maximum iteration T
Output: Corrected answer Y
1: Yi ← {}
2: Yl ← Ŷ
3: for j in 1, 2, ..., T do
4:   Yt ← Verify(Q, I, Yl, M)
5:   if j = 1 then
6:     Yi ← Yt
7:   end if
8:   if Similar(Yl, Yt) = Yes then
9:     # No improvement in new round of verification
10:    return Yl
11:  else
12:    Yl ← Yt
13:  end if
14: end for
15: # Arrived at the maximum iteration, so return Yi
16: return Yi
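The verification loop above can be sketched in Python. Note the `verify` and `similar` callbacks are hypothetical placeholders standing in for the paper's GPT-aided verification and answer-similarity check; this is a minimal sketch of the control flow, not the authors' implementation:

```python
def dentist_correct(question, image, answer, model, max_iter, verify, similar):
    """Sketch of Algorithm 1 (Dentist): iteratively verify an LVLM answer.

    verify(question, image, answer, model) -> revised answer (assumed callback)
    similar(a, b) -> bool, True when two answers are equivalent (assumed callback)
    """
    y_initial = None   # answer from the first verification round (Yi)
    y_last = answer    # most recent answer (Yl), initialized to the raw answer
    for j in range(1, max_iter + 1):
        y_new = verify(question, image, y_last, model)
        if j == 1:
            y_initial = y_new
        if similar(y_last, y_new):
            # No improvement in the new round of verification: converged.
            return y_last
        y_last = y_new
    # Reached the maximum iteration, so fall back to the first-round answer.
    return y_initial
```

With a `verify` that keeps changing the answer, the loop exhausts its budget and returns the first-round answer; with a `verify` that converges, it returns the stable answer early.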
Open Source Code Yes As a byproduct, we released our code (https://github.com/CYandYue/Dentist).
Open Datasets Yes We complete quantitative experiments on MMBench (Liu et al., 2023e), LLaVA-QA90 (Liu et al., 2023c), CHAIR (Rohrbach et al., 2018) and POPE (Li et al., 2023c) using three models: InstructBLIP (Dai et al., 2023), VisualGLM (Ding et al., 2021; Du et al., 2022) and LLaVA (Liu et al., 2023c), respectively, to test the effectiveness of our proposed method. LLaVA-QA90 contains 90 questions and 30 images taken from COCO Val 2014 (Lin et al., 2014).
Dataset Splits Yes The dataset we use is MMBench-Test (EN). LLaVA-QA90 contains 90 questions and 30 images taken from COCO Val 2014 (Lin et al., 2014). In terms of sampling settings, we sample 100 images and construct 6 questions for each type of sampling setting for each image. Three kinds of sampling settings — random, popular, adversarial — are constructed on the dataset according to human annotation or automatic visual segmentation tools.
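The sampling figures quoted above imply a concrete evaluation size, which can be checked with one line of arithmetic (a sanity check derived here, not a number stated in the excerpt):

```python
# POPE-style sampling described above: 100 images, 6 questions per
# sampling setting per image, across 3 settings (random, popular, adversarial).
images = 100
questions_per_setting_per_image = 6
settings = 3
total_questions = images * questions_per_setting_per_image * settings  # 1800
```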
Hardware Specification No No specific hardware details (such as GPU models, CPU types, or memory) are mentioned for running the experiments. The paper refers to using GPT-3.5-turbo-0613 and GPT-4V via API, but not the hardware used for the local experimental setup.
Software Dependencies Yes We utilize GPT-3.5-turbo-0613 to assist in keyword extraction, sub-question generation, the verification loop, and verification answer integration. The cost of calling the gpt-3.5-turbo-0613 API when evaluating on MMBench: in one round of evaluation, the total cost was about $2.75, with an average cost of $0.0004 per question. A.9 Prompt for GPT-4V-aided evaluation and GPT-3.5-aided precision calculation.
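The quoted cost figures can be cross-checked against each other (an illustrative back-of-the-envelope calculation, not a number from the paper):

```python
# Quoted: one MMBench evaluation round cost about $2.75 total,
# at an average of $0.0004 per question.
total_cost_usd = 2.75
cost_per_question_usd = 0.0004
approx_questions = total_cost_usd / cost_per_question_usd  # ~6875 questions
```

This implies on the order of 6,900 questions per evaluation round, consistent with the scale of the MMBench test split.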
Experiment Setup Yes In the following experiments, we limit the maximum number of iterations to 3 (i.e., T in Algorithm 1 is equal to 3) to ensure the effectiveness of the verification and avoid excessive time costs. (1) In the first round of evaluation, we have the model generate raw predictions according to MMBench's evaluation rules and submit them to MMBench's official platform to obtain various accuracy rates; (2) In the second round of evaluation, based on the original prediction of the model, query classification, different verification processes, and answer integration are carried out using GPT-3.5-turbo (specific details can be found in Section 3).