Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Unified Hallucination Mitigation Framework for Large Vision-Language Models
Authors: Yue Chang, Liqiang Jing, Xiaopeng Zhang, Yue Zhang
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complete quantitative experiments on MMBench (Liu et al., 2023e), LLaVA-QA90 (Liu et al., 2023c), CHAIR (Rohrbach et al., 2018) and POPE (Li et al., 2023c) using three models: InstructBLIP (Dai et al., 2023), VisualGLM (Ding et al., 2021; Du et al., 2022) and LLaVA (Liu et al., 2023c), respectively, to test the effectiveness of our proposed method. In this section, we conduct extensive experiments to answer the following research questions: RQ1. Could our framework improve the current LVLMs? RQ2. What is the contribution of each component of our Dentist? RQ3. What is the intuitive performance of our Dentist? |
| Researcher Affiliation | Academia | Yue Chang, Liqiang Jing, Xiaopeng Zhang, Yue Zhang. The University of Texas at Dallas. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Dentist. Input: original question Q, original image I, original answer Ŷ, the large vision-language model M, the maximum iteration T. Output: corrected answer Y. 1: Yi ← {}; 2: Yl ← Ŷ; 3: for j in 1, 2, ..., T do; 4: Yt ← Verify(Q, I, Yl, M); 5: if j = 1 then; 6: Yi ← Yt; 7: end if; 8: if Similar(Yl, Yt) = Yes then; 9: # No improvement in new round of verification; 10: return Yl; 11: else; 12: Yl ← Yt; 13: end if; 14: end for; 15: # Arrived at the maximum iteration, so return Yi; 16: return Yi |
| Open Source Code | Yes | As a byproduct, we released our code¹. ¹https://github.com/CYandYue/Dentist |
| Open Datasets | Yes | We complete quantitative experiments on MMBench (Liu et al., 2023e), LLaVA-QA90 (Liu et al., 2023c), CHAIR (Rohrbach et al., 2018) and POPE (Li et al., 2023c) using three models: InstructBLIP (Dai et al., 2023), VisualGLM (Ding et al., 2021; Du et al., 2022) and LLaVA (Liu et al., 2023c), respectively, to test the effectiveness of our proposed method. LLaVA-QA90 contains 90 questions and 30 images taken from COCO Val 2014 (Lin et al., 2014). |
| Dataset Splits | Yes | The dataset we use is MMBench-Test (EN). LLaVA-QA90 contains 90 questions and 30 images taken from COCO Val 2014 (Lin et al., 2014). In terms of sampling settings, we sample 100 images and construct 6 questions for each type of sampling setting for each image. Three kinds of sampling settings, random, popular, and adversarial, are constructed on the dataset according to human annotation or automatic visual segmentation tools. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are mentioned for running their experiments. The paper refers to using GPT-3.5-turbo-0613 and GPT-4V via API, but not the local experimental setup hardware. |
| Software Dependencies | Yes | We utilize GPT-3.5-turbo-0613 to assist in keyword extraction, sub-question generation, the verification loop, and verification answer integration. The cost of calling the gpt-3.5-turbo-0613 API when evaluating on MMBench: in one round of evaluation, the total cost was about $2.75, with an average cost of $0.0004 per question. A.9 Prompt for GPT-4V-aided evaluation and GPT-3.5-aided precision calculation. |
| Experiment Setup | Yes | In the following experiments, we limit the maximum number of iterations to 3 (i.e., T in Algorithm 1 is equal to 3) to ensure the effectiveness of the verification and avoid excessive time costs. (1) In the first round of evaluation, we have the model generate raw predictions according to MMBench's evaluation rules and submit them to MMBench's official platform to obtain various accuracy rates; (2) In the second round of evaluation, based on the original prediction of the model, query classification, different verification processes and answer integration are carried out using GPT-3.5-turbo (specific details can be found in Section 3). |
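The verification loop quoted under "Pseudocode" above can be sketched in plain Python. This is a minimal illustration, not the authors' released code: `verify` and `similar` are hypothetical stand-ins for the paper's LVLM-based verification step and answer-similarity check, and the default `max_iters=3` mirrors the T = 3 setting quoted under "Experiment Setup".

```python
def dentist(question, image, answer, model, max_iters=3,
            verify=None, similar=None):
    """Sketch of Algorithm 1 (Dentist): iteratively verify an answer,
    stopping early when a verification round produces no change."""
    # Hypothetical defaults: `verify` queries the LVLM to re-check the
    # answer; `similar` decides whether two answers are effectively alike.
    if verify is None:
        verify = lambda q, i, a, m: m(q, i, a)
    if similar is None:
        similar = lambda a, b: a == b

    first_verified = None          # Yi: result of the first verification
    last = answer                  # Yl: the current best answer
    for j in range(1, max_iters + 1):
        current = verify(question, image, last, model)   # Yt
        if j == 1:
            first_verified = current
        if similar(last, current):
            # No improvement in this round of verification: return Yl.
            return last
        last = current
    # Reached the maximum iteration, so return the first verified answer Yi.
    return first_verified
```

With a toy "model" that uppercases its input, the loop stabilizes after one round (the second verification changes nothing) and returns the verified answer; with a model that always alters the answer, the loop exhausts T and falls back to the first verification.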