Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

Authors: Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs' generations. We release our code at https://github.com/Linxi-ZHAO/MARINE.
Researcher Affiliation Academia 1Department of Computer Science, Cornell University, Ithaca, NY, USA 2Department of Computer Science, University of California, Los Angeles, CA, USA 3School of Data Science and Society, UNC, Chapel Hill, NC, USA. Correspondence to: Quanquan Gu <EMAIL>.
Pseudocode Yes Algorithm 1: Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE)
1: Input: LLM parameters θ, input prompt x, visual tokens v from the LVLM's original vision tower
2: Input: auxiliary visual tokens {c_i}_{i=1}^M from M image-grounding models, guidance scale γ
3: Initialize empty output y = [].
4: Aggregate visual information as a textual prompt c = Aggr({c_i}_{i=1}^M).
5: for t = 0, 1, ..., T do
6:   Construct unconditional input x^(t)_uncond = [v, x, y_<t].
7:   Generate unconditional output logits using the LLM: ℓ^(t)_uncond = log p_θ(x^(t)_uncond).
8:   Construct conditional input x^(t)_cond = [v, c, x, y_<t].
9:   Generate conditional output logits using the LLM: ℓ^(t)_cond = log p_θ(x^(t)_cond).
10:  Update output logits: ℓ^(t) = γ ℓ^(t)_cond + (1 − γ) ℓ^(t)_uncond.
11:  Sample token y_t from the logits ℓ^(t).
12:  Let y = [y, y_t].
13: end for
14: Output: y.
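The core guided-decoding step (the logit update in Algorithm 1, line 10, followed by greedy sampling) can be sketched in plain Python. This is an illustrative sketch only: the function names and list-based logits are our assumptions, not the released MARINE implementation, which operates on model tensors.

```python
def marine_guidance_step(logits_uncond, logits_cond, gamma):
    """Blend grounding-conditioned and unconditional logits:
    l = gamma * l_cond + (1 - gamma) * l_uncond (Algorithm 1, line 10)."""
    return [gamma * lc + (1 - gamma) * lu
            for lu, lc in zip(logits_uncond, logits_cond)]


def greedy_token(logits):
    """Greedy sampling: return the id of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy example with a 3-token vocabulary and the paper's guidance scale 0.7:
# the grounded (conditional) logits favour token 2, and the blend keeps
# that preference, steering generation toward image-grounded tokens.
blended = marine_guidance_step([1.0, 2.0, 0.0], [0.0, 1.0, 3.0], gamma=0.7)
next_token = greedy_token(blended)  # next_token == 2
```

With γ = 0 the blend reduces to the unconditional logits, and with γ = 1 to the fully grounded logits; the paper fixes γ = 0.7 across tasks.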
Open Source Code Yes We release our code at https://github.com/Linxi-ZHAO/MARINE.
Open Datasets Yes Empirical evaluations are conducted on five widely-recognized LVLMs across benchmarks including MSCOCO (Lin et al., 2014), the LLaVA-QA90 task (Liu et al., 2023d), A-OKVQA (Schwenk et al., 2022), and GQA (Hudson & Manning, 2019).
Dataset Splits Yes Consistent with Li et al. (2023b), we randomly sampled a subset of 500 images from the MSCOCO (Lin et al., 2014) dataset for CHAIR evaluation. For the POPE evaluation, we created 3000 questions across three datasets, with 500 images each from MSCOCO, A-OKVQA (Schwenk et al., 2022), and GQA (Hudson & Manning, 2019). For the GPT-4V-aided evaluation, we utilized 90 questions from the LLaVA-QA90 task and randomly selected 50 MSCOCO images for the image captioning task.
Hardware Specification Yes We conduct all of the experiments using 8 A6000 GPUs, each with 48GB of memory. Each single experiment can be run on a single A6000 GPU.
Software Dependencies No The paper mentions various LVLM architectures (LLaVA, MiniGPT-v2, mPLUG-Owl2, InstructBLIP) and their underlying components (CLIP-L, LLaMA-2-7B-Chat, Vicuna-v1.5-7B, EVA-G, BLIP-2) with citations, but does not provide specific version numbers for software libraries or programming languages used for implementation. It mentions the 'NLTK package' in an ablation study but without a version number.
Experiment Setup Yes Hyperparameter setting. The hyperparameters for our method are fixed across tasks, with key settings including a guidance strength of 0.7, score threshold for DETR at 0.95, a detection threshold for RAM++ of 0.68, and a greedy sampling approach with a random seed of 242.
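The fixed settings reported above can be collected into a small configuration sketch. The key names here are hypothetical (not taken from the MARINE codebase); only the values come from the paper's stated setup.

```python
# Hypothetical configuration summarizing the paper's fixed hyperparameters.
# Key names are our own; values are as reported in the experiment setup.
MARINE_CONFIG = {
    "guidance_scale": 0.7,              # gamma in Algorithm 1
    "detr_score_threshold": 0.95,       # score threshold for DETR detections
    "rampp_detection_threshold": 0.68,  # detection threshold for RAM++
    "decoding": "greedy",               # greedy sampling approach
    "random_seed": 242,
}
```

Keeping these values fixed across tasks, as the paper does, avoids per-benchmark tuning when comparing against baselines.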