Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

Authors: Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs' generations. We release our code at https://github.com/Linxi-ZHAO/MARINE.
Researcher Affiliation Academia 1Department of Computer Science, Cornell University, Ithaca, NY, USA 2Department of Computer Science, University of California, Los Angeles, CA, USA 3School of Data Science and Society, UNC, Chapel Hill, NC, USA. Correspondence to: Quanquan Gu <EMAIL>.
Pseudocode Yes Algorithm 1: Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE)
1: Input: LLM parameters θ, input prompt x, visual tokens v from the LVLM's original vision tower
2: Input: auxiliary visual tokens {c_i}_{i=1}^M from M image-grounding models, guidance scale γ
3: Initialize empty output y = [].
4: Aggregate visual information as a textual prompt c = Aggr({c_i}_{i=1}^M).
5: for t = 0, 1, ..., T do
6:   Construct unconditional input x^(t)_uncond = [v, x, y_<t].
7:   Generate unconditional output logits using the LLM: ℓ^(t)_uncond = log p_θ(x^(t)_uncond).
8:   Construct conditional input x^(t)_cond = [v, c, x, y_<t].
9:   Generate conditional output logits using the LLM: ℓ^(t)_cond = log p_θ(x^(t)_cond).
10:  Update output logits: ℓ^(t) = γ ℓ^(t)_cond + (1 − γ) ℓ^(t)_uncond.
11:  Sample token y_t from the logits ℓ^(t).
12:  Let y = [y, y_t].
13: end for
14: Output: y.
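The core guided-decoding step (the logit update in Algorithm 1, line 10, followed by greedy sampling) can be sketched in plain Python. This is an illustrative sketch only: the function names and list-based logits are our assumptions, not the released MARINE implementation, which operates on model tensors.

```python
def marine_guidance_step(logits_uncond, logits_cond, gamma):
    """Blend grounding-conditioned and unconditional logits:
    l = gamma * l_cond + (1 - gamma) * l_uncond (Algorithm 1, line 10)."""
    return [gamma * lc + (1 - gamma) * lu
            for lu, lc in zip(logits_uncond, logits_cond)]


def greedy_token(logits):
    """Greedy sampling: return the id of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy example with a 3-token vocabulary and the paper's guidance scale 0.7:
# the grounded (conditional) logits favour token 2, and the blend keeps
# that preference, steering generation toward image-grounded tokens.
blended = marine_guidance_step([1.0, 2.0, 0.0], [0.0, 1.0, 3.0], gamma=0.7)
next_token = greedy_token(blended)  # next_token == 2
```

With γ = 0 the blend reduces to the unconditional logits, and with γ = 1 to the fully grounded logits; the paper fixes γ = 0.7 across tasks.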
Open Source Code Yes We release our code at https://github.com/Linxi-ZHAO/MARINE.
Open Datasets Yes Empirical evaluations are conducted on five widely-recognized LVLMs across benchmarks including MSCOCO (Lin et al., 2014), the LLaVA-QA90 task (Liu et al., 2023d), A-OKVQA (Schwenk et al., 2022), and GQA (Hudson & Manning, 2019).
Dataset Splits Yes Consistent with Li et al. (2023b), we randomly sampled a subset of 500 images from the MSCOCO (Lin et al., 2014) dataset for CHAIR evaluation. For the POPE evaluation, we created 3000 questions across three datasets, with 500 images each from MSCOCO, A-OKVQA (Schwenk et al., 2022), and GQA (Hudson & Manning, 2019). For the GPT-4V-aided evaluation, we utilized 90 questions from the LLaVA-QA90 task and randomly selected 50 MSCOCO images for the image captioning task.
Hardware Specification Yes We conduct all of the experiments using 8 A6000 GPUs, each with 48GB of memory. Each single experiment can be run on a single A6000 GPU.
Software Dependencies No The paper mentions various LVLM architectures (LLaVA, MiniGPT-v2, mPLUG-Owl2, InstructBLIP) and their underlying components (CLIP-L, LLaMA-2-7B-Chat, Vicuna-v1.5-7B, EVA-G, BLIP-2) with citations, but does not provide specific version numbers for software libraries or programming languages used for implementation. It mentions the 'NLTK package' in an ablation study but without a version number.
Experiment Setup Yes Hyperparameter setting. The hyperparameters for our method are fixed across tasks, with key settings including a guidance strength of 0.7, score threshold for DETR at 0.95, a detection threshold for RAM++ of 0.68, and a greedy sampling approach with a random seed of 242.
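The fixed settings reported above can be collected into a small configuration sketch. The key names here are hypothetical (not taken from the MARINE codebase); only the values come from the paper's stated setup.

```python
# Hypothetical configuration summarizing the paper's fixed hyperparameters.
# Key names are our own; values are as reported in the experiment setup.
MARINE_CONFIG = {
    "guidance_scale": 0.7,              # gamma in Algorithm 1
    "detr_score_threshold": 0.95,       # score threshold for DETR detections
    "rampp_detection_threshold": 0.68,  # detection threshold for RAM++
    "decoding": "greedy",               # greedy sampling approach
    "random_seed": 242,
}
```

Keeping these values fixed across tasks, as the paper does, avoids per-benchmark tuning when comparing against baselines.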