Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Authors: Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types. ... 5 Experiments
Researcher Affiliation	Academia	(1) School of Computer Science and Engineering, Sun Yat-sen University; (2) Shanghai Tech University
Pseudocode	Yes	Algorithm 1: Get Scores Set for Short Answering Questions ... Algorithm 2: Get Scores Set for Open-ended Questions
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Code will be released after acceptance.
Open Datasets	Yes	POPE [29], the Polling-based Object Probing Evaluation (POPE) benchmark is a widely utilized benchmark for evaluating object hallucination in LVLMs. Built upon three established datasets, COCO [33], A-OKVQA [46], and GQA [22] ... CHAIR [45], the Caption Hallucination Assessment with Image Relevance is another widely used metric for evaluating hallucinations in LVLMs
Dataset Splits	Yes	POPE samples 500 images from each dataset and selects three ground-truth objects per image as positive instances. Correspondingly, three absent objects are sampled per image using one of three strategies: random, popular, and adversarial sampling. Each object is then queried in the binary format: Is there a(n) {object} in the image? This yields 3,000 questions per dataset, with a balanced distribution of positive and negative instances.
Hardware Specification	No	No specific hardware details (like GPU or CPU models, or memory) are provided in the paper for the experiments.
Software Dependencies	No	The paper does not specify versions for any key software components or libraries used in the experiments.
Experiment Setup	Yes	Unless otherwise specified, we set the scaling factors to γ+ = 2.0 and γ− = 0.0. For short-answering tasks, we set ξ = 20 and ζ = 10, while for open-ended tasks, we set ξ = 40 and ζ = 50. For MME, we employ the same attention heads identified from POPE. See Appendix A for more details.
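
The Dataset Splits entry above describes how POPE arrives at 3,000 questions per dataset: 500 images, three present objects and three absent objects per image, each asked in a fixed yes/no template. The following is a minimal sketch of that arithmetic on synthetic data, not the POPE authors' code. The function name build_pope_questions, the input format (a dict from image id to a set of object names), and the toy vocabulary are all assumptions made for illustration, and only the "random" negative-sampling strategy is modeled; the "popular" and "adversarial" strategies need frequency and co-occurrence statistics that are not shown here.

```python
import random

# Binary question template quoted in the Dataset Splits entry.
QUESTION_TEMPLATE = "Is there a(n) {object} in the image?"


def build_pope_questions(images, vocabulary, per_image=3, seed=0):
    """Build balanced yes/no questions (hypothetical helper, not from POPE).

    images: dict mapping an image id to the set of objects present in it.
    vocabulary: iterable of all candidate object names.
    """
    rng = random.Random(seed)
    questions = []
    for image_id, present in images.items():
        # Three ground-truth objects become positive ("yes") instances.
        positives = rng.sample(sorted(present), per_image)
        # Three absent objects become negative ("no") instances.
        # Assumption: uniform sampling, i.e. the "random" strategy only.
        absent_pool = sorted(set(vocabulary) - present)
        negatives = rng.sample(absent_pool, per_image)
        for obj in positives:
            questions.append((image_id, QUESTION_TEMPLATE.format(object=obj), "yes"))
        for obj in negatives:
            questions.append((image_id, QUESTION_TEMPLATE.format(object=obj), "no"))
    return questions


# Synthetic stand-in for one source dataset: 500 images, 5 objects each.
vocab = [f"object_{i}" for i in range(20)]
data_rng = random.Random(1)
images = {f"img_{i}": set(data_rng.sample(vocab, 5)) for i in range(500)}

questions = build_pope_questions(images, vocab)
num_yes = sum(1 for _, _, answer in questions if answer == "yes")
print(len(questions), num_yes)  # 500 * (3 + 3) = 3000 questions, 1500 of them "yes"
```

Because every image contributes the same number of positive and negative questions, the yes/no balance holds by construction regardless of which negative-sampling strategy is substituted.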