Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Authors: Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types. ... 5 Experiments
Researcher Affiliation | Academia | ¹School of Computer Science and Engineering, Sun Yat-sen University; ²Shanghai Tech University
Pseudocode | Yes | Algorithm 1: Get Scores Set for Short Answering Questions ... Algorithm 2: Get Scores Set for Open-ended Questions
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Code will be released after acceptance.
Open Datasets | Yes | POPE [29], the Polling-based Object Probing Evaluation (POPE) benchmark is a widely utilized benchmark for evaluating object hallucination in LVLMs. Built upon three established datasets, COCO [33], A-OKVQA [46], and GQA [22] ... CHAIR [45], the Caption Hallucination Assessment with Image Relevance is another widely used metric for evaluating hallucinations in LVLMs
Dataset Splits | Yes | POPE samples 500 images from each dataset and selects three ground-truth objects per image as positive instances. Correspondingly, three absent objects are sampled per image using one of three strategies: random, popular, and adversarial sampling. Each object is then queried in the binary format: Is there a(n) {object} in the image? This yields 3,000 questions per dataset, with a balanced distribution of positive and negative instances.
Hardware Specification | No | No specific hardware details (like GPU or CPU models, or memory) are provided in the paper for the experiments.
Software Dependencies | No | The paper does not specify versions for any key software components or libraries used in the experiments.
Experiment Setup | Yes | Unless otherwise specified, we set the scaling factors to γ+ = 2.0 and γ = 0.0. For short-answering tasks, we set ξ = 20 and ζ = 10, while for open-ended tasks, we set ξ = 40 and ζ = 50. For MME, we employ the same attention heads identified from POPE. See Appendix A for more details.
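
The POPE construction quoted under Dataset Splits (500 images, three present and three absent objects per image, one yes/no question per object) can be sketched as below to check the 3,000-question count and the positive/negative balance. This is only an illustration under stated assumptions, not the benchmark's actual code: the function name `build_pope_questions`, the input format, and the toy vocabulary are hypothetical, and only the random sampling strategy is shown (the popular and adversarial strategies need co-occurrence statistics not given here).

```python
import random

# Binary question format quoted in the Dataset Splits entry.
QUESTION_TEMPLATE = "Is there a(n) {object} in the image?"


def build_pope_questions(image_objects, vocabulary, per_image=3, seed=0):
    """Build balanced yes/no questions with random negative sampling.

    image_objects: dict mapping image id -> set of ground-truth objects
                   (hypothetical input format, assumed for this sketch).
    vocabulary:    list of all candidate object names (also assumed).
    """
    rng = random.Random(seed)
    questions = []
    for image_id, present in sorted(image_objects.items()):
        # Three ground-truth objects become positive ("yes") instances.
        positives = rng.sample(sorted(present), per_image)
        # Three objects absent from the image become negative ("no") instances.
        absent = sorted(set(vocabulary) - set(present))
        negatives = rng.sample(absent, per_image)
        for obj in positives:
            questions.append((image_id, QUESTION_TEMPLATE.format(object=obj), "yes"))
        for obj in negatives:
            questions.append((image_id, QUESTION_TEMPLATE.format(object=obj), "no"))
    return questions


# Toy stand-in data: 500 images, each containing the same 5 of 20 objects.
vocabulary = [f"object_{i}" for i in range(20)]
image_objects = {i: set(vocabulary[:5]) for i in range(500)}

questions = build_pope_questions(image_objects, vocabulary)
num_yes = sum(1 for _, _, answer in questions if answer == "yes")
num_no = len(questions) - num_yes
print(len(questions), num_yes, num_no)
```

With 500 images and 3 + 3 objects per image, the count works out to 500 × 6 = 3,000 questions, split evenly between "yes" and "no" answers, which matches the figures quoted above.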