Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Authors: Seongheon Park, Sharon Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSIM achieves superior detection performance, outperforming competitive baselines by a significant margin. We extensively evaluate GLSIM across multiple benchmark datasets and LVLMs, including LLaVA-1.5 [1], MiniGPT-4 [3], and Shikra [31], demonstrating strong generalization and state-of-the-art performance in detecting object hallucinations.
Researcher Affiliation	Academia	Seongheon Park Sharon Li Department of Computer Sciences University of Wisconsin-Madison EMAIL
Pseudocode	No	The paper describes its methodology using prose and mathematical equations (e.g., Equations 1-5) and provides detailed descriptions of the components. However, it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block or figure.
Open Source Code	Yes	Code is available at https://github.com/deeplearning-wisc/glsim
Open Datasets	Yes	We utilize the MSCOCO dataset [43], which is widely adopted as the primary evaluation benchmark in numerous LVLM object hallucination studies and contains 80 object classes. In addition, we employ the Objects365 dataset [44], which offers a more diverse set of images and a larger category set comprising 365 object classes, along with denser object annotations per image.
Dataset Splits	Yes	For evaluation, we randomly sample 5,000 images each from the validation sets of MSCOCO and Objects365.
Hardware Specification	Yes	All experiments are conducted using Python 3.11.11 and PyTorch 2.6.0 [52], on a single NVIDIA A6000 GPU with 48GB of memory.
Software Dependencies	Yes	All experiments are conducted using Python 3.11.11 and PyTorch 2.6.0 [52].
Experiment Setup	Yes	The layer indices (l, l′), the number of selected patches K, and the weighting parameter w used for computing the final score are selected based on a separate validation set, as detailed in Table 5. For multi-token objects, we use the first token to compute the scores and consider the first occurrence of each object for hallucination detection. The total number of generated objects is shown in Table 6. For all experiments, we report the average over three different random seeds. Hyperparameters per model (layer indices, K, w): LLaVA-1.5-7b: (32, 31), 32, 0.6; LLaVA-1.5-13b: (40, 38), 32, 0.6; MiniGPT-4: (32, 30), 4, 0.5; Shikra: (30, 27), 16, 0.6.
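The Dataset Splits and Experiment Setup rows describe a protocol that can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The hyperparameter values are transcribed from the table row above. The dictionary layout, the helper names `sample_eval_ids` and `average_over_seeds`, and the seed values (0, 1, 2) are assumptions made for the example. The source does not state which seeds were used or whether the 5,000-image sample is redrawn per seed.

```python
import random
from statistics import mean

# Values transcribed from the Experiment Setup row.
# Key names and layout are illustrative, not taken from the GLSim repository.
HYPERPARAMS = {
    "LLaVA-1.5-7b": {"layers": (32, 31), "K": 32, "w": 0.6},
    "LLaVA-1.5-13b": {"layers": (40, 38), "K": 32, "w": 0.6},
    "MiniGPT-4": {"layers": (32, 30), "K": 4, "w": 0.5},
    "Shikra": {"layers": (30, 27), "K": 16, "w": 0.6},
}


def sample_eval_ids(image_ids, n=5000, seed=0):
    """Draw n distinct image IDs reproducibly (hypothetical helper).

    Sorting first makes the result independent of the input order,
    so the same seed always yields the same subset.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(image_ids), n)


def average_over_seeds(run_fn, seeds=(0, 1, 2)):
    """Average a scalar metric over several seeds.

    run_fn is any callable taking a seed and returning a float;
    the seed values here are assumed, not reported in the source.
    """
    return mean(run_fn(seed) for seed in seeds)
```

A caller would pass the validation-set image IDs of MSCOCO or Objects365 to `sample_eval_ids`, and wrap one full evaluation run in a function of the seed for `average_over_seeds`.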