Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Discovering Compositional Hallucinations in LVLMs

Authors: Sibei Yang, Ge Zheng, Jiajin Tang, Jiaye Qian, Hanzhuo Huang, Cheng Shi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through a preliminary analysis, we present two key findings: (1) visual abstraction fails under compositional questioning, and (2) visual inputs induce degradation in language processing, leading to hallucinations. To facilitate future research on this phenomenon, we introduce a custom benchmark, SCBench, and propose a novel VLR-distillation method, which serves as the first baseline to effectively mitigate SCHall. Furthermore, experiment results on publicly available benchmarks, including both hallucination-specific and general-purpose ones, demonstrate the effectiveness of our VLR-distillation method.
Researcher Affiliation	Academia	(1) School of Computer Science and Engineering, Sun Yat-sen University; (2) ShanghaiTech University; (3) School of Computing and Data Science, The University of Hong Kong
Pseudocode	No	The paper describes the VLR-Distillation method using descriptive text and mathematical equations in Section 4, but does not include a clearly labeled pseudocode block or algorithm.
Open Source Code	No	Answer: [No] Justification: Code will be released after acceptance.
Open Datasets	Yes	We collect images from established datasets, including MMBench [41], MME [12] and SEEDBench [31], as well as from various online sources. ... For inference, we first conduct experiments on our proposed SCBench, comparing our methods with popular hallucination mitigating methods, to demonstrate the effectiveness of our proposed VLR-distillation. Additionally, we report results on popular hallucination benchmarks including POPE [33], MME-hall [12] and general-purpose VQA benchmarks encompassing ScienceQA [43], MMBench [41], HallusionBench [14] and MM-Vet [71].
Dataset Splits	No	The paper mentions using "a subset of the training data from the instruction tuning (IT) phase of these models" and then evaluates on "SCBench" and other "popular hallucination benchmarks" and "general-purpose VQA benchmarks." However, it does not provide explicit percentages, sample counts, or specific predefined splits for training, validation, or testing within the provided text.
Hardware Specification	Yes	All experiments are conducted for a single epoch, utilizing the Adam optimizer on 8 A100 GPUs.
Software Dependencies	No	The paper mentions "Adam optimizer" and training with "LoRA [19]" but does not specify version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	For training, we have two training phases: pretraining stage for VLRs and distillation learning... During pretraining, we use a batch size of 128, freezing all other parts of the model and training only the VLRs. In the distillation learning phase, we employ a batch size of 64 with 2 accumulation steps, freezing the pretrained VLRs and training the LoRA [19] of the language model. For each baseline, we set the number of VLRs N to 4. All experiments are conducted for a single epoch, utilizing the Adam optimizer on 8 A100 GPUs. ... Table 10: Hyperparameters for our VLR-distillation methods. a1, a2 and a3 are the coefficients for Lreg, L reg and LKL, respectively. batch size: 128; lr: 2e-4, 1e-5, 3e-5; lr schedule: Cosine Decay; lr warmup ratio: 0.03, 0.01, 0.05; weight decay: 0, 0.05, 0.05; epoch: 1; optimizer: AdamW; DeepSpeed stage: 3, 3, /